
How do you block the SEO spam bots?

I run an independent search engine as a bootstrapped side-project, and unfortunately almost all traffic (between 99.976% and 99.997%) is coming from SEO spam bots. This is negatively impacting a number of things, not least server utilisation and running costs.

I've been trying to block the requests at my nginx reverse proxy level, but some requests are now starting to bypass the measures I've put in place. I know there is Cloudflare, but ironically my search spider is currently blocked by Cloudflare, so I don't want to go that route until they've unblocked my own spider.

Has anyone else had similar issues, and if so how have you dealt with it?

Posted to Developers on May 11, 2022
  1. 2

    It's hard to block SEO spam bots. The best way to do it is by using a combination of different techniques, including:

    1. Make sure your website is up-to-date and secure, so that it's not vulnerable to attacks from bots or hackers.

    2. If you have a blog on your site, make sure it has an RSS feed, so that people can subscribe directly from their feed readers instead of having to visit your site directly every time they want to read something new from you. This also means that when someone subscribes to your blog via an RSS feed, they don't have access to any other part of your site—just the blog posts themselves—which reduces the chance that they'll get infected by malware or other viruses while browsing around in search of information about what kind of content might be relevant for them right now (like how many times someone searched for 'seo spam bots' last week).

    3. Use Google Analytics along with some other analytics tools like HotJar or Clicky Premium so that you can see where people are coming from when they come across your content online and what kinds of things they're looking for when they're trying out new websites for the first time.

    1. 1

      Many thanks for your comments. I actually have a fairly specific issue, which I should probably have made clear in the original post.

      My search engine appears to have been listed as a search engine in an SEO tool like ScrapeBox, GSA SEO or SEnuke, and now the spammers are using "scraping footprints" and proxy farms to search my search engine for lists of URLs to target. I'm now getting over 160,000 of these "scraping footprints" style searches per day. Now the big search engines can handle that sort of traffic no problem, but for a bootstrapped search engine like this it is too much, so I'm trying to block searches by the spam bots but not block searches by real users.
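
One way to picture the "block the bots, not the real users" goal above is a query-level filter. The following is only a minimal sketch, assuming the spam searches share recognisable textual fragments; the patterns and the `looks_like_footprint` helper are hypothetical illustrations, not the poster's actual rules, and a real list would have to be derived from the search logs.

```python
import re

# Hypothetical footprint fragments, for illustration only. A real list would
# be built from the spam queries actually observed in the search logs.
FOOTPRINT_PATTERNS = [
    re.compile(r'\binurl:', re.IGNORECASE),
    re.compile(r'\bintitle:', re.IGNORECASE),
    re.compile(r'"powered by [^"]+"', re.IGNORECASE),
    re.compile(r'"leave a (comment|reply)"', re.IGNORECASE),
]


def looks_like_footprint(query: str) -> bool:
    """Return True if the query matches any of the footprint patterns."""
    return any(pattern.search(query) for pattern in FOOTPRINT_PATTERNS)


if __name__ == "__main__":
    for q in ['"powered by wordpress" gardening', "independent search engine"]:
        print(q, "->", "block" if looks_like_footprint(q) else "allow")
```

A filter like this is cheap enough to run before the search itself, but it can also reject legitimate queries that happen to use the same operators, so the pattern list would need tuning against real traffic.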

      Your mention of analytics made me wonder how my analytics solution (Plausible) is successfully filtering them out. It seems that they only count visitors that run JavaScript, which they say is "a decent proxy for 'this is probably a real human using a web browser'". So that could be one simple option to explore - only return search results if JavaScript is enabled. Although as per other comments, I'd prefer to block the requests earlier than that to try to keep the server load manageable.
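
The "only return search results if JavaScript is enabled" idea could be sketched roughly as follows. This is an assumption about one possible design, not how Plausible or the poster's site works: the search page would embed a short-lived signed token, a small script would attach it to the search request, and the server would check it before running the search. `SECRET_KEY`, `TOKEN_TTL_SECONDS`, `issue_token` and `token_is_valid` are all hypothetical names.

```python
import hashlib
import hmac
import time

# Hypothetical values for illustration; a real secret would be random and
# kept out of source control.
SECRET_KEY = b"replace-with-a-random-secret"
TOKEN_TTL_SECONDS = 300


def _sign(timestamp: str) -> str:
    """HMAC-SHA256 signature of the timestamp, as a hex string."""
    return hmac.new(SECRET_KEY, timestamp.encode(), hashlib.sha256).hexdigest()


def issue_token(now=None) -> str:
    """Create a 'timestamp.signature' token to embed in the search page."""
    timestamp = str(int(time.time() if now is None else now))
    return f"{timestamp}.{_sign(timestamp)}"


def token_is_valid(token: str, now=None) -> bool:
    """Check the signature and that the token is not older than the TTL."""
    current = time.time() if now is None else now
    timestamp, sep, signature = token.partition(".")
    if not sep or not (timestamp.isascii() and timestamp.isdigit()):
        return False
    if not hmac.compare_digest(signature.encode(), _sign(timestamp).encode()):
        return False
    return 0 <= current - int(timestamp) <= TOKEN_TTL_SECONDS
```

A bot that parses the page and replays the token without running any script would still get through, so this only raises the bar, and as noted above it does nothing to stop the request from reaching the server in the first place.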

  2. 2

    What do you think of reCAPTCHA v2 (the one that verifies if an interaction is legitimate with the “I am not a robot” checkbox)?

    If a user is interested in a website, they will accept the small inconvenience of having to click on it.

    1. 2

      Thanks for your suggestion.

      Cons:

      • If I were to need the Enterprise edition it would cost - at the moment I'm getting around 80,000 searches a day from the SEO spam bots, so I'd significantly exceed the 1,000,000 calls per month of the free tier.
      • It does catch the bots rather late in the day - I'll already have burned up more server resources getting to that point - at the moment I'm trying to block them at the reverse proxy level to try to minimise costs.
      • I've gone for a super simple design - the home page is basically just a simple search box (a bit like the Google search page) - and I think I'd need a big reCAPTCHA logo next to the search box if I were to use it.

      Pros:

      • There is an "invisible" mode which only pops up the captcha if the client is suspicious. I didn't know that until reading the docs just now, so I've learned something new - thanks for that.
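
The free-tier point in the cons above can be checked with some quick arithmetic. This is a rough sketch that assumes one verification call per search and a 30-day month; the two figures come from the comment, while the function names are just for illustration.

```python
FREE_TIER_CALLS_PER_MONTH = 1_000_000  # free-tier limit quoted in the comment
SPAM_SEARCHES_PER_DAY = 80_000         # bot searches per day quoted in the comment


def monthly_calls(searches_per_day: int, days_in_month: int = 30) -> int:
    """Estimated calls per month, assuming one call per search."""
    return searches_per_day * days_in_month


def days_until_exhausted(searches_per_day: int) -> float:
    """Days of traffic it takes to use up the monthly free tier."""
    return FREE_TIER_CALLS_PER_MONTH / searches_per_day


if __name__ == "__main__":
    print(monthly_calls(SPAM_SEARCHES_PER_DAY))         # 2400000
    print(days_until_exhausted(SPAM_SEARCHES_PER_DAY))  # 12.5
```

On those assumptions, the bot traffic alone would come to about 2.4 times the free allowance, and would use it up in under two weeks.
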
      1. 1

        You should check out hCaptcha, where they actually pay the website owner when people solve it. It's a small amount, but at least you don't have to pay.
        As a user, I find it less annoying than reCAPTCHA.

        1. 1

          Thanks. hCaptcha looks good, although the "no CAPTCHA" mode is only available in the paid-for Enterprise Mode, and given I'm now on over 160,000 searches a day by the spam bots, it is likely to be way too expensive for a bootstrapped free-to-use search engine.

      2. 1

        Interesting.

        It's easy to fall into the "free tier trap": implementing a solution only to realize afterward that you are going to exceed the limit and go bankrupt.

        I have tried the invisible mode on a Drupal website, and it has never worked against spam bots.

  3. 1

    I'm glad you asked!

    SEO spam bots can be a real problem for businesses, especially as we see more and more companies struggle to adopt new SEO practices that are better aligned with consumer behavior.

    One thing you can do is use the Disavow Tool from Google. This tool allows you to submit a list of URLs that you don't want Google to index, and they will then treat them as if they don't exist. It's a great way to get rid of low-quality sites from your site's index so that they don't show up in search results.

    1. 1

      Thanks. The issue is that my site is a search engine and the problem traffic is the spammers using it to perform searches for URLs to target, in the same way they search Google for URLs to target. All the sites I have in my search index are ones I want to keep, and I want to allow searching since that is the point of a search engine - I just want to block searching by the spam bots.

  4. 1

    I've been working with SEO spam bots for years now, and I can say that they're actually pretty easy to block.

    The first thing you have to do is have a clear understanding of what the bots know and don't know, so you can use that knowledge to your advantage. For instance, they know how many links you have pointing at your site, but they don't know what those links are or where they lead. They also know how many times your site has been shared on social media sites like Facebook and Twitter, but they don't actually see the content of those shares.

    With this information in hand, you can begin blocking them by creating fake content that looks real enough for them to believe it's part of your website but doesn't really exist in any meaningful way. This will cause them to waste time trying to access it without realizing that it's not actually there and will keep them from accessing actual content on your site as well.

  5. 1

    I can confidently say that the best way to block SEO spam bots is by using a combination of methods.

    First, you should use a CAPTCHA for your contact form or any other area where users are required to enter information. This prevents bots from collecting data on your site.

    Second, you should use the Noindex tag on pages that aren't meant to be indexed by search engines. This helps ensure that only pages with content that's actually relevant to visitors will appear in search results.

    Finally, make sure that your website has a responsive design so it displays correctly on all devices. This will help you avoid getting penalized by Google for having mobile-unfriendly pages.
