Ideas and Validation February 13, 2020

Bot, or not?

Alister Bulman @alister

One of the frequent banes of my life as a sysadmin and developer who runs his own servers is all the bots that crawl and outright attack them. I wondered: who else finds it takes time and effort to avoid comment-form spam, and all the other work involved in keeping things clean and working?

So, if you do run servers, I'd like your thoughts on this (currently skeleton) idea. I'll be filling in more details over the next few days about what it does, and how.

https://BotRegistry.github.io

Alister

  1. 1

    I've thought about various forms of this (identifying spam or anything malicious; I come from an insurance compliance background). The challenge I always run into is that as soon as you offer a filter as a service, with incoming data/definitions, it becomes possible for a malicious user to abuse that service.
    Worst-case scenario: you become massively popular and everyone relies on your service to identify spammers. Malicious users notice this and send a bunch of false reports that XYZ is a spammer. Now XYZ has been blocked by outside, malicious entities.

    As @ehacke said, I think this is valuable but challenging.

    1. 2

      The original tweet that gave me the idea was about a reporting endpoint for problems, but I saw the issue you point out with false or malicious reports - so I've instead been spending my time thinking about how I would independently decide on a bot/IP's 'goodness'.

      For example - the Googlebot web crawler would (usually) be good and wanted, but if someone faked that user agent (utterly trivial to do), then it's an immediate black mark against that visitor.
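      Catching a faked Googlebot user agent is doable without any shared reputation data, because Google documents a verification method: reverse DNS on the visitor's IP should give a googlebot.com or google.com hostname, and a forward lookup of that hostname must resolve back to the same IP. A minimal sketch (function names are mine, not from any particular library):

      ```python
      import socket

      def _looks_like_google_host(host: str) -> bool:
          """True if a reverse-DNS hostname sits under Google's crawler domains."""
          return host.endswith((".googlebot.com", ".google.com"))

      def is_real_googlebot(ip: str) -> bool:
          """Verify a visitor claiming to be Googlebot:
          reverse DNS on the IP, check the domain, then confirm the
          hostname forward-resolves to the same IP."""
          try:
              host, _, _ = socket.gethostbyaddr(ip)          # reverse DNS (PTR)
              if not _looks_like_google_host(host):
                  return False
              return ip in socket.gethostbyname_ex(host)[2]  # forward confirmation
          except OSError:  # no PTR record, or lookup failed
              return False
      ```

      The forward-confirmation step matters: anyone can set a PTR record on their own IP range claiming to be `crawl-1-2-3-4.googlebot.com`, but they can't make Google's DNS resolve that name back to their IP.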

  2. 1

    I think if you could do it, it would be valuable.

    But, it feels like if this was something that WAS doable, it would already exist.

    I think most websites rely mostly on heuristics and behavioural data to do this stuff, and that most of that is pretty specific to an individual site.

    That said, one way you may be able to make this work is to target specific platforms and cater to that narrower problem. Then you don't have to solve bots everywhere, just on Etsy (for example).

    1. 1

      They do exist - though mostly as external proxies (the Cloudflare-style model) or as in-cluster appliances for enterprises (though they don't list prices). I found one on Product Hunt, for example, called ShieldSquare; it was later acquired by a larger security company.

      Other than rolling your own solution (usually with simple user-agent comparisons) or firewall blocklists (often at the country level), there's nothing for small and medium-sized websites that provides even the first 20% of a solution without significant developer time and effort.
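      For context, the "roll your own" baseline is usually nothing more than substring matching on the User-Agent header. A sketch of that (the blocklist entries are illustrative examples, not a recommendation):

      ```python
      # Hand-maintained blocklist of User-Agent substrings. Entries here are
      # illustrative: default library UAs and a couple of crawlers that some
      # site operators choose to block.
      BAD_UA_SUBSTRINGS = [
          "python-requests",  # library default, rarely a real browser
          "curl/",
          "MJ12bot",
          "AhrefsBot",
      ]

      def is_blocked_user_agent(user_agent: str) -> bool:
          """Case-insensitive check of a User-Agent against the blocklist."""
          ua = user_agent.lower()
          return any(s.lower() in ua for s in BAD_UA_SUBSTRINGS)
      ```

      This is exactly the fragile part: it catches only bots honest enough to identify themselves, which is why it's at best the first 20% of a solution.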

      As Cloudflare points out, it's something that can also get more interesting at scale, where you can then detect the same (range of) IPs going to multiple, disparate websites over time.
