
How do you fight SPAM?

Hey mates, I've spent a considerable amount of time fighting SPAM over the last few months, and I'd be happy to exchange ideas with you. For example: what problems have you run into, and what measures have you taken?

Here are my examples:

Case one. Users can submit their relevant stories to LibHunt. Those stories will appear on the main feed and will be considered for inclusion in the upcoming newsletter. Apparently, a registration process is not a big enough deterrent for spammers.

My measures:

  • Introduced Google's reCAPTCHA (this definitely decreased the SPAM entries, yet
    didn't eliminate them)

  • Introduced a Bayesian filter that scores each story's spam level based on its content and marks it as SPAM/not-SPAM. It's imperfect, but it helps to the extent that suspicious entries never reach the feed. That disincentivizes spammers and has led to a decrease in SPAM submissions; still, I'm spending something like half an hour a week reviewing stories. I will most probably rewrite it soon to use the ML technique described in "case two" below, as that seems to be much more effective. (A minimal sketch of this kind of filter follows this list.)
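
For the curious, here's a minimal sketch of what such a content-based filter can look like. I'm assuming a naive Bayes approach via scikit-learn; the training data and the cutoff are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: story titles labeled spam (1) / ham (0).
stories = [
    "Best cheap replica watches online",         # spam
    "Get 10000 followers fast, guaranteed",      # spam
    "A deep dive into Rust async runtimes",      # ham
    "Benchmarking PostgreSQL connection pools",  # ham
]
labels = [1, 1, 0, 0]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(stories, labels)

# Score a new submission; anything above a cutoff never reaches the feed
# and gets queued for manual review instead.
spam_prob = model.predict_proba(["Cheap replica watches, fast shipping"])[0][1]
print(f"spam score: {spam_prob:.2f}")  # e.g. hold back anything above 0.8
```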

Case two. SaaSHub gets SPAM in the reviews and comments sections, as well as irrelevant websites submitted as software products. Because links on SaaSHub are do-follow, it's an enticing target for spammers. In fact, the amount of SPAM on SaaSHub is many times greater than on LibHunt.

My measures:

  • Introduced a simple honeypot with a hidden "company_name" field. I took this idea from StarterStory, and it was very effective. To be honest, it was a big win for almost no effort, and it eradicated all automated bot SPAM. (See the first sketch after this list.)

  • Introduced a manual approval system for everything: Products, Reviews, and Comments. All submitted products go through manual approval; Review and Comment approvals are semi-automated.

  • Introduced some machine learning: LightGBM, a fast, distributed, high-performance gradient boosting framework. Based on a number of features specific to each model and on historical manual approvals and rejections, I compute an "approve score" for each Product/Review/Comment. It is much more complex than the honeypot example above, but it is very effective. I introduced this technique recently, and it has cut my manual work by a factor of 2 to 10. What's more, "shady" and questionable reviews are not published automatically anymore. I'm aware that no ML is or will be perfect, but I'm quite satisfied with it so far. It also felt like a nice tech adventure and opened the door for more automation 🤖 (See the second sketch after this list.)
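
First sketch, the honeypot: a minimal version using Flask. The field and route names are just for illustration; the point is that humans never see the CSS-hidden field, while naive bots fill in every input:

```python
from flask import Flask, request, abort

app = Flask(__name__)

# The form ships with an extra input that's hidden from humans via CSS.
# Naive bots auto-fill every field, so a non-empty value means "bot".
REVIEW_FORM = """
<form method="post" action="/reviews">
  <textarea name="body"></textarea>
  <input type="text" name="company_name"
         style="display:none" tabindex="-1" autocomplete="off">
  <button type="submit">Submit</button>
</form>
"""

@app.route("/reviews", methods=["POST"])
def submit_review():
    if request.form.get("company_name"):
        abort(400)  # honeypot tripped: reject silently
    body = request.form.get("body", "")
    # ... queue `body` for the (semi-)automated approval flow ...
    return "Thanks! Your review is pending approval."
```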
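
And the second sketch: the LightGBM "approve score" idea in miniature. The features here are hypothetical stand-ins (the real models use many more signals per Product/Review/Comment), and the thresholds are made up:

```python
import numpy as np
import lightgbm as lgb

# Hypothetical features per submission: account age in days, number of
# links in the text, text length, and whether the account has previously
# approved submissions. Labels come from historical manual decisions.
X = np.array([
    [400, 0, 350, 1],
    [  2, 4,  60, 0],
    [150, 1, 500, 1],
    [  1, 6,  40, 0],
    [ 30, 0, 200, 1],
    [  3, 5,  90, 0],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = approved, 0 = rejected

model = lgb.LGBMClassifier(n_estimators=50, min_child_samples=1)
model.fit(X, y)

# The "approve score" is the predicted probability of approval:
# auto-publish above a high threshold, auto-reject below a low one,
# and queue the gray zone in between for manual review.
new_submission = np.array([[10, 2, 120, 0]])
approve_score = model.predict_proba(new_submission)[0][1]
print(f"approve score: {approve_score:.2f}")
```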

What is your experience?

  1. 4

    Most spammers on IH are humans. They're crazy persistent, and they tend to post at regular times each day. I'm not sure what their deal is. We delete their spam pretty quickly, so they're not in it for the traffic. We also add rel="nofollow" to links, so it's not for SEO juice. My guess is that they're employees of cheap firms paid by clients to post spam in various places, and they're just going through the motions.

    • Consequently, the honeypot field doesn't work, because these are humans who can't even see it.
    • We require products to go through a manual approval system, but that's not enough, because almost all spam comes in the form of posts to the site.
    • Leveraging the community. We get pinged on Slack whenever you report a post on IH, and we can usually respond quite quickly.
    • Some simple rules and scoring mechanisms on the back end that will shadow ban overly suspicious accounts.
    • Manual review. @rosiesherry has done some monumental work here, sometimes cleaning up thousands of spammer accounts in a single week.
    • Similar checks on the front end that will simply prevent making suspicious posts in the first place, especially for brand new accounts that try to make posts with links in them. This has been extremely effective in recent weeks.
    • Looking at the initial referrer. Lots of spammers come directly from sites like this, because IH is just one of the many sites they spam daily.
    • We plug into the StopForumSpam.com API and contribute to it as well, but to be honest, it's not effective.

    The best future methods will probably involve more restrictions on brand new accounts. Basic stuff like requiring email verification to post, or requiring point thresholds to post links, etc.
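
    In code, that kind of gate could look something like this (a sketch; the names and thresholds are hypothetical):

    ```python
    import re
    from dataclasses import dataclass

    LINK_RE = re.compile(r"https?://", re.IGNORECASE)
    MIN_POINTS_FOR_LINKS = 10  # assumed reputation threshold

    @dataclass
    class Account:
        email_verified: bool
        points: int

    def may_post(account: Account, body: str) -> bool:
        # Require email verification before posting at all.
        if not account.email_verified:
            return False
        # Require a point threshold before posts may contain links.
        if LINK_RE.search(body) and account.points < MIN_POINTS_FOR_LINKS:
            return False
        return True
    ```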

    1. 1

      What are some automatic solutions to stop spam where the content/audience is not in the English language? I understand that these automatic solutions are based on language rules..

      1. 1

        If you don't know the language being spoken… I have no idea :-D

    2. 1

      Thanks @csallen. I've considered using StopForumSpam, AbuseIPDB, or OopSPAM. I will most probably give one of those a try as well.

  2. 2

    It depends on the spam type and reasoning...
    Many are either bots or low-paid human work...
    The incentive is usually some kind of marketing, getting backlinks for example...

    • Honeypots, both personal and global ones like projecthoneypot, are very useful for generic forums...
    • Depending on your crowd, and on how aggressive you're comfortable being vs. the risk of limiting some access, you can block huge lists of cloud/server hosts etc... (bots often use these, while most real users are on ISP connections)
    • You can look at your site stats to make specific decisions, like blocking a specific IP / user agent / request type... (is your site tied to a specific location while all the spam comes from certain countries? does the spam come from a browser version that is >10 years old when none of your normal users are on it? or from something obviously marked as a non-browser user agent?)
    • A bit more work, but it reduces bots and is fine for users: have the server put a unique hidden code in the form and check for it on submit (see the first sketch after this list).
    • Some of the best blocks are behaviour-based, denying spammers the result they want... If I were working on IH, I'd just discard all posts that put a URL in the message title. Another behaviour-based approach is what some forums do: requiring you to earn points before getting the privilege to post or add a link, etc...
    • Traffic / speed limiting... you might see the same IP posting a lot, when it wouldn't make sense for a normal user to post more than X times in Y time (see the second sketch after this list).
    • Non-immediate responses... posts are commonly either published immediately or manually reviewed; adding a delay might make most spammers assume they're being manually reviewed, which would drastically reduce attempts.
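
    First sketch, for the hidden form code: one way to do it without server-side storage is an HMAC-signed timestamp (the secret and the field name are placeholders):

    ```python
    import hashlib
    import hmac
    import time

    SECRET = b"server-side-secret"  # placeholder; keep it out of source control

    def issue_token() -> str:
        # Rendered into the form as <input type="hidden" name="form_token" ...>.
        ts = str(int(time.time()))
        sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
        return f"{ts}:{sig}"

    def verify_token(token: str, max_age: int = 3600) -> bool:
        # Reject tokens that are malformed, forged, or too old.
        try:
            ts, sig = token.split(":")
            age = time.time() - int(ts)
        except ValueError:
            return False
        expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(sig, expected) and 0 <= age < max_age
    ```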
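
    Second sketch, for the "more than X in Y time" rule: a minimal in-memory sliding-window limiter per IP (X and Y are placeholders; a real deployment would keep this state in something like Redis):

    ```python
    import time
    from collections import defaultdict, deque

    MAX_POSTS = 5          # X: allowed posts ...
    WINDOW_SECONDS = 600   # ... per Y seconds
    _recent = defaultdict(deque)  # ip -> timestamps of recent posts

    def allow_post(ip: str) -> bool:
        now = time.time()
        posts = _recent[ip]
        # Drop timestamps that have fallen out of the window.
        while posts and now - posts[0] > WINDOW_SECONDS:
            posts.popleft()
        if len(posts) >= MAX_POSTS:
            return False  # over the limit: reject or queue for review
        posts.append(now)
        return True
    ```
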
    1. 1

      Nice list. I haven't put "User Agent" info to use yet, but it makes sense that it could be useful.

  3. 1

    Are the machine learning / Bayesian filter approaches a fit for cases where the website / audience language is not English?

    1. 1

      You should be able to use them with the same success on non-English data.
