Hey mates, I've spent a considerable amount of time fighting SPAM during the last few months, and I'd be happy to exchange ideas with you. For example: what cases have you run into, and what measures have you taken?
Here are my examples:
Case one. Users can submit their relevant stories to LibHunt. Those stories will appear on the main feed and will be considered for inclusion in the upcoming newsletter. Apparently, a registration process is not a big enough deterrent for spammers.
My measures:
Introduced Google's reCAPTCHA (this definitely decreased the SPAM entries, yet did not eliminate them)
Introduced a Recursive Bayesian Filter to score the spam level based on each story's content and mark stories as SPAM/not-spam. This is imperfect; however, it helps to the extent that suspicious entries never reach the feed. That disincentivizes submission and has led to a decrease in SPAM submissions. Still, it is not perfect, and I'm spending about half an hour a week reviewing stories. I will most probably rewrite this soon to use the ML technique described in "case two" below, as that seems to be much more effective.
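The scoring idea can be sketched as a minimal word-frequency naive Bayes classifier. This is a simplified stand-in for the filter described above, not LibHunt's actual code; the training phrases and the 0.5 threshold are purely illustrative:

```python
import math
from collections import Counter

class BayesSpamScorer:
    """Tiny naive Bayes spam scorer over word frequencies (illustrative)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_probability(self, text):
        # Work in log space with Laplace smoothing to avoid zero counts.
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            log_prob = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                log_prob += math.log((count + 1) / (total_words + len(vocab)))
            scores[label] = log_prob
        # Normalize the two log scores back into a probability.
        max_score = max(scores.values())
        exp = {k: math.exp(v - max_score) for k, v in scores.items()}
        return exp["spam"] / (exp["spam"] + exp["ham"])

scorer = BayesSpamScorer()
scorer.train("buy cheap pills now", "spam")
scorer.train("free casino bonus now", "spam")
scorer.train("new release of the rails framework", "ham")
scorer.train("benchmarking rust web frameworks", "ham")
print(scorer.spam_probability("cheap casino pills"))   # close to 1
print(scorer.spam_probability("rust framework release"))  # close to 0
```

In practice you would train on the stories you've already marked as SPAM/not-spam and hold suspicious-scoring entries back from the feed for review.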
Case two. SaaSHub gets spam in the reviews and comments sections, as well as irrelevant websites submitted as software products. Since links on SaaSHub are do-follow, it's an enticing target for spammers. In fact, the amount of SPAM on SaaSHub is many times greater than that on LibHunt.
My measures:
Introduced a simple honeypot with a hidden "company_name" field. I took this idea and advice from StarterStory, and it was very effective. To be honest, it was a big win with almost no effort and eradicated all automated bot SPAM.
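The server-side half of the honeypot is a one-liner. A sketch, assuming the "company_name" input is rendered but hidden off-screen with CSS so humans never fill it in (the field and form keys here are illustrative):

```python
def looks_like_bot(form_data: dict) -> bool:
    """Bots tend to fill in every field they see; humans never see this one.

    The hidden 'company_name' input is moved off-screen via CSS, so any
    non-empty value means the form was filled by an automated script.
    """
    return bool(form_data.get("company_name", "").strip())

# A human submission leaves the honeypot field empty:
print(looks_like_bot({"title": "My story", "company_name": ""}))        # False
# A bot dutifully fills it in:
print(looks_like_bot({"title": "CHEAP PILLS", "company_name": "Acme"})) # True
```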
Introduced a manual approval system for everything: Products / Reviews / Comments. All submitted products go through a manual approval process; Reviews and Comments approvals are semi-automated.
Introduced some machine learning - LightGBM, a fast, distributed, high-performance gradient boosting framework. Based on a number of features specific to each model and the history of manual approvals and rejections, I compute an "approve score" for each Product/Review/Comment. It is much more complex than the honeypot example above, but it is very effective. I introduced this technique recently, and it has decreased my manual work by a factor of 2 to 10. What is more, "shady" and questionable reviews are not published automatically anymore. I'm aware that no ML is or will be perfect, but I'm quite satisfied with it so far. On top of that, it felt like a nice tech adventure and opened the door for more automation 🤖
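The feature-extraction step might look something like this. The exact features SaaSHub uses aren't public, so everything below (field names, the feature list) is a hypothetical illustration of turning a review into a vector that a LightGBM classifier could score:

```python
import re

def review_features(review: dict) -> list:
    """Turn a submitted review into a numeric feature vector.

    These features are hypothetical examples, not SaaSHub's actual model
    inputs; the real feature set is not public.
    """
    text = review["text"]
    links = re.findall(r"https?://\S+", text)
    words = text.split()
    return [
        len(text),                                   # content length
        len(links),                                  # number of links
        len(set(w.lower() for w in words)) / max(len(words), 1),  # lexical diversity
        review.get("account_age_days", 0),           # how new the account is
        int(review.get("email_verified", False)),    # verified email?
    ]

# The vectors, paired with historical approve/reject labels, would then
# train a gradient boosting classifier, e.g. with LightGBM:
#   model = lightgbm.LGBMClassifier().fit(X_train, y_train)
#   approve_score = model.predict_proba(X_new)[:, 1]
print(review_features({"text": "Great tool! See https://example.com",
                       "account_age_days": 3, "email_verified": True}))
```

Anything scoring below a chosen threshold stays in the manual-review queue instead of being published automatically.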
What is your experience?
Most spammers on IH are humans. They're crazy persistent, and they tend to post at regular times each day. I'm not sure what their deal is. We delete their spam pretty quickly, so they're not in it for the traffic. We also add rel="nofollow" to links, so it's not for SEO juice. My guess is that they're employees of cheap firms paid by clients to post spam in various places, and they're just going through the motions.

The best future methods will probably involve more restrictions on brand-new accounts. Basic stuff like requiring email verification to post, or requiring point thresholds to post links, etc.
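For anyone wanting the same nofollow treatment, here is one minimal way to patch user-submitted links. This is a regex sketch for illustration only; a production site would normally do this inside an HTML sanitizer rather than with regular expressions:

```python
import re

def add_nofollow(html: str) -> str:
    """Add rel="nofollow" to every <a> tag that doesn't already carry a rel.

    Simplified sketch: real-world markup has edge cases that regexes miss,
    so prefer a parser-based sanitizer in production.
    """
    def patch(match):
        tag = match.group(0)
        if "rel=" in tag:
            return tag  # leave existing rel attributes alone
        return tag[:-1] + ' rel="nofollow">'
    return re.sub(r"<a\b[^>]*>", patch, html)

print(add_nofollow('<a href="https://spam.example">click</a>'))
# <a href="https://spam.example" rel="nofollow">click</a>
```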
What are some automatic solutions to stop spam where the content/audience is not in the English language? I ask because I understand that these automatic solutions are based on language rules.
If you don't know the language being spoken… I have no idea :-D
Thanks @csallen. I've considered using StopForumSpam, AbuseIPDB, or OopSPAM. I will most probably give one of those a try as well.
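StopForumSpam, for example, exposes a simple HTTP lookup API. A sketch of building the query URL (no network call here; parameter names follow the public API docs, but double-check them against stopforumspam.com before relying on this):

```python
from urllib.parse import urlencode

def stopforumspam_url(ip=None, email=None, username=None) -> str:
    """Build a StopForumSpam lookup URL (the trailing &json asks for JSON).

    Only the parameters you pass are included in the query string.
    """
    params = {k: v for k, v in
              {"ip": ip, "email": email, "username": username}.items() if v}
    return "https://api.stopforumspam.org/api?" + urlencode(params) + "&json"

url = stopforumspam_url(ip="203.0.113.7", email="bot@example.com")
print(url)
# Fetch it with urllib.request or requests; the response's "appears" and
# "frequency" fields indicate how often the address has been reported.
```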
Depending on the spam type and the reasoning behind it... many spammers are either bots or low-paid human workers. The incentive is usually some kind of marketing - getting backlinks, for example.
Nice list. Also, I haven't made use of "User Agent" info yet, but it makes sense that it could be useful.
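A coarse User-Agent check could start as simply as this; the fragment list is illustrative, and since real bots often spoof browser UAs, it is one weak signal among many rather than a verdict:

```python
SUSPICIOUS_UA_FRAGMENTS = ("curl", "python-requests", "wget", "headless")

def suspicious_user_agent(ua: str) -> bool:
    """Flag requests whose User-Agent hints at a script or headless browser.

    An empty UA is also treated as suspicious; browsers always send one.
    """
    ua = (ua or "").lower()
    return ua == "" or any(frag in ua for frag in SUSPICIOUS_UA_FRAGMENTS)

print(suspicious_user_agent("python-requests/2.31.0"))                  # True
print(suspicious_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
```

The UA could also be fed into the approve-score model above as just another feature.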
Are the machine learning / Recursive Bayesian Filter approaches a fit for cases where the website/audience language is not English?
You should be able to use them with the same success on non-English data.