We've all tried to scrape a website and found ourselves being limited in some way, either through elements not loading, or being outright blocked. At a certain point, scraping the site becomes a matter of pride.
What sites have you struggled against and finally overcome? Which ones didn't you manage to overcome?
We've overcome lazy-loading images, smooth scrolling, ads, vertically resizing images, and pages that never finish 'loading' because of ads. We also had some fun scraping Amazon product pages and dealing with IP allocation in GCP.
I can't even list it all.
TBH I'd never build a product based on scraping of "random websites" again. So much work and it's never 100% rock solid :/
Haha yes, it's shifting sands, isn't it? Lazy loading is a particular pain; it can be difficult at first to determine what is happening without the right tools. Waiting on the page for a set time limit, or waiting for a particular element to load, seem like good ways around it.
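The "wait for a particular element" approach boils down to polling with a deadline. A minimal sketch of that idea as a generic helper (the `driver` call in the comment is a hypothetical browser-automation API, not a specific library):

```python
import time

def wait_for(condition, timeout=10.0, poll_interval=0.25):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    `condition` is any zero-argument callable, e.g. a lambda that queries
    the page for the element you expect lazy loading to produce.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll_interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Hypothetical usage with a browser-automation driver (illustrative only):
#   products = wait_for(lambda: driver.query_selector_all(".product"))
```

Tools like Selenium and Playwright ship their own explicit-wait helpers built on the same pattern, so in practice you'd reach for those rather than rolling your own.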
Not the best or the hardest, but in short: what I recently discovered for the side project I'm building using Automatio.co, Webflow, Zapier and Google Sheets is that one of the websites I'm monitoring displays different data for different IPs/countries.
It's not that the data is internationalized; it's a trick they use to make it harder for scrapers to collect their data. So you basically need to run multiple scrapers against the website, accessing it from ~10-15 different IPs, to get all the data.
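Once you've run the same scrape from several vantage points, you still need to union the results. A minimal sketch, assuming each run yields records with some stable "id" field (the field name and proxy labels are illustrative):

```python
def merge_runs(runs):
    """Union records from several scraper runs (one per exit IP / country).

    `runs` maps a label (e.g. the proxy's country) to a list of records;
    each record is a dict with a stable "id" field. The first sighting of
    each record wins, so later runs don't overwrite earlier ones.
    """
    merged = {}
    for label, records in runs.items():
        for record in records:
            merged.setdefault(record["id"], record)
    return list(merged.values())

# Example: two vantage points see overlapping but different subsets.
runs = {
    "us-proxy": [{"id": 1, "price": "$10"}, {"id": 2, "price": "$12"}],
    "de-proxy": [{"id": 2, "price": "11 EUR"}, {"id": 3, "price": "13 EUR"}],
}
print(len(merge_runs(runs)))  # 3 unique records across both vantage points
```

If the site shows genuinely different field values per country (not just different subsets), you'd want to keep the per-country copies side by side instead of deduplicating.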
Routing the traffic through obscure networks, e.g. commercial proxy services, could be the most reliable way.
I haven't used such services myself, just sharing my opinion from a technology standpoint.
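Mechanically, spreading requests across exit IPs can be as simple as round-robin rotation over a proxy pool. A minimal sketch (the proxy addresses are placeholders, not real endpoints):

```python
from itertools import cycle

# Hypothetical proxy endpoints; substitute whatever exit nodes you rent.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

proxy_pool = cycle(PROXIES)

def next_proxy():
    """Return the next proxy in round-robin order, wrapping around."""
    return next(proxy_pool)

# Each fetch would then go out through a different exit IP, e.g. with the
# stdlib: urllib.request.ProxyHandler({"http": next_proxy()})
```

Commercial rotating-proxy services do this (plus IP health checks and geo-targeting) behind a single gateway, so client code often doesn't need the rotation logic at all.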
I'd say, based on what I've seen, only a very small minority of websites employ anti-scraping measures that would require this sort of thing; I suspect a lot of people simply give up and move on! I think it's largely a volume thing: the higher the volume of pages visited and the load on the website, the more this stuff becomes an issue.
Sites which are worth scraping, i.e. which contain valuable data, are the ones which employ stringent anti-scraping techniques, as they are the ones facing industrial-scale scraping.
I was scraping the Google AdWords tool at a time when they were doing drastic UX work multiple times a week. Really bad timing ^^
Haha yes! UI/UX changes are the Achilles heel of web scrapers. "Self-healing" scrapers have long been talked about, but a complete rework of the site seems like a pretty insurmountable challenge.
The changes were really big. At times I thought I was scraping under A/B testing of different versions as well; everything changed several times (layout, terminology, buttons, columns), not just trivial stuff.
Wow, that is really challenging. I can imagine that would be the case with a product with such high volume, though.