10 Comments

What are your best anti-web-scraping stories?

We've all tried to scrape a website and found ourselves limited in some way, whether through elements not loading or through being outright blocked. At a certain point, scraping the site becomes a matter of pride.

What sites have you struggled against and finally overcome? Which ones didn't you manage to overcome?

  1.

    We've overcome lazy-loading images, smooth scrolling, ads, vertically resizing images, and pages that never finish 'loading' because of ads. We also had some fun scraping Amazon product pages and dealing with IP allocation in GCP.

    I can't even list it all.

    TBH I'd never build a product based on scraping of "random websites" again. So much work and it's never 100% rock solid :/

    1.

      Haha yes, it's shifting sands, isn't it? Lazy loading is a particular pain; it can be difficult at first to determine what is happening without the right tools. Waiting on the page for a certain time limit, or waiting for a particular element to load, seem like good ways around it.
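The "wait for a particular element" approach boils down to polling a condition with a timeout, which is the pattern tools like Selenium's WebDriverWait implement. A minimal, dependency-free sketch of the idea (the predicate you pass in is whatever check your scraper needs, e.g. a CSS selector matching):

```python
import time

def wait_for(predicate, timeout=10.0, poll=0.5):
    """Poll `predicate` until it returns a truthy value, or raise on timeout.

    This is the explicit-wait pattern: instead of sleeping for a fixed time
    and hoping lazy-loaded content has arrived, keep re-checking for the
    thing you actually need and return it as soon as it appears.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout:.1f}s")
        time.sleep(poll)
```

With a real browser driver the predicate would be something like `lambda: driver.find_elements(By.CSS_SELECTOR, ".product-card")`, so the wait ends the moment the lazy-loaded cards exist.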

  2.

    Not the best or the hardest, but here's what I recently discovered for the side project I am building with Automatio.co, Webflow, Zapier and Google Sheets: one of the websites I am monitoring displays different data for different IPs / countries.

    It's not that the data is internationalized; it's a trick they use to make it harder for scrapers to collect their data. So you basically need to run multiple scrapers against a given website, accessing it from ~10 - 15 different IPs, to get all the data.
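One way to handle this per-IP trick is to fetch the same page through several proxy exits and union the results. A rough stdlib-only sketch; the proxy addresses and the `item["id"]` field are made up for illustration, and in practice you would plug in whatever HTTP client and parser you already use:

```python
import urllib.request

# Hypothetical proxy pool; in practice these would be exits in the
# countries whose view of the site you need.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]

def fetch_via(proxy, url):
    """Fetch `url` through one proxy using only the standard library."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    with opener.open(url, timeout=30) as resp:
        return resp.read()

def merge_views(views):
    """Union the per-IP views of a listing, deduplicating by item id."""
    seen = {}
    for view in views:
        for item in view:
            seen.setdefault(item["id"], item)
    return list(seen.values())
```

Each proxy sees its own subset of the data; the merge step is what finally gives you the complete picture.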

  3.

    Routing the traffic through obscure networks could be the most reliable way.
    e.g.

    1. There are services that pay home users to share their bandwidth; I presume their IP addresses are pristine and have a lower chance of being flagged.
    2. Then there are services that employ real human beings to solve captchas and promise low latency.
    3. Then there's mimicking user behaviour as closely as possible, i.e. intervals between clicks, screen time, etc.

    I haven't used the above services; I'm just sharing my opinion from a technology standpoint.
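Point 3 above, randomized intervals between actions, is easy to sketch: compute a jittered delay rather than a fixed one, so the request cadence doesn't look machine-regular. The base/jitter values here are arbitrary examples, not tuned numbers:

```python
import random
import time

def jittered_delay(base=1.5, jitter=1.0, rng=random.random):
    """Return `base` plus up to `jitter` extra seconds.

    A fixed sleep produces a perfectly regular request cadence, which is
    an easy fingerprint; adding random jitter blurs that signal.
    """
    return base + rng() * jitter

def human_pause(base=1.5, jitter=1.0):
    """Sleep for a randomized, human-ish interval between actions."""
    delay = jittered_delay(base, jitter)
    time.sleep(delay)
    return delay
```

Calling `human_pause()` between clicks or page loads is the simplest version of this; fancier setups also randomize scroll positions and mouse paths.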

    1.

      I'd say, based on what I've seen, only a very small minority of websites employ anti-scraping measures that would require this sort of thing; I suspect a lot of people simply give up and move on! I think it's largely a volume thing: the higher the volume of pages visited and the load on the website, the more this stuff becomes an issue.

      1.

        Sites that are worth scraping, i.e. contain valuable data, are the ones that employ stringent anti-scraping techniques, as they are the ones facing industrial-scale scraping.

  4.

    I was scraping the Google AdWords tool at a time when they were doing drastic UX rework multiple times a week; really bad timing ^^

    1.

      Haha yes! UI/UX changes are the Achilles heel of web scrapers. "Self healing" scrapers have long been talked about, but a complete rework of the site seems like a pretty insurmountable challenge.

      1.

        The changes were really big.. at times I thought I was scraping under A/B testing of different versions as well. Everything changed several times: layout, terminology, buttons, columns, not just trivial stuff.

        1.

          Wow, that is really challenging. I can imagine that would be the case with such a high-volume product, though.
