We've all tried to scrape a website and found ourselves being limited in some way, either through elements not loading, or being outright blocked. At a certain point, scraping the site becomes a matter of pride.
What sites have you struggled against and finally overcome? Which ones didn't you manage to overcome?
We've overcome lazy-loading images, smooth scrolling, ads, vertically resizing images, and pages that never finish 'loading' because of ads. We also had some fun scraping Amazon product pages and dealing with IP allocation in GCP.
I can't even list it all.
TBH I'd never build a product based on scraping of "random websites" again. So much work and it's never 100% rock solid :/
Haha yes, it's shifting sands, isn't it? Lazy loading is a particular pain; it can be difficult at first to determine what is happening without the right tools. Waiting on the page for a set time limit, or waiting for a particular element to load, seem like good ways around it.
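The "wait for a particular element" approach boils down to polling with a deadline. A minimal sketch of that idea as a generic helper (the `driver` call in the comment is a hypothetical browser-automation API, not a specific library):

```python
import time

def wait_for(condition, timeout=10.0, poll_interval=0.25):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    `condition` is any zero-argument callable, e.g. a lambda that queries
    the page for the element you expect lazy loading to produce.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll_interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Hypothetical usage with a browser-automation driver (illustrative only):
#   products = wait_for(lambda: driver.query_selector_all(".product"))
```

Tools like Selenium and Playwright ship their own explicit-wait helpers built on the same pattern, so in practice you'd reach for those rather than rolling your own.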
Not the best or the hardest, but in short: what I recently discovered for the side project I'm building using Automatio.co, Webflow, Zapier and Google Sheets is that one of the websites I'm monitoring displays different data for different IPs/countries.
It's not that the data is internationalized; it's a trick they use to make it harder for scrapers to collect their data. So you basically need to run multiple scrapers against the website, accessing it from ~10-15 different IPs, to get all the data.
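Once you've run the same scrape from several vantage points, you still need to union the results. A minimal sketch, assuming each run yields records with some stable "id" field (the field name and proxy labels are illustrative):

```python
def merge_runs(runs):
    """Union records from several scraper runs (one per exit IP / country).

    `runs` maps a label (e.g. the proxy's country) to a list of records;
    each record is a dict with a stable "id" field. The first sighting of
    each record wins, so later runs don't overwrite earlier ones.
    """
    merged = {}
    for label, records in runs.items():
        for record in records:
            merged.setdefault(record["id"], record)
    return list(merged.values())

# Example: two vantage points see overlapping but different subsets.
runs = {
    "us-proxy": [{"id": 1, "price": "$10"}, {"id": 2, "price": "$12"}],
    "de-proxy": [{"id": 2, "price": "11 EUR"}, {"id": 3, "price": "13 EUR"}],
}
print(len(merge_runs(runs)))  # 3 unique records across both vantage points
```

If the site shows genuinely different field values per country (not just different subsets), you'd want to keep the per-country copies side by side instead of deduplicating.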
Routing the traffic through obscure networks, e.g. commercial proxy services, could be the most reliable way.
I haven't used such services myself, just sharing my opinion from a technology standpoint.
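Mechanically, spreading requests across exit IPs can be as simple as round-robin rotation over a proxy pool. A minimal sketch (the proxy addresses are placeholders, not real endpoints):

```python
from itertools import cycle

# Hypothetical proxy endpoints; substitute whatever exit nodes you rent.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

proxy_pool = cycle(PROXIES)

def next_proxy():
    """Return the next proxy in round-robin order, wrapping around."""
    return next(proxy_pool)

# Each fetch would then go out through a different exit IP, e.g. with the
# stdlib: urllib.request.ProxyHandler({"http": next_proxy()})
```

Commercial rotating-proxy services do this (plus IP health checks and geo-targeting) behind a single gateway, so client code often doesn't need the rotation logic at all.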
I'd say, based on what I've seen, only a very small minority of websites employ anti-scraping measures that would require this sort of thing; I suspect a lot of people simply give up and move on! I think it's largely a volume thing: the higher the volume of pages visited and the load on the website, the more this stuff becomes an issue.
Sites which are worth scraping, i.e. which contain valuable data, are the ones which employ stringent anti-scraping techniques, as they are the ones facing industrial-scale scraping.
I was scraping the Google AdWords tool at a time when they were doing drastic UX work multiple times a week. Really bad timing ^^
Haha yes! UI/UX changes are the Achilles heel of web scrapers. "Self-healing" scrapers have long been talked about, but a complete rework of the site seems like a pretty insurmountable challenge.
The changes were really big. At times I thought I was scraping under A/B testing of different versions as well; everything changed several times (layout, terminology, buttons, columns), not just trivial stuff.
Wow, that is really challenging. I can imagine that would be the case with a product with such high volume, though.