Back in 2017 I was working at a Fortune 500 company, leading the engineering efforts of a team whose whole purpose was data mining and data extraction. The business world moves swiftly, and keeping up with the ever-changing world of e-commerce products can be overwhelming, especially when customers have plenty of options on where to shop. These days web scraping is ubiquitous among big e-commerce companies because the data-driven decision making it enables is essential to staying competitive in such a tight-margin business.
This is why we had a team to build, maintain, and run scrapers that provided updates on pricing, product availability, and other product details across e-commerce websites by crawling them at custom intervals.
Web scraping can look deceptively easy these days. There are numerous open-source libraries and frameworks, visual scraping tools, and data extraction services that make it very easy to scrape data from a website. However, when you want to scrape websites at scale, things get very tricky, very fast. Unlike a standard web scraping application, scraping e-commerce product data at scale comes with a unique set of challenges that make the job vastly more difficult. At their core, the problems we faced boiled down to two things: one demanded a humongous amount of repetitive effort, and the other came down to scaling costs.
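To show just how easy the baseline case looks, here is a minimal sketch of scraping one product page with requests and BeautifulSoup. The URL and CSS selectors are hypothetical placeholders, not taken from any real site.

```python
# A minimal sketch of the "easy" case: one page, one request, two fields.
# The URL and CSS selectors below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://shop.example.com/product/123", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
title = soup.select_one("h1.product-title")  # hypothetical selector
price = soup.select_one("span.price")        # hypothetical selector

print(title.get_text(strip=True) if title else "title not found")
print(price.get_text(strip=True) if price else "price not found")
```

A dozen lines and you have data. The trouble starts when you need that to keep working, unattended, across hundreds of sites.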
Sloppy and Always Changing Website Formats
When scraping at our scale, we had to navigate hundreds of websites with sloppy, constantly evolving markup. Layout changes routinely broke our spiders, and adapting and updating them kept 30 engineers busy.
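One defensive pattern that softens the blow is sketched below, under assumed selectors: try several candidate selectors in priority order, so a single layout change degrades gracefully instead of silently breaking the spider.

```python
# A sketch of defensive extraction: try candidate selectors in priority
# order so one layout change doesn't immediately break the spider.
# All selectors here are illustrative assumptions, not from a real site.
from typing import Optional
from bs4 import BeautifulSoup

PRICE_SELECTORS = [
    "span.price--current",     # hypothetical current layout
    "div.product-price span",  # hypothetical previous layout
    "meta[itemprop='price']",  # schema.org fallback, often more stable
]

def extract_price(html: str) -> Optional[str]:
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node is None:
            continue
        # <meta> tags carry the value in an attribute, not in text.
        return node.get("content") or node.get_text(strip=True)
    return None  # every known selector failed: time to alert an engineer
```

Patterns like this buy you time, but they don't eliminate the maintenance burden; someone still has to add the new selector when a site changes.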
Anti-Bot Countermeasures
If you are scraping e-commerce sites at scale you are guaranteed to run into websites employing anti-bot countermeasures. For most smaller websites, these countermeasures are quite basic (banning IPs that make too many requests). However, larger e-commerce websites use sophisticated anti-bot countermeasures that make extracting data not necessarily more difficult, but definitely more expensive.
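The standard response, sketched below with placeholder proxies, is to throttle requests and rotate traffic through a proxy pool, which is exactly where the extra expense comes from.

```python
# A sketch of the usual answer to IP bans: rotate a proxy pool and add
# randomized delays. The proxy URLs are placeholders; in practice the
# pool is rented from a provider, which is where the cost comes in.
import itertools
import random
import time

import requests

PROXIES = itertools.cycle([
    "http://proxy-1.example:8080",  # placeholder
    "http://proxy-2.example:8080",  # placeholder
])

def polite_get(url: str, min_delay: float = 1.0, max_delay: float = 3.0) -> requests.Response:
    proxy = next(PROXIES)
    time.sleep(random.uniform(min_delay, max_delay))  # look less mechanical
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=15,
    )
```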
How Crawlify Was Born
Given the amount of manpower sunk into something as mundane as building and maintaining crawlers at my former job, I set out to develop a more automated solution. I presented an MVP, but my initiative had to navigate a lot of bureaucracy and never ended up being implemented.
Several months passed and my code was collecting dust when a friend working at a prop-tech startup asked for my help in extracting some data. I obliged and customized the code to automatically extract real estate data to train their algorithms. They were impressed with the solution, which led me to develop Crawlify into a full-fledged SaaS platform.
Crawlify’s pretrained AI data extraction API can be tried and tested on our website. As with any self-respecting machine learning system, Crawlify’s AI adapts as it learns from customers’ data and the websites it crawls. Our accuracy is over 98%, and in the very few situations where our algorithms can’t automatically identify the correct data fields, they ask a human operator for feedback and learn from it.
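Purely to illustrate the shape of such an API, here is a hypothetical call. The endpoint, request body, and response fields are all assumptions for the sake of the sketch; the real interface is documented on the website.

```python
# A purely illustrative sketch of calling an extraction API like Crawlify's.
# The endpoint, request body, and response fields are hypothetical;
# see the website for the real interface.
import requests

resp = requests.post(
    "https://api.crawlify.example/v1/extract",  # hypothetical endpoint
    json={"url": "https://shop.example.com/product/123"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # hypothetical auth scheme
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"title": ..., "price": ..., "availability": ...}
```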