Reverse AMA: Tell everything you know about scraping!

by Omar

Hello everyone,

I'm trying a reverse AMA: a thread where the contributors share knowledge instead of asking questions. The goal is to build a mini knowledge base on a particular topic that everyone can benefit from.

For context: I'm personally asking because I'm currently doing product research and customer development on this topic. Any knowledge contribution would be much appreciated.

Let's share all the business cases, useful resources, software and techniques we know about scraping.
I will share my own findings as my research progresses.

Omar

on October 13, 2019

2

The first thing I would say is: don't worry about websites changing their markup or CSS classes. They mostly don't :)
I wrote scrapers for multiple housing websites and remember worrying about this. I set up loads of tests and monitoring to make sure I would detect any change in layout... Guess what? It's been 3 years and it's still the same layout.

When possible, open your network tab and reverse engineer their JSON API. Those change even less often! :)
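
A minimal Python sketch of the JSON API approach, using only the standard library. The `results`, `title` and `price` field names are assumptions for illustration; a real endpoint found in the network tab will have its own URL and shape.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10):
    """Call the JSON endpoint directly, the way the page's own JavaScript does."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def extract_listings(payload):
    """Keep only the fields we care about (field names are hypothetical)."""
    return [
        {"title": item["title"], "price": item["price"]}
        for item in payload.get("results", [])
    ]


# Offline demonstration with a payload shaped like the assumed API response.
sample_payload = json.loads(
    '{"results": [{"id": 7, "title": "2-room flat", "price": 900}]}'
)
print(extract_listings(sample_payload))
```

Against a live site you would pass the URL copied from the network tab to `fetch_json` and feed the result to `extract_listings`.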

Bipbop

·
6 years ago
·
Reply
1. 1
  
  That's an amazing tip @Bipbop! My experience concurs with yours, even for big names such as Amazon and LinkedIn.
  
  I would add that the mobile API for websites with a mobile app can be an even juicier source :) Debug it using Fiddler or Charles, which are tools made for debugging network requests.
  
  omneity
  
  ·
  6 years ago
  ·
  Reply
1
I have scraped everything: Google, Amazon, the Android store, the Apple store, LoopNet, PrestaShop shops, WordPress, Instagram, websites, APIs, ...

I built and sold a project that compares e-commerce products, to check competitors' prices easily.
I have also built a scraper that works in the browser! But it's very difficult to turn it into a no-code tool: without a little bit of code, you constantly run into new cases.

The first thing to do when scraping is to check whether the target is rendered server-side or client-side.
- If it's client-side, you will need something like a headless (virtual) browser to render the JavaScript (I use casper.js), because a simple curl call will not see anything.
- If you want to scrape a big site, use proxies (I use seoproxies.com), or you will be banned.
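
The server-side versus client-side check can be roughly automated. This is only a sketch, assuming you already know a piece of text that should be visible on the page (a product name, for instance): if that text is missing from the raw HTML outside of `<script>` tags, the page is probably rendered client-side. The function and class names are made up for the example.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def is_server_rendered(raw_html, expected_text):
    """True if the expected text is already present in the raw HTML markup."""
    parser = VisibleText()
    parser.feed(raw_html)
    parser.close()
    return expected_text in " ".join(parser.chunks)


server_page = "<html><body><h1>Blue Widget</h1><p>19.99</p></body></html>"
client_page = (
    '<html><body><div id="root"></div>'
    '<script>render({"name": "Blue Widget"})</script></body></html>'
)
print(is_server_rendered(server_page, "Blue Widget"))
print(is_server_rendered(client_page, "Blue Widget"))
```

In practice `raw_html` would be the body returned by a plain HTTP request (the equivalent of the curl call), with no JavaScript executed.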
I would say DON'T build your business on top of a website or API you can't control. If they change their terms, you are out. It has happened to me at least twice. Thanks, Skyscanner!

I will be happy to jump on any crawling and scraping project.
natzar

·
6 years ago
·
Reply
1. 1
  
  'CasperJS is no longer actively maintained' from the github repo.
  
  I would suggest nightmare instead for new projects. https://github.com/segmentio/nightmare
  If you can get the hang of casper, nightmare is a breeze.
  
  MrMiyagi
  
  ·
  6 years ago
  ·
  Reply
  1. 1
    
    Puppeteer is also very nice, especially since Chrome gained an official headless mode. It is officially maintained by the Google Chrome team.
    
    I recommend it: https://github.com/GoogleChrome/puppeteer
    
    On a separate topic, @natzar, have you had use cases for recurrent scraping, where you'd periodically extract data from a webpage over and over?
    
    I'd be very interested to exchange with you and learn from your expertise and knowledge. Can I reach out to you via email?
    
    omneity
    
    ·
    6 years ago
    ·
    Reply
1

I've seen a lot of automated web scraping stuff pop up lately.

Here's a nice one I saw recently on IndieHackers: https://tryspider.com/

I've used Scrapy before, which is very popular (Python library): https://scrapy.org/

Recently discovered Apify, a JavaScript equivalent of Scrapy: https://apify.com/

dkb868

·
6 years ago
·
Reply
1

I looked into this once and found this web scraping library for .NET (I'm not related to that company). It looked very powerful, if that platform works for you. I never ended up trying it, but the tutorials are an interesting read.

https://ironsoftware.com/csharp/webscraper/

CandelaSoftware

·
6 years ago
·
Reply
1. 1
  
  I think this would be great for Microsoft shops, or people who want to create desktop applications with scraping capabilities.
  
  Great find, thanks for sharing @CandelaSoftware!
  
  omneity
  
  ·
  6 years ago
  ·
  Reply
1

I have worked on Instagram scraping for the past few years, more specifically on using new technologies to power on-demand scraping at almost infinite scale. If that interests you, then AMA!

JustinCruz

·
6 years ago
·
Reply
1. 1
  
  I actually do have a few questions, thanks for chiming in @JustinCruz!
  What's the use case for scraping Instagram, and how is the data being used from a business perspective?
  
  Also, do you watch profiles for changes? What kind of data do you extract?
  
  omneity
  
  ·
  6 years ago
  ·
  Reply
  1. 2
    
    Mainly my goals revolve around hashtag utilization on Instagram. All sorts of frequencies and types of data. The insights collected power our SaaS app for Instagram business users, Curate. Here is a recent article I wrote that conveys what the business cares about.
    
    https://blog.curate-app.com/3-secret-strategies-using-hashtags-for-business-growth/
    
    JustinCruz
    
    ·
    6 years ago
    ·
    Reply
    1. 1
      
      Those are some amazing insights! They perfectly answer my question.
      
      Thank you for sharing your secret sauce, I promise to handle it well. 🙇‍♂️
      
      Now a separate question, not to sell you on Monitoro or anything, but I'd be interested to explore whether a scraping case of your scale is feasible / makes sense the way Monitoro is built, purely hypothetically.
      
      Would you be interested to help me figure this out? I'll only need some input from you if you have time.
      
      My email is omar -at- monitoro.xyz
      
      omneity
      
      ·
      6 years ago
      ·
      Reply
1

Hah, nice idea :) I know nothing about scraping though!

rosiesherry

·
6 years ago
·
Reply
1. 1
  
  Haha no worries! :-)
  
  Let's start with a basic introduction to the topic:
  
  Scraping is the act of opening a website programmatically to retrieve structured data.
  In slightly more technical terms, if a typical website or web app uses data to produce HTML, scraping is the opposite: taking HTML and producing data.
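
To make "taking HTML and producing data" concrete, here is a small Python sketch using only the standard library. The markup is invented for the example: it assumes each product is an `<li class="product">` element carrying a `data-price` attribute, which a real website will not necessarily do.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Turn <li class="product" data-price="..."> items into dictionaries."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._current = None  # the product being read, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "li" and attributes.get("class") == "product":
            price = float(attributes.get("data-price", 0))
            self._current = {"name": "", "price": price}

    def handle_data(self, data):
        if self._current is not None:
            self._current["name"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self.products.append(self._current)
            self._current = None


def html_to_data(html):
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


page = (
    '<ul><li class="product" data-price="9.5">Mug</li>'
    '<li class="product" data-price="20">Lamp</li><li>Ad</li></ul>'
)
print(html_to_data(page))
```

Libraries such as BeautifulSoup or Cheerio do the same job with far less code; the point here is only the direction of the transformation.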
  
  Now why and what do you want to scrape?
  
  A common case for scraping is running market studies, such as collecting prices of a given product category, or building databases, such as a list of content creators on YouTube.
  Now this data is expected to be used at a later stage in your business, to inform your decisions, or maybe to be your product itself (maybe you put sponsors in touch with content creators?)
  
  How can you perform scraping?
  
  There are a few dimensions you need to be concerned with. First and foremost, does the website you want to scrape use JavaScript, or is it plain HTML?
  This impacts everything down the line, from the technology needed to scrape to the cost of doing it.
  
  Once you figure that out (you can do so by disabling JavaScript in your browser and checking if the website still works), the next question is your stack of choice.
  
  There are different options depending on whether you're a developer, you prefer point-and-click tools, or you are anywhere in between. There are also data brokers you can reach out to directly to avoid scraping yourself.
  
  Common technologies in the scraping space for developers are Puppeteer, Scrapy, BeautifulSoup, Cheerio ...
  
  You can also find several Chrome extensions to scrape a website.
  
  Making the right choice depends on the website you want to scrape (does it use JavaScript? does it have specific countermeasures against scraping?) and on other constraints of your project (for example, how soon you need the data). There are appropriate tools to deal with each of these situations. For example, you'll often hear about (rotating) proxies, which make your requests appear to the target website as coming from different users, and so avoid triggering all kinds of automated protections.
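
The rotating proxy idea can be sketched in a few lines of Python with the standard library. The proxy addresses are placeholders and `ProxyRotator` is a name made up for this example; a proxy provider would give you real endpoints, and often does the rotation for you.

```python
from itertools import cycle
from urllib.request import ProxyHandler, build_opener


class ProxyRotator:
    """Hand out proxies round-robin so consecutive requests use different exits."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxy_urls)

    def next_proxy(self):
        return next(self._pool)

    def opener(self):
        """Build a urllib opener that routes traffic through the next proxy."""
        proxy = self.next_proxy()
        return build_opener(ProxyHandler({"http": proxy, "https": proxy}))


# Placeholder addresses, for illustration only.
rotator = ProxyRotator(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
print([rotator.next_proxy() for _ in range(3)])
```

Each request would then go through `rotator.opener().open(url, timeout=10)`, so that the target website sees a different address from one call to the next.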
  
  Captchas are another common constraint. Currently they are only solved by working around the website itself, or by outsourcing the captcha solving to dedicated services (which are typically nothing more than sweatshops).
  
  I hope this serves as a practical introduction to the topic @rosiesherry. Let me know if you have further questions or want to get your hands dirty :)
  
  omneity
  
  ·
  6 years ago
  ·
  Reply