
What Would You Teach a Web Scraping Beginner?

We see questions every day from beginners wondering how to get started with web scraping. Often they're asking about the best language to learn or the best libraries or tools. If you had to give advice to a complete beginner, how would you tell them to start? What is really important to learn, and what isn't?

  1. 3

    It really depends on your needs. Do you need a fast, high-volume scraper that doesn't render the page, or one that emulates human behavior? Are you a programmer (or someone who wants to learn to code), or a non-technical person looking for a no-code tool?

    If you want to write code and have full control over it, I would suggest Puppeteer, a Node.js library for controlling headless Chrome.
    You will have to design the logic and deal with proxies, scaling, and deployment, depending on your needs.

    If you are looking for a no-code solution, there are plenty of no-code or low-code scrapers, such as SimpleScraper, Octoparse, and Import.io. Most of them can satisfy your needs if the webpage you are dealing with is fairly standard and simple.

    If you need to deal with complex scenarios, where your bot needs to click, input data dynamically, solve captchas, and rotate proxies, and you don't want to worry about scaling, I would suggest checking out the tool I have been building for the last couple of years. It's called Automatio.co, and it's not just another web scraper but a visual web bot builder, which helps you deal with all kinds of complex scenarios on the web.

    The idea behind Automatio is to reproduce whatever manual work you have and create a bot without writing a single line of code.

    Hope this helps.

  2. 2

    Learn CSS.

    The biggest stumbling block I've observed is when users know what data they want from a page but don't know how to reference it.

    It's easy to scrape when the element has a convenient, unique ID but what if the ids and classnames are rotated / obfuscated? Or there simply aren't any IDs or classnames to begin with?

    Learning about CSS and selectors / tag hierarchy gives you super powers in that regard.

    1. 1

      +1 for CSS selectors for sure, this is the main thing that breaks scrapers. Learning to build them defensively and use more robust selectors (I say "more" because it's never 100%) helps.

  3. 1

    Hidden gem alert: Learn about XPath! I've been doing web dev for ages but never really got familiar with it until I had to do some webscraping of my own. It's really a game-changer for scrapers and enables you to pull off some actually neat tricks in a handful of keystrokes.
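
To make the XPath tip concrete, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup (a fuller engine such as lxml would be the usual choice for real pages). The HTML snippet and its values are made up for illustration. It selects data purely by document structure and text, which is what you fall back on when there are no IDs or class names to hook into.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet with no ids or class names to hook into.
HTML = """
<html><body>
  <table>
    <tr><th>Name</th><th>Price</th></tr>
    <tr><td>Widget</td><td>9.99</td></tr>
    <tr><td>Gadget</td><td>24.50</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(HTML)

# Every data row: any <tr> that has a <td> child (this skips the header row).
rows = root.findall(".//tr[td]")

# Positional predicates are 1-based: td[1] is the name, td[2] the price.
products = [(row.find("td[1]").text, float(row.find("td[2]").text)) for row in rows]

# Select a row by the text of one of its children instead of by id or class.
gadget_row = root.find(".//tr[td='Gadget']")
gadget_price = gadget_row.find("td[2]").text

print(products)
print(gadget_price)
```

Predicates like `[td]`, `[td='Gadget']`, and `td[2]` are the "handful of keystrokes" part: each one replaces a loop and a condition you would otherwise write by hand.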

  4. 1

    Python is the easiest, with a lot of libraries available for scraping. Learn the basics, then check out the requests, BeautifulSoup, and lxml modules. Start by scraping a simple directory-like website.

    CSS selectors and XPath are important skills to have for web scraping. CSS selectors are easy, so start with those, but XPath can be handy at times. Also learn about file handling and databases. You will often be saving the scraped data somewhere, so it's important.

    Finally, if you want to try some advanced stuff, look into regular expressions for complex text parsing.
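
As a rough sketch of that learning path (parse a directory-like page, pull a value out with a regular expression, save it to a database), the example below uses only Python's standard library so it runs without installing anything. It swaps in `html.parser` for BeautifulSoup/lxml and an inline string for a `requests` download. The page content, the `ListingParser` class, the phone pattern, and the `businesses` table are all assumptions made for illustration.

```python
import re
import sqlite3
from html.parser import HTMLParser

# Hypothetical directory-style page, inlined so the sketch runs without network access.
PAGE = """
<ul>
  <li><a href="/biz/1">Acme Plumbing</a> Call 555-0101 today</li>
  <li><a href="/biz/2">Bolt Electric</a> Phone: 555-0199</li>
</ul>
"""

# Assumed phone format for this made-up page: three digits, a dash, four digits.
PHONE_RE = re.compile(r"\b\d{3}-\d{4}\b")


class ListingParser(HTMLParser):
    """Collects one record per <li>: link target, link text, and the remaining text."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._current = {"url": None, "name": "", "rest": ""}
        elif tag == "a" and self._current is not None:
            self._current["url"] = dict(attrs).get("href")
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "li" and self._current is not None:
            self.records.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is None:
            return  # text outside any <li> is ignored
        key = "name" if self._in_link else "rest"
        self._current[key] += data


def scrape(page):
    """Return (name, url, phone) tuples; phone is None when the regex finds nothing."""
    parser = ListingParser()
    parser.feed(page)
    rows = []
    for rec in parser.records:
        match = PHONE_RE.search(rec["rest"])
        rows.append((rec["name"].strip(), rec["url"], match.group(0) if match else None))
    return rows


def save(rows, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS businesses (name TEXT, url TEXT PRIMARY KEY, phone TEXT)"
    )
    # INSERT OR REPLACE keeps re-runs from duplicating rows for the same URL.
    conn.executemany("INSERT OR REPLACE INTO businesses VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
rows = scrape(PAGE)
save(rows, conn)
stored = conn.execute("SELECT name, phone FROM businesses ORDER BY name").fetchall()
print(stored)
```

Keying the table on the URL means a scraper that is run repeatedly updates existing rows instead of piling up duplicates, which is one reason a database tends to beat a folder of output files.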

  5. 1

    Learn HTML and use a scraping API.

  6. 1

    Kind of off-topic, but I just thought it would be fun to build a series of web scraping challenges for beginners to learn on, each with a different problem (e.g. APIs, grabbing data from the DOM, bypassing rate limiting, etc.).

    1. 2

      That sounds super cool, have you started building this? If not, we should!

      1. 1

        Haha, I don't know that I have the time to invest in it at the moment, unfortunately.

    2. 2

      This comment was deleted 3 years ago.

      1. 1

        Nice, this is a cool resource. Would be fun to pool datasets too.

  7. 1

    I’d say spend a lot of time on the models of the data you need and build back from that; like really good, really thorough models. Early on, there was stuff I didn’t realize I needed until later, and then it was a pain to go back and integrate that into the process. It’s not important to learn more than one JavaScript rendering platform. For example, it’s usually not necessary to learn all of Selenium, Puppeteer, and Playwright; one of them is enough.

    1. 1

      Yep, I think you're right. Knowing how to organise everything is really the meta skill; everything else is just implementation.
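
One way to "build back from the model" is to write the target record down as a typed structure before writing any scraping code. The sketch below is only an illustration: the `Listing` class, its fields, and the `parse_price` helper are hypothetical, not taken from any real site or from the commenter's own setup.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Listing:
    """Hypothetical target record; the fields are illustrative only."""

    source_url: str
    title: str
    price_cents: Optional[int]  # None when the page shows no price
    currency: str
    scraped_at: datetime  # recorded from day one so stale rows can be found later

    def __post_init__(self):
        # Reject bad records at construction time instead of after they are stored.
        if not self.source_url.startswith(("http://", "https://")):
            raise ValueError(f"source_url must be absolute: {self.source_url!r}")
        if self.price_cents is not None and self.price_cents < 0:
            raise ValueError("price_cents must not be negative")


def parse_price(raw: str) -> Optional[int]:
    """Convert text such as '$1,299.50' to integer cents; return None if it has no digits."""
    cleaned = "".join(ch for ch in raw if ch.isdigit() or ch == ".")
    if not any(ch.isdigit() for ch in cleaned):
        return None
    return round(float(cleaned) * 100)


item = Listing(
    source_url="https://example.com/item/42",
    title="Widget",
    price_cents=parse_price("$1,299.50"),
    currency="USD",
    scraped_at=datetime.now(timezone.utc),
)
print(asdict(item))
```

Fields such as `source_url` and `scraped_at` are the kind of thing that is cheap to capture up front and painful to backfill later, which is the point the comment makes.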

  8. 1

    This comment was deleted 3 years ago.

    1. 1

      "If it's repetitive, try to be organized, use a database. Don't stack files."

      Oh boy this hit me hard hahaha
