28 Comments

Web Scraping With GPT

My co-founders and I are building a solution that fully automates web scraping using LLMs like GPT-4. Any feedback is appreciated.

submitted this link on April 23, 2023
  1. 2

    Who is the target group for this?

  2. 2

    Definitely looking cool. I'll keep following you guys

  3. 1

    Hey, this is amazing. I tried a couple of sites that I was thinking of extracting some data from, and it works like a charm.

    I'm working on a job listings app, and it would really help if I could scrape LinkedIn job listings after logging in.

    1. 1

      Thanks! Unfortunately, we can't help with scraping LinkedIn after login.

  4. 1

    Hey bro! This is so good, but do you have plans to support accessing data behind a login?

    1. 1

      Yeah, we will. It's tricky to set up.

  5. 1

    Looking nice!

    I saw this paper and implementation yesterday; it might be of interest to you: https://github.com/HazyResearch/evaporate

    1. 1

      Yes! Very interesting. Need to take a closer look at it. We've built our own tech using the same fundamentals:

      1. Generate web scraping code with AI (cheaper, more error-prone)
      2. Directly extract data with AI (expensive, less error-prone)
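      To make the cost trade-off between the two options concrete, here is a minimal sketch. It is not our actual implementation: `CountingLLM` is a hypothetical stand-in for a real model client that only counts calls, and the `WRITE_PATTERN` / `EXTRACT` prompts and the sample pages are made up for illustration.

```python
import re
from typing import List


class CountingLLM:
    """Hypothetical stand-in for a real LLM client; it only counts calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        if prompt.startswith("WRITE_PATTERN"):
            # Pretend the model wrote a reusable extraction rule.
            return r"<h1>(.*?)</h1>"
        # Pretend the model read the page and returned the title itself.
        match = re.search(r"<h1>(.*?)</h1>", prompt)
        return match.group(1) if match else ""


def scrape_with_generated_code(llm: CountingLLM, pages: List[str]) -> List[str]:
    """Option 1: ask the model once for a rule, then reuse it on every page."""
    pattern = re.compile(llm("WRITE_PATTERN for the title of:\n" + pages[0]))
    results = []
    for page in pages:
        match = pattern.search(page)
        # If the site layout changes, the stale rule silently returns "".
        results.append(match.group(1) if match else "")
    return results


def scrape_with_direct_extraction(llm: CountingLLM, pages: List[str]) -> List[str]:
    """Option 2: send every page to the model, one call per page."""
    return [llm("EXTRACT the title from:\n" + page) for page in pages]


# Made-up sample pages for illustration.
PAGES = [f"<html><h1>Post {i}</h1></html>" for i in range(1, 4)]
```

      In this toy setup, option 1 makes one model call no matter how many pages there are, while option 2 makes one call per page. That is roughly where the "cheaper" versus "expensive" difference comes from.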
      1. 1

        Tavis, I've come across your post, and your idea sounds really good! However, I'm wondering if it's actually feasible to go for option 2. How can you send ~100k tokens to an LLM without losing context for tag identification?
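        For what it's worth, one common workaround (not necessarily what this product does) is to strip scripts, styles, and comments and then split the remaining markup into chunks that each fit a token budget. Below is a rough sketch under stated assumptions: `strip_noise`, `estimate_tokens`, and `chunk_html` are hypothetical helper names, and the "about 4 characters per token" estimate is only a crude heuristic, not a real tokenizer.

```python
import re
from typing import List


def strip_noise(html: str) -> str:
    """Drop <script>/<style> blocks and comments, then collapse whitespace."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    return re.sub(r"\s+", " ", html).strip()


def estimate_tokens(text: str) -> int:
    """Crude assumption: about 4 characters per token."""
    return max(1, len(text) // 4)


def chunk_html(html: str, max_tokens: int) -> List[str]:
    """Split cleaned HTML into chunks without cutting through a tag.

    A single tag or text run larger than max_tokens still becomes its own
    (oversized) chunk; this sketch does not split inside such a piece.
    """
    # Each piece is either one whole tag or one run of text between tags.
    pieces = re.findall(r"<[^>]+>|[^<]+", strip_noise(html))
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for piece in pieces:
        cost = estimate_tokens(piece)
        if current and used + cost > max_tokens:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(piece)
        used += cost
    if current:
        chunks.append("".join(current))
    return chunks
```

        Chunking like this keeps each request small, but a chunk no longer sees its ancestors in the DOM, so the concern about losing context for tag identification still applies unless some surrounding structure is passed along with each chunk.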

  6. 1

    I tried it with a web page that I visit every day, and it works great!
    It would be even better if scraping could be scheduled or integrated with other products such as Notion. It would be awesome if I could use it like Zapier, so I could set up a scraper to collect news daily and save it to my Notion database.

    1. 1

      Cool! What are you trying to scrape?

      1. 1

        I tried a tech news webpage. It is a kind of Hacker News for Korean readers. Although scraping took some time, it worked well!

        1. 1

          Nice. Yeah, we are building something that will help you.

          Join our self-serve waitlist to be notified when it's ready:
          https://www.kadoa.com/signup/self-serve

  7. 1

    It looks awesome.
    There are so many requirements for scraping tools.
    It can be really helpful for data scientists, business analysts, etc.
    I am just curious how you bypass the bot detection on some websites.

    Anyway, this idea is really cool

    1. 1

      Thanks! We are using the latest anti-scraping bypass techniques.

  8. 1

    Cool! Could GPT figure out the type of webpage instead of asking the user to pick it? 🤔

    1. 1

      Can you provide an example?

      1. 1

        Currently there are way too many input fields on the page. Ideally it should be one field, the URL. AI should do everything, then let the user fine-tune it.

  9. 1

    Is this using scrapeghost under the hood or something custom coded?

    1. 1

      Custom. This generates scrapers. Scrapeghost uses GPT to scrape directly.

  10. 1

    I am wondering which API you are using to make this possible.

    1. 1

      GPT in combination with other traditional web scraping APIs.

  11. 1

    I kept getting this error

    "An unexpected error has occurred. Please try again or contact us at [email protected] for assistance. While our playground may not work with all websites, we are here to provide the necessary support."

    The concept is cool, but it's a bit slow.

    1. 1

      I'm also getting this error. I tested using a job site on Lever.

    2. 1

      What URL are you trying to scrape? I'll take a look.

  12. 0

    Some steps that you can follow to perform web scraping:

    Install necessary libraries: You can use Python libraries such as BeautifulSoup, Scrapy, or Selenium to scrape the web. You can install these libraries using pip or conda.

    Identify the website to scrape: Choose a website that you want to scrape and identify the data you want to extract. It's important to keep in mind that some websites may have legal restrictions on web scraping, so make sure you're complying with any relevant laws.

    Write your scraper: Use your preferred library to create a scraper that extracts the data you need from the website. Depending on the library you're using, you may need to write some code to navigate the website, locate the data you need, and extract it.

    Preprocess your data: Once you've extracted the data, you may need to preprocess it to clean it up or convert it to a different format. This could involve removing duplicates, filtering out irrelevant information, or converting the data to a different file format.

    Analyze your data: You can use GPT-4 to analyze your scraped data by training a model on it or using GPT-4's built-in natural language processing (NLP) capabilities to extract insights. You could also use GPT-4 to generate new content based on the data you've scraped.

    Overall, web scraping with GPT-4 involves using Python libraries to extract data from websites, preprocess the data, and analyze it using GPT-4's NLP capabilities. However, it's important to keep in mind that web scraping may have legal and ethical implications, so make sure you're following best practices and complying with any relevant laws.
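    The scrape and preprocess steps above can be sketched roughly as follows. To stay self-contained, this uses Python's standard-library `html.parser` in place of BeautifulSoup, Scrapy, or Selenium, and it parses an inline sample string instead of fetching a live site. `SAMPLE_HTML`, `LinkExtractor`, `scrape`, and `preprocess` are made-up names for illustration.

```python
import json
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Made-up sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li><a href="/a">First story</a></li>
  <li><a href="/b">  Second story </a></li>
  <li><a href="/a">First story</a></li>
</ul>
"""


class LinkExtractor(HTMLParser):
    """Collect the href and text of every <a> element that has an href."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[Dict[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append({"url": self._href, "title": "".join(self._text)})
            self._href = None


def scrape(html: str) -> List[Dict[str, str]]:
    """Write your scraper: pull raw rows out of the markup."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def preprocess(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Preprocess: normalize whitespace and drop rows with a duplicate URL."""
    seen = set()
    cleaned = []
    for row in rows:
        item = {"url": row["url"], "title": " ".join(row["title"].split())}
        if item["url"] not in seen:
            seen.add(item["url"])
            cleaned.append(item)
    return cleaned


records = preprocess(scrape(SAMPLE_HTML))
print(json.dumps(records, indent=2))  # convert to a different format (JSON)
```

    The analysis step is left out here because it depends on whichever model API is used.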

    1. 1

      I’ve asked ChatGPT enough questions about web scraping to know for a fact this is GPT-generated.
