Web Scraping With GPT

Web Scraping with GPT-4 | Kadoa | AI Web Scraper

Generate Web Scrapers For Any Website With GPT

kadoa.com

My co-founders and I are building a solution that fully automates web scraping using LLMs like GPT-4. Any feedback is appreciated.

Tavis Lochhead

submitted this link on April 23, 2023

2

Who is the target group for this ?

snehalm

·
3 years ago
2

Definitely looking cool. I'll keep following you guys

JanSch

·
3 years ago
1

Hey, this is amazing. I tried a couple of sites that I was thinking of extracting some data from, and it works like a charm.

I'm working on a job listings app, and it would really help if I could scrape LinkedIn job listings after logging in.

contact_brilliant

·
3 years ago
1. 1
  
  Thanks! Unfortunately, we can't help with scraping LinkedIn after login.
  
  tavis
  
  ·
  3 years ago
1

Hey bro! This is so good, but do you have plans to support accessing data behind a login?

LouTromans

·
3 years ago
1. 1
  
  Yeah, we will. It's tricky to set up.
  
  tavis
  
  ·
  3 years ago
1

Looking nice!

I saw this paper and implementation yesterday; it might be of interest to you: https://github.com/HazyResearch/evaporate

mmaia

·
3 years ago
1. 1
  Yes! Very interesting. Need to take a closer look at it. We've built our own tech using the same fundamentals:
  
  1. Generate web scraping code with AI (cheaper, more error-prone)
  2. Directly extract data with AI (expensive, less error-prone)
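
A rough sketch of how these two approaches differ in cost. This is illustrative only and not Kadoa's implementation: the `LLM` callable, the prompts, and the function names are all hypothetical, and a regex stands in for the generated scraper to keep the example short.

```python
import json
import re
from typing import Callable, List

# Hypothetical interface: any function that takes a prompt and returns text.
LLM = Callable[[str], str]


def generate_scraper(llm: LLM, sample_html: str) -> "re.Pattern[str]":
    """Approach 1: ask the model once for an extraction rule, then reuse it."""
    prompt = (
        "Return only a Python regex with one group capturing each item title:\n"
        + sample_html
    )
    return re.compile(llm(prompt))


def run_scraper(pattern: "re.Pattern[str]", html: str) -> List[str]:
    # No model call here, so extra pages cost no tokens. The rule can
    # silently break when the page layout changes (more error-prone).
    return pattern.findall(html)


def extract_directly(llm: LLM, html: str) -> List[str]:
    """Approach 2: send every page to the model and parse its JSON answer."""
    prompt = "Return a JSON array of the item titles on this page:\n" + html
    return json.loads(llm(prompt))
```

With the first approach the model is called once per site; with the second it is called once per page, which is where the extra cost comes from.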
  tavis
  
  ·
  3 years ago
  1. 1
    
    Tavis, I came across your post, and your idea sounds really good! However, I'm wondering if it's actually feasible to go for option 2. How can you send ~100k tokens to an LLM without losing context for tag identification?
    
    titockmente
    
    ·
    3 years ago
1

I tried it with a web page that I visit every day and it works great!
It would be great if scraping could be scheduled or integrated with other products such as Notion. It would be awesome if I could use it like Zapier, so I could make a scraper that scrapes news daily and saves it to my Notion database.

Jace

·
3 years ago
1. 1
  
  Cool! What are you trying to scrape?
  
  tavis
  
  ·
  3 years ago
  1. 1
    
    I tried a tech news webpage. It's kind of a Korean Hacker News. Although scraping took some time, it worked well!
    
    Jace
    
    ·
    3 years ago
    1. 1
      
      Nice. Yeah, we are building something that will help you.
      
      Join our self-serve waitlist to be notified when it's ready:
      https://www.kadoa.com/signup/self-serve
      
      tavis
      
      ·
      3 years ago
1

It looks awesome.
There are so many requirements for scraping tools.
It can be really helpful for data scientists, business analysts, etc.
I'm just curious how you bypass bot detection on some websites.

Anyway, this idea is really cool.

bluesky0724

·
3 years ago
1. 1
  
  Thanks! We are using the latest in anti-scraping bypassing.
  
  tavis
  
  ·
  3 years ago
1

Cool! Could GPT figure out the type of webpage instead of asking the user to pick it? 🤔

aaddrriiaann

·
3 years ago
1. 1
  
  Can you provide an example?
  
  tavis
  
  ·
  3 years ago
  1. 1
    
    Currently there are way too many input fields on the page. Ideally there should be one field, the URL. The AI should do everything, then let the user fine-tune it.
    
    aaddrriiaann
    
    ·
    3 years ago
1

Is this using scrapeghost under the hood or something custom coded?

timbowhite

·
3 years ago
1. 1
  
  Custom. This generates scrapers. Scrapeghost uses GPT to scrape directly.
  
  tavis
  
  ·
  3 years ago
1

I am wondering which API you are using to make this possible.

staticmaker2022

·
3 years ago
1. 1
  
  GPT in combination with other traditional web scraping APIs.
  
  tavis
  
  ·
  3 years ago
1

I kept getting this error

"An unexpected error has occurred. Please try again or contact us at [email protected] for assistance. While our playground may not work with all websites, we are here to provide the necessary support."

The concept is cool, but it's a bit slow.

Amaning

·
3 years ago
1. 1
  
  Same error
  
  mmaia
  
  ·
  3 years ago
2. 1
  
  I'm also getting this error. I tested using a job site on Lever.
  
  markyi
  
  ·
  3 years ago
3. 1
  
  What URL are you trying to scrape? I'll take a look.
  
  tavis
  
  ·
  3 years ago
0

Some steps that you can follow to perform web scraping:

Install necessary libraries: You can use Python libraries such as BeautifulSoup, Scrapy, or Selenium to scrape the web. You can install these libraries using pip or conda.

Identify the website to scrape: Choose a website that you want to scrape and identify the data you want to extract. It's important to keep in mind that some websites may have legal restrictions on web scraping, so make sure you're complying with any relevant laws.

Write your scraper: Use your preferred library to create a scraper that extracts the data you need from the website. Depending on the library you're using, you may need to write some code to navigate the website, locate the data you need, and extract it.

Preprocess your data: Once you've extracted the data, you may need to preprocess it to clean it up or convert it to a different format. This could involve removing duplicates, filtering out irrelevant information, or converting the data to a different file format.

Analyze your data: You can use GPT-4 to analyze your scraped data by training a model on it or using GPT-4's built-in natural language processing (NLP) capabilities to extract insights. You could also use GPT-4 to generate new content based on the data you've scraped.

Overall, web scraping with GPT-4 involves using Python libraries to extract data from websites, preprocess the data, and analyze it using GPT-4's NLP capabilities. However, it's important to keep in mind that web scraping may have legal and ethical implications, so make sure you're following best practices and complying with any relevant laws.
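
The extract and preprocess steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: it uses the standard library's `html.parser` instead of BeautifulSoup, Scrapy, or Selenium so that it runs without installing anything, and the sample HTML, the class name, and the choice of "links inside `<h2>` headings" as the target data are all made up for the example.

```python
import json
from html.parser import HTMLParser
from typing import List


class HeadlineParser(HTMLParser):
    """Collects the text of every <a> tag that sits inside an <h2>."""

    def __init__(self) -> None:
        super().__init__()
        self._in_h2 = False
        self._in_link = False
        self.titles: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            self._in_link = True
            self.titles.append("")  # start a new title

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False
            self._in_link = False
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self.titles[-1] += data


def extract_titles(html: str) -> List[str]:
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.titles


def preprocess(titles: List[str]) -> List[str]:
    """Normalize whitespace, drop empty strings, remove duplicates in order."""
    seen = set()
    cleaned = []
    for title in titles:
        title = " ".join(title.split())
        if title and title not in seen:
            seen.add(title)
            cleaned.append(title)
    return cleaned


# Made-up sample page; a real scraper would download the HTML first.
SAMPLE_HTML = """
<h2><a href="/a">First  post</a></h2>
<h2><a href="/b">Second post</a></h2>
<h2><a href="/a">First post</a></h2>
<p><a href="/x">Not a headline</a></p>
"""

titles = preprocess(extract_titles(SAMPLE_HTML))
print(json.dumps(titles))
```

The cleaned list is plain JSON, so it could be saved to a file or passed to a language model as part of a prompt for the analysis step.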

YoGet9ouere

·
3 years ago
1. 1
  
  I’ve asked ChatGPT enough questions about web scraping to know for a fact this is GPT-generated.
  
  tavis
  
  ·
  3 years ago