
ScrapeMagic: Scrape news articles & sites just by asking questions using GPT-3

Hey Folks,

Just finished a little side project I thought I'd share here: https://www.scrapemagic.xyz/

It's an API & Chrome Extension that lets you scrape by just asking questions, or describing what you want from a given page. The A.I. (GPT-3) has a deep understanding of the content on the page you're scraping, so it can pull structured data out of unstructured content like a news article.

For example, you can go to an article about fundraising and ask for:
"What company received funding"
"What was the amount of funding it received"
"Who were the main investors in this round"
"Where is the company headquartered"

I've been combining it with RSS feeds to create structured data feeds out of unstructured data.
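
To make the question-driven idea concrete, here is a minimal sketch of how the example questions above could be packed into one prompt and how numbered answers could be mapped back to fields. The prompt wording and the helper names `build_prompt` and `parse_answers` are assumptions for illustration, not ScrapeMagic's actual implementation.

```python
# Hypothetical helpers: the prompt format and function names are illustrative only.
QUESTIONS = [
    "What company received funding",
    "What was the amount of funding it received",
    "Who were the main investors in this round",
    "Where is the company headquartered",
]


def build_prompt(article_text, questions):
    """Combine the article text and numbered questions into one prompt string."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        "Answer each question using only the article below.\n"
        "Reply with one numbered answer per line.\n\n"
        f"Article:\n{article_text}\n\n"
        f"Questions:\n{numbered}\n\nAnswers:\n"
    )


def parse_answers(completion, questions):
    """Map numbered answer lines (e.g. '1. Acme') back to their questions."""
    answers = {}
    for line in completion.splitlines():
        number, sep, text = line.strip().partition(".")
        # Keep only lines that start with a valid question number.
        if sep and number.isdigit() and 1 <= int(number) <= len(questions):
            answers[questions[int(number) - 1]] = text.strip()
    return answers
```

The dictionary returned by `parse_answers` is the kind of structured record that could be appended to a feed, one row per article.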

The other week I attended an OpenAI event and realized you could probably extract this type of data. I built out a proof-of-concept API & Chrome Extension and wanted to share it with the community to see if anyone who is hiring people to do this type of data extraction wants to give the service a try.

Trying to share early and often as this is just a side-project I thought would be fun to get out there :) Also generally trying to invite more Internet serendipity!

on June 6, 2022
  1. 2

    Amazing work! We used a similar technique as a proof of concept at one of my jobs! I imagine it's something like Beautiful Soup -> GPT-3 remove HTML tags -> ask question on clean text -> get answer?

    1. 2

      That's awesome that y'all came across a similar technique - how did it work out? Were you parsing websites?

      As far as the approach I took, what you outlined is roughly right!

      1. 1

        It works perfectly for our use case! We mostly focus on websites that have a consistent set of information but where we don't know where in the text the answer is. So we just do what I outlined and then since we know most of the information is there we can reliably extract it. GPT-3 is wild, crazy useful to make smarter scrapers!
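
The "strip the HTML, then ask on clean text" step from this thread might look roughly like the sketch below. It swaps in Python's standard-library `html.parser` for Beautiful Soup so it runs without extra installs; the class name `TextExtractor` and the list of block tags are assumptions, and the cleaned text would then go into whatever prompt you send to GPT-3.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    # Assumed set of block-level tags that should act as word separators.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the page's visible text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

On consistent sites like the ones described above, the clean text plus a fixed set of questions is usually all the prompt needs.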

  2. 1

    this is so cool!! i'm thinking of building a feature for parsing recipes from websites - any advice on setting this up? particularly interested in how you deal with long pages given the GPT API limits

  3. 1

    How do you deal with long pages (since GPT is limited to 1000 tokens)?
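
One common way to handle the long-page question is to split the cleaned text into chunks that each fit the token budget, ask the same question on every chunk, and then combine the answers. The sketch below is not necessarily how ScrapeMagic does it. It assumes the 1000-token figure mentioned in the comment and a rough rule of thumb of about 0.75 words per token; a real tokenizer would give exact counts.

```python
def chunk_text(text, max_tokens=1000, words_per_token=0.75):
    """Split text into chunks of roughly max_tokens each.

    The token count is approximated from the word count, so leave
    headroom for the question and the model's answer.
    """
    max_words = max(1, int(max_tokens * words_per_token))
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

For something like recipes, where the ingredients and steps may land in different chunks, you would likely ask per chunk and merge the non-empty answers afterwards.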

  4. 1

    This looks super cool! Congrats!

    Question: when asking the AI to get back a response of "URL of the company that raised funding" why is it responding with the URL of the article?

    Suggestion: add a download button for all results when the data is loaded.

    Again, really nice idea!

    1. 2

      GPT-3 is not always accurate so sometimes it doesn't know what to answer and just answers with something that could make sense but isn't necessarily right.

    2. 1

      Thanks for the kind words @aldison!

      Appreciate the suggestions as well - will add in a way to download the data & send it to other tools from the Extension.

      Have any use cases you'd like to give it a spin for?

      As Dax42 mentioned, GPT-3 isn't always 100% accurate, so it does its best - asking the "right" questions, or prompt engineering as it's called, ends up becoming its own art :)

      1. 1

        Have you tried turning the temperature way down?
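
For anyone wondering what that looks like in practice: temperature is a sampling parameter on the completion request, and setting it to 0 makes the output as deterministic as the model allows, which tends to suit extraction. Here is a minimal sketch of the request body only (no network call). The default model name and `max_tokens` value are assumptions, and `build_completion_request` is a hypothetical helper.

```python
import json


def build_completion_request(prompt, model="text-davinci-002",
                             temperature=0.0, max_tokens=64):
    """Build a JSON-serializable body for a text completion request.

    The model name and max_tokens defaults are illustrative assumptions.
    """
    return {
        "model": model,
        "prompt": prompt,
        "temperature": temperature,  # 0 = least random sampling
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    body = build_completion_request("What company received funding?")
    print(json.dumps(body, indent=2))
```

A low temperature reduces variation between runs, but it won't stop the model from giving a confident wrong answer, so spot-checking results is still worthwhile.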
