What Project Are You Building With Web Scraping?

I've seen lots of cool projects being built on IH that revolve around web scraping; a lot of them are aggregators of some sort. I've begun looking at building a job board that uses web scraping to collect job postings from other sites and repackage them.

Another great project I've seen is from a ScrapeDiary community member who built this Udemy course enroller bot:

https://github.com/aapatre/Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE

What cool projects have you seen, or are you working on, at the moment?

Bryce Davies posted to Webscraping on October 21, 2020


4

https://syften.com

I scrape various forums. Some have APIs, but some need scraping.

akfaew

·
5 years ago
1. 1
  
  Nice! I imagine you'd be running some pretty heavy scraping jobs then! What are you using for it?
  
  brycedavies
  
  ·
  5 years ago
3

I kind of let the site fall apart but I built http://www.themefolio.com/ a while back. It aggregates over 18,000 Shopify stores and groups them by themes and provides links to the store, a screenshot of the store, and a link to buy the theme.

Harrjm

·
5 years ago
1. 1
  
  That's really cool - Shopify doesn't do a great job of creating a directory that allows you to browse them all. I thought the Shop app would help with this, but it didn't.
  
  xtyc
  
  ·
  5 years ago
3

https://browseai.com/ is a cool one by an indie hacker

Keuxdi

·
5 years ago
1. 1
  
  @Keuxdi thanks for the mention!
  
  @brycedavies the job posting scraping use case for job boards is actually one of our main example use cases that we mention in our pitches! Let me know if you want to hear how we can help you with your project.
  
  ardalan
  
  ·
  5 years ago
  1. 1
    
    Hey @ardalan, I’m also working on a job board and looking into scraping tools. Can I find out more about what browseai can do? I saw that the LinkedIn template isn’t launched quite yet. Would love to hear more.
    
    typeofgraphic
    
    ·
    5 years ago
    1. 1
      
      Hey Paul, could you email me more info about your use case? I may be able to give you early access soon depending on what you need. My email is ardy@b...
      
      ardalan
      
      ·
      5 years ago
      1. 1
        
        What is your email? I would like to get in touch.
        
        Noa
        
        ·
        5 years ago
        
        1
        
        Hi Noa. You can find my email here.
        
        ardalan
        
        ·
        5 years ago
        
        1
        
        don't see it. you can ping my mail: [email protected]
        
        Noa
        
        ·
        5 years ago
  2. 1
    
    Love all the examples that you include to help demonstrate the value it can provide. Great to see another Canadian here as well!
    
    xtyc
    
    ·
    5 years ago
    1. 1
      
      Thanks mate!
      
      ardalan
      
      ·
      5 years ago
  3. 1
    
    Hell yeah, let's chat!
    
    brycedavies
    
    ·
    5 years ago
    1. 1
      
      hit me up! ardy@b...
      
      ardalan
      
      ·
      5 years ago
3

www.activeforks.net
Since there is no API to get the data I need, I decided to scrape it. I created a newsletter around it to notify people about new interesting repositories/forks.

floatas

·
5 years ago
1. 1
  
  Oh, that's really cool, so this scrapes GitHub?
  
  brycedavies
  
  ·
  5 years ago
  1. 1
    
    Sadly it does 🙈 The GitHub API has rate limits and there is no way to increase them.
    
    floatas
    
    ·
    5 years ago
2

I built Dropl.io - We scrape content from Archive.org for expired domains and domains pending delete from the registry. All content, i.e. homepages and blog posts, is indexed on Dropl and searchable.

In simple terms, I built a search engine for expired domains and expired articles. :)

develanet

·
5 years ago
2

I’m working on remotely.gg. It is a remote tech job board.
It currently scrapes Indeed and GitHub.

cyclone

·
5 years ago
2

I make a weekly/monthly data analysis newsletter about the Kindle Store for self-publishers. It's quite specialised, but the people who need it, really need it.

Uber-short technical description: scraping the Kindle Store is done in Python using the scrapy library, and the data is extracted to JSON format and aggregated. Text analysis is done using spacy, and image analysis using imagehash. Plots are done using seaborn.

The actual interesting code is mostly Python. It runs either at Scrapinghub (for the web-scraping part) or as Google Cloud Functions (which are basically free if you're not doing huge amounts of work). Once I produce a data-driven report in HTML format, I push it through to MailerLite, where the subscriber lists are, and post it on the website. The site itself isn't the 'product' - it's just the place where you go to sign up, choose the newsletters you want sent, and download back issues. Most users will hardly ever go there, and that's fine.

More generally, this is a clunky-but-it-works framework for 'data-driven subscription newsletter generated by scraping stuff or other cloud analysis'. I'm quite keen to branch out and adapt it to other areas.
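
As a rough illustration of the "extract to JSON and aggregate" step described above, here is a minimal standard-library sketch. It is not the actual newsletter code: the field names (`category`, `price`) and the sample records are assumptions made up for the example, and a real pipeline would read the JSON Lines file that the scraper exports.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical scraped records, one JSON object per line (JSON Lines).
SAMPLE_JSONL = """\
{"title": "Book A", "category": "Thriller", "price": 2.99}
{"title": "Book B", "category": "Thriller", "price": 4.99}
{"title": "Book C", "category": "Romance", "price": 0.99}
"""


def aggregate_by_category(jsonl_text):
    """Return {category: {"count": n, "avg_price": x}} for the records."""
    prices = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        prices[record["category"]].append(record["price"])
    return {
        category: {"count": len(values), "avg_price": round(mean(values), 2)}
        for category, values in prices.items()
    }


if __name__ == "__main__":
    for category, stats in aggregate_by_category(SAMPLE_JSONL).items():
        print(category, stats)
```

The aggregated dictionary is the kind of summary that could then be fed into a plotting or report-templating step.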

nosecroquet

·
5 years ago
1. 1
  
  This is such a great use case for web scraping, love it!
  
  brycedavies
  
  ·
  5 years ago
2

I tried to build a website a few years back that ranked books by Twitter mentions by scraping Twitter & Amazon. It was very naïve: I used the Twitter streaming API, where I filtered on Amazon links, and the Amazon advertising API to check whether each link was a book or not.

I think it was a pretty cool idea, and I might get back to it, but I'm not sure if it's viable
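
The counting step of an idea like this could be sketched roughly as follows. This is not the original code: it assumes the tweet URLs have already been collected in expanded form (the sample URLs and ASINs used in the usage example are made up), and it only relies on the common Amazon URL shapes `/dp/<ASIN>` and `/gp/product/<ASIN>`. Checking whether an ASIN is actually a book would still need a separate lookup.

```python
import re
from collections import Counter

# Amazon product links usually carry a 10-character ASIN after
# /dp/ or /gp/product/, optionally preceded by a title slug.
ASIN_PATTERN = re.compile(
    r"amazon\.[a-z.]+/(?:.*/)?(?:dp|gp/product)/([A-Z0-9]{10})"
)


def extract_asin(url):
    """Return the ASIN found in an Amazon product URL, or None."""
    match = ASIN_PATTERN.search(url)
    return match.group(1) if match else None


def rank_products(expanded_urls):
    """Count mentions per ASIN and return (asin, count) pairs, most mentioned first."""
    counts = Counter()
    for url in expanded_urls:
        asin = extract_asin(url)
        if asin:
            counts[asin] += 1
    return counts.most_common()


if __name__ == "__main__":
    urls = [
        "https://www.amazon.com/dp/B000000001",
        "https://www.amazon.com/Some-Title/dp/B000000001/ref=x",
        "https://www.amazon.co.uk/gp/product/0000000002",
        "https://example.com/dp/B000000001",  # not Amazon, ignored
    ]
    print(rank_products(urls))
```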

borge

·
5 years ago
1. 1
  
  I wanted to do something similar. 😎 Viable in what way?
  
  goodpointdustin
  
  ·
  5 years ago
  1. 2
    
    The main issue was dealing with spam. Suddenly 2873 accounts would RT the same tweet within 1 second.
    
    EDIT: Also, the categories were very random, so perhaps niche it down.
    
    borge
    
    ·
    5 years ago
    1. 1
      
      Wow, dealing with spam like that can really drain the fun from a project.
      
      I would consider doing something besides books. There are so many lists that already exist and most people are probably satisfied with the NY Times’ bestseller list.
      
      Were url shorteners an issue? Were you planning to monetize it?
      
      goodpointdustin
      
      ·
      5 years ago
      1. 1
        
        Yeah, I just didn't want to deal with it, so I quit the project.
        
        URL shorteners weren't an issue, as Twitter's API could give me "expanded urls", and yes, I planned to monetize it using Amazon affiliate links :)
        
        borge
        
        ·
        5 years ago
        
        1
        
        Nice. I can see the appeal in that. Could work out well.
        
        Let me know if you ever start back up on this project.
        
        PS - I lived in Eidsvoll for a few months.
        
        goodpointdustin
        
        ·
        5 years ago
1

I recently built Maker News, which relies on web scraping since there's no Indie Hackers API.

jakobgreenfeld

·
5 years ago
1

I built https://earlyname.com which checks if your username is available on new sites - has a bit of web scraping magic involved using Puppeteer. I think webscraping to aggregate data is really powerful and underused as a SaaS business model.

tinyprojects

·
5 years ago
1. 1
  
  100%, I spend a lot of my time trying to raise awareness that the tooling exists and it's so accessible now.
  
  brycedavies
  
  ·
  5 years ago
1

https://quickapply.io/ - we built a platform that allows students to apply to hundreds of internships with one form and a single click. We will have a web scraper to scrape Glassdoor/Lever/LinkedIn jobs soon. For automation, we used Selenium WebDriver.

alexthehacker1

·
5 years ago
1

Nothing big, but I created my own R script to webscrape job postings for specific job titles :) Web scraping is awesome! You can do lots of stuff and automate mundane tasks

liv_chua

·
5 years ago
1

A few years ago I created a website which scraped news sites and monitored changes in the articles. It was quite interesting to see how they rephrased and reframed articles even months after publishing. I stopped the project eventually because of legal issues and because the server costs were quite high for a non-profit project.
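
The change-monitoring part of a project like this can be sketched with the standard library alone. This is an assumption about one way to do it, not a description of the original site: the snapshot texts in the usage example are made up, and a real setup would store each fetched article version and compare it with the previous one.

```python
import difflib
import hashlib


def fingerprint(text):
    """Stable hash used to check cheaply whether an article changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def describe_changes(old_text, new_text):
    """Return unified-diff lines between two snapshots, or [] if identical."""
    if fingerprint(old_text) == fingerprint(new_text):
        return []
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical snapshots of the same article taken at different times.
    before = "Mayor denies claims\nMore soon"
    after = "Mayor admits claims\nMore soon"
    for line in describe_changes(before, after):
        print(line)
```

Storing only the fingerprint for unchanged versions and the full text when it differs is one way to keep storage costs down.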

mkleimann

·
5 years ago
1

I love Tatoeba, but they don't provide an API, so I'm scraping it to get sentences for my kanji learning app. I also had a very simple Android app for Tatoeba made with React Native, because the website is not very good to use on mobile browsers, but I abandoned the project. I hope to get back to it soon.

opauloantonio

·
5 years ago
1

Hey @brycedavies,

I have built scrapers to populate our Professional Organizer Directory.

https://stor.guru/organizer_directory

We have over 4K entries that I scraped from various sites, plus a secondary scraper that goes to each Professional Organizer's site to get more information like social media links, emails, or phone numbers.
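
The second pass, pulling contact details out of an organizer's own page, might look roughly like this. It is a sketch under assumptions, not the Scrapy setup itself: it uses only the standard library, the list of social domains and the sample HTML are made up, and the email pattern is deliberately simple.

```python
import re
from html.parser import HTMLParser

# Simple, permissive email pattern; good enough for a sketch.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical list of domains treated as social media links.
SOCIAL_DOMAINS = ("facebook.com", "instagram.com", "twitter.com", "linkedin.com")


class ContactParser(HTMLParser):
    """Collect emails and social links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.social_links = []
        self.emails = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith("mailto:"):
            # Drop the scheme and any ?subject=... suffix.
            self.emails.add(href[len("mailto:"):].split("?")[0])
        elif any(domain in href for domain in SOCIAL_DOMAINS):
            self.social_links.append(href)

    def handle_data(self, data):
        # Also catch addresses written as plain text.
        self.emails.update(EMAIL_PATTERN.findall(data))


def extract_contacts(html):
    parser = ContactParser()
    parser.feed(html)
    parser.close()
    return {"emails": sorted(parser.emails), "social_links": parser.social_links}


if __name__ == "__main__":
    page = (
        '<p>Email: jane@example.com</p>'
        '<a href="mailto:info@example.com?subject=Hi">mail</a>'
        '<a href="https://instagram.com/tidyjane">IG</a>'
    )
    print(extract_contacts(page))
```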

I use a python scrapy setup that works well. I would also like to implement this for Self Storage Units. Our product Stor.Guru is a personal home inventory system that allows people to organize their things with each other in real time making it easier to keep track of your things!

zinglax

·
5 years ago
1

I have built many scrapers over time: some to collect data from Twitter, one to crawl the Chrome Web Store daily for new extensions, and one to collect headlines from a Finnish newspaper. All of these are personal interest topics of sorts, intended for some personal project that never really materialized. The newspaper thing is very niche (because of the language), and I have also put that data on GitHub for everyone to use.

neea

·
5 years ago
1

Boostlane.com loves Web scraping. That's how we curate the news and organize it into topics such as https://boostlane.com/h/entrepreneur. It's all done by bots.

technopreneur

·
5 years ago
1

Blog/newsletters aggregator https://whatnot.ai

slavaGanzin

·
5 years ago
1

I'm building https://soundartlist.com/ and currently working on an open source scraping module!

Steph_

·
5 years ago
1

https://www.seoly.app/ - we're building an affordable SEO tool for small businesses and agencies :)

cosoare

·
5 years ago
1. 1
  
  Sounds good! Under $100 a month and you're definitely saving money there
  
  brycedavies
  
  ·
  5 years ago
1

Scraping Google and YouTube on www.regirank.com to find the ranking position on the search engine results page (SERP) for the videos added on the platform.

MMike

·
5 years ago
1. 1
  
  Try scrapingdog.com to scrape Google.
  
  eventezycom
  
  ·
  5 years ago
2. 1
  
  What proxy service are you using to avoid getting blocked?
  
  aminmemon
  
  ·
  5 years ago
  1. 1
    
    I use rotating proxies. Most of them are blocked, but that isn't such a big problem when you get a new IP on every request. You just need more processor power.
    
    MMike
    
    ·
    5 years ago
3. 1
  
  Nice, SERPs are hard! You need good proxies for them, hey.
  
  brycedavies
  
  ·
  5 years ago
  1. 1
    
    Yeah, especially Google.
    
    MMike
    
    ·
    5 years ago
1

https://listt.xyz/how-i-built-this
https://listt.xyz

Built this in an hour with search, filters, and other features too. Of course, this is built using my own SaaS product though.

upenv

·
5 years ago
1. 1
  
  Nice! So you've scraped a bunch of different sources and made them available via APIs?
  
  brycedavies
  
  ·
  5 years ago
1

I would like to aggregate ads about free haircuts from Craigslist and Facebook group posts. But I'm wondering whether I can do it within the content rules of those websites, and how to do it with no code.

koffi

·
5 years ago
1. 1
  
  Plenty of good no code tools around! Also who doesn't love a free haircut? There is a reference to automatio.co above and you could also look at datagrab.io.
  
  Many more good examples as well depending on what you need!
  
  brycedavies
  
  ·
  5 years ago
  1. 1
    
    Can you check it and tell me what you think? https://hopenbeauty.glideapp.io/
    
    koffi
    
    ·
    5 years ago
  2. 1
    
    Thanks, but it's not possible to scrape Facebook group posts with datagrab.io. I'm continuing to search.
    
    koffi
    
    ·
    5 years ago
0

I am building a SaaS using no-code tools and my own platform, Automatio.co. It will monitor most of the disposable email services, create a fresh DB of disposable domains, and generate an API from it so that services, businesses, and communities can use it to protect themselves from spamming, abuse, etc.

Will share more about it soon.

kinder

·
5 years ago
1. 2
  
  Great idea! Cheers, @kinder!
  
  livekth
  
  ·
  5 years ago
2. 2
  
  I'm very keen to learn more, definitely keep us in the loop!
  
  brycedavies
  
  ·
  5 years ago