Webscraping October 21, 2020

What Project Are You Building With Web Scraping?

Bryce Davies @brycedavies

I've seen lots of cool projects being built on IH that revolve around web scraping; a lot of them are aggregators of some sort. I've begun looking at building out a job board that uses web scraping to collect job postings from other sites and repackage them.

Another great project I've seen is from a ScrapeDiary community member who built this Udemy course enroller bot:

https://github.com/aapatre/Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE

What cool projects have you seen, or are you working on, at the moment?

  1. 3

    https://browseai.com/ is a cool one by an indie hacker

    1. 1

      @Keuxdi thanks for the mention!

      @brycedavies the job posting scraping use case for job boards is actually one of our main example use cases that we mention in our pitches! Let me know if you want to hear how we can help you with your project.

      1. 1

        Hey @ardalan, I’m also working on a job board and looking into scraping tools. Can I find out more about what browseai can do? I saw that the LinkedIn template isn’t launched quite yet. Would love to hear more.

        1. 1

          Hey Paul, could you email me more info about your use case? I may be able to give you early access soon depending on what you need. My email is [email protected]

          1. 1

            What is your email? I would like to engage.

            1. 1

              Hi Noa. You can find my email here.

              1. 1

                I don't see it. You can ping my email: [email protected]

      2. 1

        Love all the examples that you include to help demonstrate the value it can provide. Great to see another Canadian here as well!

      3. 1

        Hell yeah, let's chat!

  2. 3

    www.activeforks.net
    Since there is no API for the data I need, I decided to scrape it. I created a newsletter around it to notify subscribers about new interesting repositories/forks.

    1. 1

      Oh, that's really cool. So this scrapes GitHub?

      1. 1

        Sadly it does 🙈 The GitHub API has rate limits and there is no way to increase them.
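
For context on the rate limits mentioned here: GitHub's REST API reports the remaining quota in the `X-RateLimit-Remaining` response header and the reset time (a Unix timestamp) in `X-RateLimit-Reset`. A minimal sketch of backing off until the quota resets, assuming the headers are already available as a plain dict with exactly those key spellings; `seconds_until_allowed` is a hypothetical helper name, not part of any library, and this is not the commenter's code:

```python
import time


def seconds_until_allowed(headers, now=None):
    """Return how long to wait before the next GitHub API call.

    `headers` is a plain dict of response headers. Returns 0.0 when
    quota remains or when the rate-limit headers are missing.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    # No quota left: wait until the advertised reset time.
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```

A caller would pass the result to `time.sleep()` before retrying.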

  3. 2

    I make a weekly/monthly data analysis newsletter about the Kindle Store for self-publishers. It's quite specialised, but the people who need it, really need it.

    Uber-short technical description: scraping the Kindle Store is done in Python using the Scrapy library, and the data is extracted to JSON format and aggregated. Text analysis is done using spaCy, and image analysis using ImageHash. Plots are done using seaborn.

    The actual interesting code is mostly Python. It's running either at Scrapinghub (for the web-scraping part) or as Google Cloud Functions (which are basically free if you're not doing huge amounts of work). Once I produce a data-driven report in HTML format, I poke it through to MailerLite where the subscriber lists are, and post it on the website. The site itself isn't the 'product' - it's just the place where you go to sign up, choose the newsletters you want sent, and download back issues. Most users will hardly ever go there, and that's fine.

    More generally, this is a clunky-but-it-works framework for 'data-driven subscription newsletter generated by scraping stuff or other cloud analysis'. I'm quite keen to branch out and adapt it to other areas.
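
The "extract to JSON, then aggregate" step described above might look roughly like the following. This is only a sketch under assumptions: the input is JSON Lines text (one scraped item per line), and the field names `category` and `price` are made up for illustration; the commenter's actual schema and code are not shown in the thread.

```python
import json
from collections import defaultdict
from statistics import mean


def aggregate_by_category(jsonl_text):
    """Group scraped items by category and summarise each group."""
    prices = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        item = json.loads(line)
        prices[item["category"]].append(float(item["price"]))
    return {
        category: {"count": len(values), "mean_price": round(mean(values), 2)}
        for category, values in prices.items()
    }


# Hypothetical sample items standing in for scraper output.
sample = "\n".join([
    '{"title": "Book A", "category": "Thriller", "price": 2.99}',
    '{"title": "Book B", "category": "Thriller", "price": 4.99}',
    '{"title": "Book C", "category": "Romance", "price": 0.99}',
])
report = aggregate_by_category(sample)
```

The resulting dict is the kind of summary that could then be rendered into an HTML report.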

    1. 1

      This is such a great use case for web scraping, love it!

  4. 2

    I kind of let the site fall apart but I built http://www.themefolio.com/ a while back. It aggregates over 18,000 Shopify stores and groups them by themes and provides links to the store, a screenshot of the store, and a link to buy the theme.

    1. 1

      That's really cool - Shopify doesn't do a great job creating a directory that allows you to browse them all. I thought the Shop app would help with this, but it didn't.

  5. 2

    I tried to build a website a few years back ranking books by Twitter mentions by scraping Twitter & Amazon. It was very naïve: I used the Twitter streaming API, filtering on Amazon links, and the Amazon advertising API to check whether each link was a book or not.

    I think it was a pretty cool idea, and I might get back to it, but I'm not sure if it's viable.

    1. 1

      I wanted to do something similar. 😎 Viable in what way?

      1. 2

        Dealing with spam was the main issue. Suddenly 2873 accounts RT the same tweet in 1 second.

        EDIT: Also, the categories were very random, so perhaps I should niche it down.
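
One way to flag the kind of retweet burst described above is a sliding-window count over event timestamps. This is a minimal sketch, not what the commenter built; the function name `find_bursts` and the default threshold of 100 events per second are assumptions chosen for illustration.

```python
from collections import deque


def find_bursts(timestamps, window_seconds=1.0, threshold=100):
    """Return start times of windows holding at least `threshold` events.

    `timestamps` are retweet times in seconds for a single tweet.
    """
    window = deque()
    bursts = []
    in_burst = False
    for ts in sorted(timestamps):
        window.append(ts)
        # Drop events that fell out of the sliding window.
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) >= threshold:
            if not in_burst:
                bursts.append(window[0])  # record each burst once
                in_burst = True
        else:
            in_burst = False
    return bursts
```

Tweets that trigger a burst could be excluded from the ranking or down-weighted rather than counted at face value.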

        1. 1

          Wow, dealing with spam like that can really drain the fun from a project.

          I would consider doing something besides books. There are so many lists that already exist and most people are probably satisfied with the NY Times’ bestseller list.

          Were url shorteners an issue? Were you planning to monetize it?

          1. 1

            Yeah, I just didn't want to deal with it so I quit the project.

            URL shorteners weren't an issue as Twitter's API could give me "expanded urls", and yes, I planned to monetize it using Amazon affiliate links :)
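
To illustrate the "expanded urls" point: tweet objects from Twitter's API carry an `entities.urls` list whose entries include an `expanded_url` field. A sketch of filtering those for Amazon links, assuming the tweet has already been parsed into a dict; the `AMAZON_DOMAINS` tuple is a small hand-picked assumption, not an exhaustive list, and `amazon_links` is a hypothetical helper name.

```python
from urllib.parse import urlparse

# Assumption: a small hand-picked set of domains, not an exhaustive list.
AMAZON_DOMAINS = ("amazon.com", "amazon.co.uk", "amzn.to")


def amazon_links(tweet):
    """Return expanded URLs in a tweet dict that point at Amazon."""
    links = []
    for entry in tweet.get("entities", {}).get("urls", []):
        expanded = entry.get("expanded_url") or ""
        host = (urlparse(expanded).hostname or "").lower()
        # Match the domain itself or any subdomain such as "www.".
        if any(host == d or host.endswith("." + d) for d in AMAZON_DOMAINS):
            links.append(expanded)
    return links
```

Matching on the parsed hostname, rather than searching the whole URL string, avoids counting links that merely mention an Amazon domain in their path.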

            1. 1

              Nice. I can see the appeal in that. Could work out well.

              Let me know if you ever start back up on this project.

              PS - I lived in Eidsvoll for a few months.

  6. 1

    I’m working on remotely.gg. It is a tech remote job board.
    It currently scrapes Indeed and GitHub.

  7. 1

    I recently built Maker News, which relies on web scraping since there's no Indie Hackers API.

  8. 1

    I built https://earlyname.com which checks if your username is available on new sites - has a bit of web scraping magic involved using Puppeteer. I think webscraping to aggregate data is really powerful and underused as a SaaS business model.

    1. 1

      100%, I spend a lot of my time trying to raise awareness that the tooling exists and it's so accessible now.

  9. 1

    https://quickapply.io/ We built a platform that allows students to apply to hundreds of internships with one form and a single click. We will have a web scraper to scrape Glassdoor/Lever/LinkedIn jobs soon. For automation, we used Selenium WebDriver.

  10. 1

    https://syften.com

    I scrape various forums. Some have APIs, but some need scraping.

    1. 1

      Nice! I imagine you'd be running some pretty heavy scraping jobs then! What are you using for it?

  11. 1

    Nothing big, but I created my own R script to scrape job postings for specific job titles :) Web scraping is awesome! You can do lots of stuff and automate mundane tasks.

  12. 1

    A few years ago I created a website which scraped news sites and monitored changes in the articles. It was quite interesting to see how they rephrased and reframed articles even months after publishing. I stopped the project eventually because of legal issues and because the server costs were quite high for a non-profit project.
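
Monitoring changes between two scraped versions of an article can be sketched with Python's standard `difflib` module. This is an illustration under assumptions (two plain-text snapshots, compared line by line), not the commenter's implementation; `article_changes` is a hypothetical helper name.

```python
import difflib


def article_changes(old_text, new_text):
    """Return the lines removed from and added to an article."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    removed, added = [], []
    for line in diff:
        # ndiff prefixes each line with a two-character code.
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
    return removed, added
```

Storing only the changed lines per snapshot, instead of every full copy, is one way such a project might keep storage costs down.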

  13. 1

    I love Tatoeba but they don't provide an API, so I'm scraping it to get sentences for my kanji learning app. I also had a very simple Android app for Tatoeba made with React Native, because the website is not very good to use on mobile browsers, but I abandoned the project. I hope to get back to it soon.

  14. 1

    Hey @brycedavies,

    I have built scrapers to populate our Professional Organizer Directory.

    https://stor.guru/organizer_directory

    We have over 4K entries that I scraped from various sites, then a secondary scraper goes to each Professional Organizer's site to get more information like social media links, emails, or phone numbers.

    I use a Python Scrapy setup that works well. I would also like to implement this for Self Storage Units. Our product Stor.Guru is a personal home inventory system that allows people to organize their things with each other in real time, making it easier to keep track of your things!
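
The second-pass step described above (visiting each site to pull emails and social links) could be sketched with the standard library alone. The commenter uses Scrapy; this version swaps in `html.parser` so it runs without dependencies. The `SOCIAL_HOSTS` list, the email regex, and the names `ContactParser` and `extract_contacts` are all assumptions for illustration.

```python
import re
from html.parser import HTMLParser

# Assumptions: a short list of social hosts and a deliberately simple email regex.
SOCIAL_HOSTS = ("facebook.com", "instagram.com", "twitter.com", "linkedin.com")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class ContactParser(HTMLParser):
    """Collect email addresses and social links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.emails = set()
        self.social_links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith("mailto:"):
            # Strip any "?subject=..." suffix from the address.
            self.emails.add(href[len("mailto:"):].split("?")[0])
        elif any(host in href for host in SOCIAL_HOSTS):
            self.social_links.add(href)

    def handle_data(self, data):
        # Also pick up addresses written as plain text.
        self.emails.update(EMAIL_RE.findall(data))


def extract_contacts(html):
    parser = ContactParser()
    parser.feed(html)
    parser.close()
    return sorted(parser.emails), sorted(parser.social_links)
```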

  15. 1

    I have built many scrapers over time. Some collect data from Twitter, one crawls the Chrome Web Store daily to find new extensions, and another collects headlines from a Finnish newspaper. All of these are personal interest topics of sorts and intended for some personal project that never really materialized. The newspaper thing is very niche (because of language) and I have also put that data on GitHub for everyone to use.

  16. 1

    Boostlane.com loves Web scraping. That's how we curate the news and organize it into topics such as https://boostlane.com/h/entrepreneur. It's all done by bots.

  17. 1

    I'm building https://soundartlist.com/ and currently working on an open-source scraping module!

  18. 1

    https://www.seoly.app/ - we're building an affordable SEO tool for small businesses and agencies :)

    1. 1

      Sounds good! Under $100 a month and you're definitely saving money there

  19. 1

    Scraping Google and YouTube on www.regirank.com to find the search engine results page (SERP) ranking position for the videos added on the platform.

    1. 1

      What proxy service are you using to avoid getting blocked?

      1. 1

        I use rotating proxies. Most of them are blocked, but that isn't such a big problem when you get a new IP for every request. You just need more processor power.
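
The "new IP every request, retry when blocked" approach described above can be sketched as a small retry loop. This is an illustration, not the commenter's setup: `fetch_with_rotation` is a hypothetical helper, and `fetch(url, proxy)` is a caller-supplied function so the sketch stays independent of any particular HTTP library.

```python
import itertools


def fetch_with_rotation(url, proxies, fetch, max_attempts=5):
    """Try `url` through successive proxies until one attempt succeeds.

    `fetch(url, proxy)` is a caller-supplied function that returns the
    page body, or raises an exception when the proxy is blocked.
    """
    last_error = None
    for proxy in itertools.islice(itertools.cycle(proxies), max_attempts):
        try:
            return fetch(url, proxy)
        except Exception as error:  # blocked or dead proxy: move on
            last_error = error
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

With the `requests` library, `fetch` could wrap something like `requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)`. Every failed attempt still costs a round trip, which fits the remark that this approach mainly needs more processing power.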

    2. 1

      Nice, SERPs are hard! You need good proxies for them, hey.

      1. 1

        Yeah, especially Google.

  20. 1

    https://listt.xyz/how-i-built-this
    https://listt.xyz

    Built this in an hour with search, filters, and other features too. Of course, this is built using my own SaaS product though.

    1. 1

      Nice! So you've scraped a bunch of different sources and made them available via APIs?

  21. 1

    I would like to aggregate ads from Craigslist and Facebook group posts about free haircuts. But I'm wondering whether I can do it within the content rules of those websites, and how to do it with no code.

    1. 1

      Plenty of good no-code tools around! Also, who doesn't love a free haircut? There is a reference to automatio.co elsewhere in this thread and you could also look at datagrab.io.

      Many more good examples as well depending on what you need!

      1. 1

        Can you check it and tell me what you think? https://hopenbeauty.glideapp.io/

      2. 1

        Thanks, but it's not possible to scrape Facebook group posts with datagrab.io. I'll continue searching.

  22. 0

    I am building a SaaS using no-code tools and my own platform Automatio.co, which will monitor most of the disposable email services, create a fresh DB of disposable domains, and generate an API from it so services, businesses, and communities can use it to protect themselves from spamming, abuse, etc.

    Will share more about it soon.
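
On the consuming side, checking a signup against such a list of disposable domains could be as simple as the sketch below. The function name `is_disposable` and the idea of passing the domain list as a set are assumptions; the actual API of the service described above has not been published in this thread.

```python
def is_disposable(email, disposable_domains):
    """Return True if the email's domain (or a parent domain) is listed."""
    domain = email.rsplit("@", 1)[-1].strip().lower()
    parts = domain.split(".")
    # Check "mail.example.com", then "example.com", but never the bare TLD.
    candidates = {".".join(parts[i:]) for i in range(len(parts) - 1)}
    return not candidates.isdisjoint(disposable_domains)
```

Checking parent domains as well catches throwaway addresses on subdomains of a listed service.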

    1. 2

      I'm very keen to learn more, definitely keep us in the loop!
