
Who's built/used a LI/FB/Reddit scraper?

Has anyone here built/used a scraper/API to crawl information from:

  • FB profiles
  • FB groups
  • LI profiles
  • LI groups
  • Sub-Reddits

I know there's some complex stuff you have to do with proxies and such but was curious to hear if anyone knows of a tool/service to avoid all that?

I think it'd be a great way to do market research. Expecting to do this here and there at low to mid volume so paid is fine too.

  1. 4

    There are quite a few scraping products out there, and most of the time one of their core value propositions is that they handle the IP rotation stuff for you.
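For anyone wondering what "the IP rotation stuff" amounts to, here is a minimal sketch using only the Python standard library. It hands out a new proxy per request in round-robin order. The proxy addresses and the `make_rotator` helper are made up for illustration; a paid service would do this (plus retries, ban detection, and so on) for you.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints; replace with ones you actually control or rent.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]


def make_rotator(proxies):
    """Return a function that yields (proxy, opener) pairs in round-robin order."""
    pool = itertools.cycle(proxies)

    def next_opener():
        proxy = next(pool)
        # Route both http and https traffic through the chosen proxy.
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)

    return next_opener


if __name__ == "__main__":
    rotate = make_rotator(PROXIES)
    for _ in range(4):
        proxy, opener = rotate()
        # opener.open(url, timeout=10) would send the request via this proxy.
        print("next request would go through", proxy)
```

Each call to `rotate()` moves on to the next proxy and wraps back to the first one after the list is exhausted.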

    I believe https://www.scrapingbee.com is an Indiehacker project!

    FB / LI groups might be a bit more complicated though since I don't think the info there is public, it requires you to log in. For that scenario you'll need to look into scrapers that support more sophisticated browser interactions. Not sure if scrapingbee does that - perhaps @Daolf can chime in

    1. 2

      Thanks for the shoutout 😊.

      Here is some information about web scraping I can give you:

      Facebook and FB groups:
      Facebook is not that hard to scrape, BUT you'll need JS rendering to do it, and you have to be careful about pages behind a login wall. All of this can be solved by sending the auth cookies of dummy accounts. Difficulty 3/5

      Reddit:
      Reddit is the easiest to scrape, their API is quite open. If you don't need to retrieve Reddit history I suggest you take a look at this: https://github.com/pushshift/reddit_sse_stream. Difficulty 1/5

      Linkedin:
      LinkedIn is reportedly very hard to scrape and, more importantly, they come down hard on anyone who does. There are a couple of services out there that offer LinkedIn scraping, but they're very expensive and, in my experience, very slow and unreliable for real-time usage. Difficulty 5/5

      I hope it helps :)

    2. 1

      Right, IP rotation, JS expansion, and logging in are definitely the key value props. Checking scrapingbee out now. Thanks!

  2. 3

    Reddit has an API, or you can use the praw module. Another simple hack is adding .json to any subreddit URL to get a JSON feed, like https://www.reddit.com/r/startups.json
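A rough sketch of the .json trick in Python, standard library only. The helper names (`listing_url`, `extract_posts`, `fetch_posts`) and the User-Agent string are my own placeholders, and the field names assume the usual Reddit listing shape (`data` → `children` → `data`), so treat it as a starting point rather than a finished client.

```python
import json
import urllib.request


def listing_url(subreddit, limit=25):
    """Build the JSON feed URL for a subreddit (the '.json' trick)."""
    return f"https://www.reddit.com/r/{subreddit}.json?limit={limit}"


def extract_posts(listing):
    """Pull a few fields out of a decoded listing; missing keys become None."""
    children = listing.get("data", {}).get("children", [])
    return [
        {
            "title": child["data"].get("title"),
            "score": child["data"].get("score"),
            "permalink": child["data"].get("permalink"),
        }
        for child in children
    ]


def fetch_posts(subreddit, limit=25):
    """Download the feed and return the extracted posts (needs network access)."""
    request = urllib.request.Request(
        listing_url(subreddit, limit),
        # Placeholder User-Agent; requests without a descriptive one may be throttled.
        headers={"User-Agent": "market-research-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_posts(json.load(response))


if __name__ == "__main__":
    for post in fetch_posts("startups", limit=5):
        print(post["score"], post["title"])
```

Parsing is kept separate from fetching so you can try `extract_posts` on a saved response without hitting Reddit every time.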

    For LinkedIn and Facebook it's going to be harder. You can try phantombuster or create a browser automation script using selenium or puppeteer. In the past I have created automation scripts that post in FB pages, FB groups, and LI groups and scrape the stats of those posts. I've also scraped FB reviews for places and pages. It's going to be hard to scale that, and it requires a lot of resources.

    1. 1

      Cool tip about Reddit.

      I'm not familiar with Phantombuster but have used selenium and puppeteer. Sounds like you're saying if I want to analyze a significant amount of information I should stick to finding a service. Makes sense

  3. 3

    I built a LI scraper with Selenium and learned pretty quickly that if you're doing any sort of repeated/similar query from a logged in session, LI is aggressive about shutting you down. Much better off scraping public profiles without logged in info.

    1. 1

      Thanks for sharing! Hopefully you didn't lose your actual LI profile. Have read about this before so it's interesting to hear someone's personal account of it.

    2. 1

      Totally agree, better to not risk your linkedin profile on this.

    1. 1

      Thanks for sharing! I Googled two/three years ago and a few rulings of "maybe it's okay" seemed to come up. Have personally seen a lot more companies using that data crop up since.

    1. 1

      Thanks Chris, this is specifically to extract events which I hadn't thought of as a source of potentially interesting information as well.

  4. 1

    hi @philipp - DM me, I have a service to scrape public Linkedin profile pages. I can let you know the costs based on your needs.

    1. 1

      There's no DM on IH and I can't find any information to contact you in your profile or products section.

  5. 1

    Another indiehacker project you can use is https://scraper.ai :)

    1. 1

      Got it, this is an on-page extraction tool rather than an extraction service. I'll keep it in mind.

    2. 1

      Hey @maximg, I was just checking out your site (pretty cool BTW!). I wanted to let you know that I had some slightly irritating issues when I clicked on your menu from mobile (feel free to PM me for the exact model). I took a screenshot, but unfortunately it doesn't seem I'm able to post it here. Basically, the menu would not only be stuck in the uncollapsed position, it was also transparent unless I scrolled down. Then the transparency would go away, but I still couldn't collapse it. (BTW, the transparency was just as bad as not closing, because it made the menu completely illegible when first opened: the background letters got mixed in with the menu letters.)
      On a positive note, I love everything else about the site and the app. I'm curious how many customers and how much revenue you have, and where you're at in your journey! Take care!

      1. 1

        Hey, thanks for the feedback, we'll be optimizing the mobile experience soon!

        We started not that long ago and are now at around 2k ARR (very low still) and have attracted around 1.5k sign ups

  6. 2

    This comment was deleted 3 years ago.

    1. 1

      Oh nice! I had no idea Reddit lets you do that. And I didn't think of Instagram either. I guess that would require some Computer Vision to analyze that but doable and could be useful. 🤔

      Thanks for the resources!
