
How to scrape correctly?

Hey guys,

I would like to build a tool (SaaS) that lets users fetch the best-performing posts on Instagram. I found some interesting libraries for this, but I am wondering how to do it without being blocked by Instagram. I will certainly need a lot of proxies. But how exactly does this work? Where do I buy these proxies, and how do I rotate them?

I hope you understand my current problem.

PS: I have already built a tool like this, but it is limited to business accounts because I am using the official Facebook API. The tool is: https://virallyze.com

I currently have 1,200 registered users, so I think there are definitely people who would like to use such a tool :)

posted to Developers on July 2, 2020
  1. 2

    You're looking for residential proxies, which are expensive. There are lots of scraping services that take care of the network portion for you. They essentially build their own residential proxy networks, or perhaps have arrangements with existing ones.

  2. 2

    I'm currently building scraping bots for our project. Check whether you can access the site via the Tor network, because that makes rotating IPs as easy as redeploying Tor proxy containers in a k8s cluster 😉
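
As a lighter-weight variant of the Tor idea above (not the commenter's container-redeploy setup), a running Tor daemon can be asked for fresh circuits over its control port. The sketch below assumes `ControlPort 9051` and a control password are configured in `torrc`; the function names are made up for illustration. Note that a new circuit does not guarantee a different exit IP, and Tor rate-limits these requests.

```python
import socket


def newnym_commands(password):
    """Return the control-port lines that authenticate, then request new circuits."""
    # Escape backslashes and double quotes so the password is a valid quoted string.
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f'AUTHENTICATE "{escaped}"\r\n'.encode("utf-8"),
        b"SIGNAL NEWNYM\r\n",
    ]


def request_new_circuit(password, host="127.0.0.1", port=9051):
    """Send the commands to a local Tor control port (assumed to be 9051)."""
    with socket.create_connection((host, port), timeout=10) as sock:
        reader = sock.makefile("rb")
        for command in newnym_commands(password):
            sock.sendall(command)
            reply = reader.readline().decode("utf-8", "replace").strip()
            # Tor answers "250 OK" on success; anything else is treated as a failure.
            if not reply.startswith("250"):
                raise RuntimeError(f"Tor control port refused {command!r}: {reply}")
```

Calling `request_new_circuit("my-password")` between batches of requests would be the usage, with the scraper's traffic itself going through Tor's SOCKS port.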

  3. 2

    We had a similar issue at Browse AI. We researched a lot of proxy service providers and eventually found 2 good ones:

    Both are quite pricey when your data transfer is significant because they charge per GB.

    P.S. I wish we were a bit further along with our product features so you could use it to build your tool! We're adding a few capabilities that you'd need over the next 3 months (a public API, for example). If you're interested, you can sign up and I'll email you monthly updates.

  4. 2

    I wrote a design doc on how to scrape Wikipedia using 10,000 machines such that each URL is fetched only once and network traffic is minimized using distributed systems techniques.

    Deploying these machines across a few cloud providers and maybe using a proxy service (like others have mentioned) would get you there.

    Design a Distributed Web Crawler
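
The thread does not reproduce the design doc's method, so the following is not that design. It is a minimal sketch of one common way to make sure each URL is fetched only once across many machines: hash every URL to a single owning worker, and let that worker keep a local set of URLs it has already seen. `owner_of` and `Worker` are hypothetical names.

```python
import hashlib


def owner_of(url, num_workers):
    """Map a URL to a worker index with a stable hash (same answer on every machine)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


class Worker:
    """One crawler machine; it only fetches URLs that hash to its own index."""

    def __init__(self, index, num_workers):
        self.index = index
        self.num_workers = num_workers
        self.seen = set()

    def should_fetch(self, url):
        """Return True only the first time this worker is offered a URL it owns."""
        if owner_of(url, self.num_workers) != self.index or url in self.seen:
            return False
        self.seen.add(url)
        return True
```

Because ownership is decided by the hash alone, workers never need to ask each other whether a URL has been fetched; a discovered link is simply forwarded to its owner. A plain modulo reshuffles most URLs when the worker count changes, so a real deployment would likely want consistent hashing instead.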

    Let me know if you have any questions!

    1. 1

      I guess Wikipedia is just an example in your case, but in case someone else sees this: please don't scrape Wikipedia like that. Use the official dumps rather than adding to the request load they already handle:

      https://dumps.wikimedia.org

      1. 1

        They haven't updated their HTML dump for about 12 years.

          1. 1

            Ah, yeah, I can see how this is confusing for you. If you read my paper, you will see that the goal is to fetch the HTML copy of Wikipedia without any image content. You linked to something a bit different: those files are in SQL and XML format.

            The static html dumps haven't been updated since 2008.

  5. 2

    You're almost there.

    Yes, you need to use a proxy service to help you rotate IP addresses. Here's one, but there are loads of these out there:

    https://instantproxies.com/pricing/

    It's then just a case of using whatever method you were using to fetch HTML, but adding in the proxy as a parameter. Most libraries for making HTTP requests will have this built in, like curl:

    https://ec.haxx.se/usingcurl/usingcurl-proxies
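
A minimal sketch of "adding in the proxy as a parameter" with round-robin rotation, using Python's standard `urllib` in place of curl. The proxy addresses are placeholders from the reserved documentation range 203.0.113.0/24, and `proxy_cycle`, `opener_for`, and `fetch` are hypothetical helper names; a real provider would supply its own endpoints and usually credentials.

```python
import itertools
import urllib.request

# Placeholder endpoints (documentation-only addresses); replace with your provider's.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_cycle(proxies):
    """Return an iterator that hands out proxies in round-robin order, forever."""
    return itertools.cycle(proxies)


def opener_for(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS traffic through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, rotation, timeout=10.0):
    """Fetch url through the next proxy in the rotation and return the body bytes."""
    opener = opener_for(next(rotation))
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Usage would look like `rotation = proxy_cycle(PROXIES)` followed by `fetch("https://example.com/", rotation)` for each request. Round-robin is the simplest policy; dropping proxies that start returning errors or blocks is a natural next step.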

    After you have successfully grabbed the HTML, you then have to parse out the data you want, but I presume you already know how to do that. There are a number of HTML parsing libraries out there, e.g. in Ruby we use Nokogiri:

    https://nokogiri.org
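
For readers not on Ruby, the parsing step can be sketched with Python's built-in `html.parser` as a stand-in for Nokogiri. The sample markup and the `LinkExtractor` / `extract_links` names are invented for illustration and do not reflect any real site's HTML.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Parse an HTML string and return the link targets in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Invented sample markup, only to show the shape of the input.
SAMPLE_HTML = (
    '<ul><li><a href="/p/1">First</a></li>'
    '<li><a href="/p/2">Second</a></li>'
    "<li><a>no href</a></li></ul>"
)
```

Here `extract_links(SAMPLE_HTML)` returns `["/p/1", "/p/2"]`. A parser tied to specific tags like this is exactly what breaks when the site changes its markup.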

    Note that if you're scraping content that doesn't want to be scraped then you're probably violating some terms of service... be warned! And you're also entering into an arms race with the owner of the platform; all it takes is for them to change their HTML in some way and your scrapers will break, let alone other techniques they could introduce like scrambling / honey pots etc.

    Good luck!

  6. 2

    This comment was deleted 4 years ago.

    1. 1

      There are other services that do not use this API. The big benefits of not using the API are that 1) you are able to scan private profiles too and 2) people don't need to authenticate with Facebook.

      1. 1

        This comment was deleted 4 years ago.
