How to scrape correctly?

by Pristupniik

Hey guys,

I would like to build a tool (SaaS) where users can fetch the best-performing posts on Instagram. I found some interesting libraries for this, but I am wondering how I can do it without being blocked by Instagram. I will surely need a lot of proxies, but how exactly do I set this up? Where can I buy these proxies, and how do I rotate them?

I hope you understand my current problem.

PS: I have already built a tool like this, but I am limited to business accounts only, because I am using the official Facebook API. The tool is: https://virallyze.com

I currently have 1200 registered users, so I think there are definitely people who would like to use such a tool :)

Pristupniik posted to Developers on July 2, 2020


2

You're looking for residential proxies, which are expensive. There are lots of scraping services that take care of the network portion for you. They essentially build their own residential proxy networks or have arrangements with existing ones.

jborak · 6 years ago
2

I'm currently building scraping bots for our project. Check whether you can access the site via the Tor network, because that makes rotating IPs as easy as redeploying Tor proxy containers in a k8s cluster 😉

rmamba · 6 years ago
2
We had a similar issue at Browse AI. We researched a lot of proxy service providers and eventually found 2 good ones:
- https://oxylabs.io/
- https://luminati.io/
Both are quite pricey when your data transfer is significant because they charge per GB.

P.S. I wish we were a bit further along with our product features so you could use it to build your tool! We're adding a few capabilities that you'd need over the next 3 months (a public API, for example). If you're interested, you can sign up and I'll email you monthly updates.

ardalan · 6 years ago
2

I wrote a design doc on how to scrape Wikipedia using 10,000 machines so that each URL is fetched only once, with distributed systems techniques used to minimize network traffic.

Deploying these machines across a few cloud providers, and maybe using a proxy service (as others have mentioned), would get you there.
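
To make the "fetch each URL only once" idea concrete, here is a minimal sketch of one common approach: hash-partitioning URLs across workers, with each worker remembering what it has already fetched. This is not taken from the design doc, whose details are not quoted in this thread; `owner_of` and `Worker` are hypothetical names made up for illustration.

```python
import hashlib


def owner_of(url, num_workers):
    """Map a URL to a worker index using a hash that is stable across machines.

    Python's built-in hash() is randomized per process, so a cryptographic
    digest is used here to make every machine agree on the owner.
    """
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


class Worker:
    """A crawler node that only accepts URLs it owns and never repeats one."""

    def __init__(self, index, num_workers):
        self.index = index
        self.num_workers = num_workers
        self.seen = set()  # URLs this worker has already accepted

    def should_fetch(self, url):
        if owner_of(url, self.num_workers) != self.index:
            return False  # another worker is responsible for this URL
        if url in self.seen:
            return False  # already fetched by this worker
        self.seen.add(url)
        return True


if __name__ == "__main__":
    workers = [Worker(i, 4) for i in range(4)]
    url = "https://en.wikipedia.org/wiki/Web_crawler"
    print([w.should_fetch(url) for w in workers])
```

Because exactly one worker owns any given URL, no coordination is needed to avoid duplicate fetches. The trade-off is that changing the number of workers reassigns most URLs, which is the problem consistent hashing is usually brought in to soften.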

Design a Distributed Web Crawler

Let me know if you have any questions!

KevinColemanInc · 6 years ago
1. 1
  
  I guess Wikipedia is just an example in your case, but in case someone else sees this: please don't scrape Wikipedia like that. Use the official dumps and don't make them handle more requests than they already do:
  
  https://dumps.wikimedia.org
  
  dewey · 6 years ago
  1. 1
    
    They haven't updated their HTML dump for about 12 years.
    
    KevinColemanInc · 6 years ago
    1. 1
      
      Why is the timestamp from yesterday then?
      
      https://dumps.wikimedia.org/enwiki/latest/
      
      dewey · 6 years ago
      1. 1
        
        Ah, yeah, I can see how this is confusing for you. If you read my paper, you will see that the goal is to fetch the HTML copy of Wikipedia, without needing any image content. You linked to something a bit different: those files are in SQL and XML format.
        
        The static HTML dumps haven't been updated since 2008.
        
        KevinColemanInc · 6 years ago
2

You're almost there.

Yes, you need to use a proxy service to help you rotate IP addresses. Here's one, but there are loads of these out there:

https://instantproxies.com/pricing/

It's then just a case of using whatever method you were using to fetch HTML, but adding in the proxy as a parameter. Most libraries for making HTTP requests will have this built in, like curl:

https://ec.haxx.se/usingcurl/usingcurl-proxies
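
As a rough illustration of "adding in the proxy as a parameter" and rotating between proxies, here is a minimal sketch using only Python's standard library. The proxy URLs are placeholders, and `make_rotator`, `opener_for`, and `fetch` are hypothetical helper names, not part of any provider's API. It assumes your provider gives you a plain list of HTTP proxy endpoints.

```python
import itertools
import urllib.request

# Placeholder endpoints (assumption): replace with the ones your provider gives you.
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]


def make_rotator(proxies):
    """Return a function that hands out proxies in round-robin order."""
    pool = itertools.cycle(proxies)
    return lambda: next(pool)


def opener_for(proxy_url):
    """Build a urllib opener that sends HTTP and HTTPS traffic via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, next_proxy, timeout=10):
    """Fetch url through the next proxy in the pool and return the body as text."""
    opener = opener_for(next_proxy())
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    next_proxy = make_rotator(PROXIES)
    # Each call to next_proxy() returns the following proxy in the list.
    print([next_proxy() for _ in range(4)])
```

Round-robin is the simplest rotation policy. In practice you would probably also want to drop proxies that start failing and add delays between requests, which this sketch leaves out.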

After you have successfully grabbed the HTML, you have to parse out the data you want, but I presume you already know how to do that. There are a number of HTML parsing libraries out there - e.g. in Ruby we use Nokogiri:

https://nokogiri.org
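
For readers who have not done the parsing step before, here is a minimal sketch in Python using the standard library's `html.parser` rather than Nokogiri. The markup in `SAMPLE_HTML` is entirely made up for illustration (the `post` class and `data-likes` attribute are assumptions), and a real page will look different and can change at any time.

```python
from html.parser import HTMLParser

# Hypothetical markup (assumption): not the structure of any real site.
SAMPLE_HTML = """
<div class="post" data-likes="45"><a href="/p/def">Second</a></div>
<div class="post" data-likes="120"><a href="/p/abc">First</a></div>
<div class="ad"><a href="/sponsored">Ad</a></div>
"""


class PostParser(HTMLParser):
    """Collect (href, likes) pairs from links inside <div class="post">.

    Nested divs are not handled; this is only meant for flat markup
    like the sample above.
    """

    def __init__(self):
        super().__init__()
        self.posts = []
        self._likes = None  # like count of the post div we are inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            is_post = attrs.get("class") == "post"
            self._likes = int(attrs.get("data-likes", "0")) if is_post else None
        elif tag == "a" and self._likes is not None and "href" in attrs:
            self.posts.append((attrs["href"], self._likes))

    def handle_endtag(self, tag):
        if tag == "div":
            self._likes = None


def top_posts(html):
    """Return (href, likes) pairs sorted by like count, highest first."""
    parser = PostParser()
    parser.feed(html)
    return sorted(parser.posts, key=lambda post: post[1], reverse=True)


if __name__ == "__main__":
    print(top_posts(SAMPLE_HTML))
```

Keeping the selectors (`post`, `data-likes`) in one small class makes it easier to patch the scraper when the site changes its HTML.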

Note that if you're scraping content that doesn't want to be scraped then you're probably violating some terms of service... be warned! And you're also entering into an arms race with the owner of the platform; all it takes is for them to change their HTML in some way and your scrapers will break, let alone other techniques they could introduce like scrambling / honey pots etc.

Good luck!

yongfook · 6 years ago
2

This comment was deleted 4 years ago.

DeletedUser · 6 years ago
1. 1
  
  There are other services that do not use this API. The big benefits of not using the API are that 1) you are able to scan private profiles too, and 2) people don't need to authenticate with Facebook.
  
  Pristupniik · 6 years ago
  1. 1
    
    This comment was deleted 4 years ago.
    
    DeletedUser · 6 years ago