Automatically start and tear down servers for a crawler on a schedule

by Marcus

I am crawling a lot of data on a regular basis, but I want to do it cheaply, which means that I have to limit the amount of resources used. It is simple to set up a server with a cron job that runs my crawl jobs on a daily basis, but that also means that I pay for the infrastructure 24/7.

How would you provision new infrastructure on the fly and automatically tear it down whenever a job has finished, in a cost-efficient manner?

Is the simplest way to run a Kubernetes cluster with autoscaling? It does not feel very MVP.

Marcus posted to Developers on December 4, 2019

2

What about Heroku? You can deactivate the default web dyno and then run your scripts with Heroku Scheduler. This way you'll pay only for the time your scraping scripts have consumed, or even get it for free if you consume less than 550h per month. Although, you will be limited to 512MB of RAM.
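
Whether a scheduled job stays within a monthly hour quota is simple arithmetic. A minimal sketch, assuming the 550h figure mentioned above; the job durations are made-up examples, and `monthly_hours` and `fits_free_quota` are hypothetical helper names:

```python
# Rough estimate of the hours a scheduled job consumes per month.
# The job numbers used with these helpers are illustrative, not measurements.

FREE_HOURS_PER_MONTH = 550  # quota figure quoted in the comment above


def monthly_hours(runs_per_day: float, hours_per_run: float, days: int = 31) -> float:
    """Total hours a scheduled job consumes in a month of `days` days."""
    return runs_per_day * hours_per_run * days


def fits_free_quota(runs_per_day: float, hours_per_run: float) -> bool:
    """True if the job stays within the quota even in a 31-day month."""
    return monthly_hours(runs_per_day, hours_per_run) <= FREE_HOURS_PER_MONTH
```

For example, one 3-hour crawl per day comes to 93 hours in a 31-day month, well inside the quota, while a crawl that runs around the clock (744 hours) does not fit.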

LukaszWiktor · 7 years ago
1. 2
  
  I second Heroku. Quite simple to set up. And that 512MB of RAM... you may need to do some memory management/optimization here and there, but it should be necessary only for larger content scrapes (I don't know what your target is, but if you scrape one page at a time, persist it, and free up that memory, 512MB should be plenty) :) Good luck!
  
  BetaPeak · 7 years ago
2

I can almost guarantee that Kubernetes is not the simplest way.

What about serverless? I think most serverless providers let you run functions on a schedule, although long-running functions might be a problem.

nprail · 7 years ago
1. 1
  
  You say Kubernetes is not the simplest way and then you suggest serverless as an alternative for a scraper? That sounds like a lot more work (and refactoring of this code if it's not already written as Lambda functions), in addition to being more expensive.
  
  dewey · 7 years ago
  1. 1
    
    I'm saying that Kubernetes is not very simple. Unless you already know it really well. Then maybe it is.
    
    I suggested serverless because you only pay for the time it's running so, in theory, it could be cheaper. Whereas, a Kubernetes cluster still has nodes that you have to pay for in most cases.
    
    I don't know enough about the code structure or scrapers to say that serverless should be used. I was simply suggesting it as a possibility to look into.
    
    nprail · 7 years ago
1

If you are not opposed to AWS, my go-to for this is an ECS task. If you can wrap your app in a Docker container and put it on ECS as a task, you can create a CloudWatch Events rule to trigger that task on a schedule.

Docker apps do not need to run forever; your Docker app can just run your scraper and exit cleanly. This way you trigger the ECS (Docker) app whenever you want, it runs for a short period (without the 15-minute Lambda limit), and Bob's your proverbial uncle!
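
A minimal sketch of the run-and-exit pattern described above, assuming a Python crawler. `run_job` and `crawl_page` are hypothetical names, and the ECS task definition and the schedule rule themselves are not shown. The only point is that the process does its work once and reports success or failure through its exit code, so the container stops as soon as the crawl is done.

```python
import sys
from typing import Callable, Iterable


def run_job(urls: Iterable[str], crawl: Callable[[str], None]) -> int:
    """Crawl every URL once and return a process exit code (0 = all succeeded)."""
    failures = 0
    for url in urls:
        try:
            crawl(url)
        except Exception as exc:
            # Keep going with the other URLs; the failure is reported via the exit code.
            failures += 1
            print(f"failed: {url}: {exc}", file=sys.stderr)
    return 0 if failures == 0 else 1


def crawl_page(url: str) -> None:
    """Placeholder for the real crawler (hypothetical)."""
    print(f"crawled {url}")
```

The container's entrypoint script would then end with something like `sys.exit(run_job(sys.argv[1:], crawl_page))`, so nothing keeps running after the last page.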

This is how I would do it.

Cheers!

johneke · 7 years ago
1

I use AWS Lambda, which works if you can split the job up into batches of under 15 minutes. I haven't looked into AWS Batch yet, but it seems suited to the task.
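
One way to fit a crawl into fixed-length invocations is to give each run a time budget and hand the unfinished URLs to the next run. The sketch below assumes a budget somewhat under the 15-minute limit (for example 14 minutes) and that a single page never takes longer than the remaining slack. `run_batch` and its parameters are hypothetical, and how `remaining` reaches the next invocation (a queue, a database, a re-invocation) is left open.

```python
import time
from typing import Callable, List, Sequence, Tuple


def run_batch(
    urls: Sequence[str],
    crawl: Callable[[str], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], List[str]]:
    """Crawl URLs in order until the time budget is used up.

    Returns (done, remaining); `remaining` is what the next run should pick up.
    """
    deadline = clock() + budget_seconds
    done: List[str] = []
    for index, url in enumerate(urls):
        # The budget is only checked between pages, never in the middle of one.
        if clock() >= deadline:
            return done, list(urls[index:])
        crawl(url)
        done.append(url)
    return done, []
```

The `clock` parameter exists only so the function can be exercised with a fake clock instead of waiting for real time to pass.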

quodlibet · 7 years ago
1

Apologies if I’m being basic, but if you’re just trying to get an MVP done, why can’t you run this on your personal computer or a Raspberry Pi on your home network? It’ll certainly be cheaper than any cloud solution.

If you do need to crawl from the web (or your at-home compute resources aren’t sufficient), check out AWS Fargate: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/scheduled_tasks.html

jamilbk · 7 years ago
1. 1
  
  Well, you are right that, as a proof of concept, it is probably enough to run the crawler from my home machine for a few days, dump the data to a server once, and treat it as stale data just to build the application.
  
  That's actually what I'm doing right now for development.
  
  The thing is that I'm crawling thousands of websites and millions of pages. So for a "working product" (a product that has up-to-date data and not just stale data), the crawl would need to run on a daily, weekly or monthly schedule.
  
  I think ECS is quite a good option, except that in my experience things get expensive quite quickly on AWS. I'll look into it more. Thanks.
  
  diptail · 7 years ago
1

I think there are multiple options, which also depend a bit on what you mean by "a lot of data".

1. Use a provider with an API and hourly billing (DigitalOcean, Linode, AWS, GCP) and just boot up machines, pull your code and run it, then kill the machine again. That's what the cloud was made for, after all.
2. Don't use a cloud provider but go with a bare-metal provider like OVH, Kimsufi or Hetzner, where you get a lot of resources for very little money. If you don't need the flexibility of the cloud, go with that. I'm using a Kimsufi server that costs me less than 10 euros/month and it has an i5, 2TB of disk space and 16GB of memory. That should be more than enough to run a very large scraper.

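
The boot, run, kill cycle from the first option can be sketched as follows. This is only an outline under assumptions: `Provider`, `create_server` and `destroy_server` are a hypothetical interface (each provider's real SDK or API looks different), and `FakeProvider` is an in-memory stand-in so the flow can be run without any account.

```python
from typing import Callable, Protocol, Set


class Provider(Protocol):
    """Hypothetical minimal interface; real provider APIs differ."""

    def create_server(self) -> str: ...
    def destroy_server(self, server_id: str) -> None: ...


class FakeProvider:
    """In-memory stand-in used only to demonstrate the flow."""

    def __init__(self) -> None:
        self.active: Set[str] = set()
        self._counter = 0

    def create_server(self) -> str:
        self._counter += 1
        server_id = f"server-{self._counter}"
        self.active.add(server_id)
        return server_id

    def destroy_server(self, server_id: str) -> None:
        self.active.discard(server_id)


def run_on_temporary_server(provider: Provider, job: Callable[[str], None]) -> None:
    """Create a server, run the job against it, and always destroy it afterwards."""
    server_id = provider.create_server()
    try:
        job(server_id)
    finally:
        # Runs even if the job raises, so a failed crawl does not leave a billed machine behind.
        provider.destroy_server(server_id)
```

The `try`/`finally` is the important part: with hourly billing, a crashed job that never reaches the teardown call keeps costing money until someone notices.
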
dewey · 7 years ago
1. 1
  
  I agree - I use Linode and use their API to at least reboot servers automatically. They're actually so cheap that I don't mind having them 24/7.
  
  But if it's really needed and 5 USD/month is too much, then you can indeed spin one up with the API and get rid of it again once you're done (also with the API).
  
  TkdChamp1 · 7 years ago