
Auto start and tear down servers for crawler on a schedule

I am crawling a lot of data on a regular basis, but I want to do it cheaply, which means that I have to limit the amount of resources used. It is simple to set up a server with a cron job that runs my crawl jobs on a daily basis, but that also means that I pay for the infrastructure 24/7.

How would you provision new infrastructure on the fly and automatically tear it down whenever a job has finished -- in a cost-efficient manner?

Is the simplest way to run a Kubernetes cluster with autoscaling? It does not feel very MVP.

posted to Developers on December 4, 2019
  1. 2

    What about Heroku? You can deactivate the default web dyno and then run your scripts with Heroku Scheduler. This way you'll pay only for the time your scraping scripts have consumed, or even get it for free if you consume less than 550h per month. Although, you will be limited to 512MB of RAM.

    1. 2

      I second Heroku. Quite simple to set up. And that 512MB RAM... you may need to do some memory management/optimization here and there but it should be necessary only for larger content scrapes (don't know what your target is, but if you're scraping one page at a time, persist it, and free up that memory, 512 should be plenty) :) Good luck!
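
The "one page at a time, persist it, and free up that memory" approach above could look roughly like the following. This is a minimal sketch, not a definitive implementation: the function name `crawl_one_at_a_time`, the caller-supplied `fetch` function, and the JSON Lines output file are all assumptions made for illustration.

```python
import json
from typing import Callable, Iterable


def crawl_one_at_a_time(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    out_path: str,
) -> int:
    """Fetch each URL, append the result to a JSON Lines file, and move on.

    Only one page body is referenced at a time, so peak memory stays close
    to the size of the largest single page rather than the whole crawl.
    """
    count = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for url in urls:
            body = fetch(url)  # hypothetical fetcher supplied by the caller
            out.write(json.dumps({"url": url, "body": body}) + "\n")
            count += 1
            # `body` is rebound on the next iteration, so the previous page
            # becomes eligible for garbage collection.
    return count
```

Passing `urls` as a generator rather than a list keeps the URL queue itself from being held in memory all at once.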

  2. 2

    I can almost guarantee that Kubernetes is not the simplest way.

    What about serverless? I think you can run serverless functions on a schedule on most of the serverless providers. Although, long-running functions might be a problem.

    1. 1

      You say Kubernetes is not the simplest way and then you suggest serverless as an alternative for a scraper? That sounds like a lot more work (and refactoring of this code if it's not already written to be Lambda functions), in addition to being more expensive.

      1. 1

        I'm saying that Kubernetes is not very simple. Unless you already know it really well. Then maybe it is.

        I suggested serverless because you only pay for the time it's running so, in theory, it could be cheaper. Whereas, a Kubernetes cluster still has nodes that you have to pay for in most cases.

        I don't know enough about the code structure or scrapers to say that serverless should be used. I was simply suggesting it as a possibility to look into.

  3. 1

    If you are not opposed to AWS, my go-to for this is an ECS task. If you can wrap your app into a Docker container, put it on ECS as an ECS task, then you can create a CloudWatch event to trigger your ECS task on a schedule.

    Docker apps do not need to run forever; your Docker app can just run your scraper and exit cleanly. This way you trigger the ECS (Docker) app whenever you want, it runs for a short period (without the 15-minute Lambda limitation), and Bob's your proverbial uncle!

    This is how I would do it.

    Cheers!
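
A rough sketch of the scheduling step described in the comment above, assuming Python with boto3 and a Fargate launch type. The rule name, cron expression, ARNs, and subnet ID are hypothetical placeholders, and the helper name `schedule_crawler` is invented for illustration. `put_rule` and `put_targets` are the CloudWatch Events (EventBridge) client calls; the function takes the client as an argument so it can be exercised without AWS credentials.

```python
from typing import Any, Dict, List


def build_ecs_target(
    rule_name: str,
    cluster_arn: str,
    task_definition_arn: str,
    role_arn: str,
    subnets: List[str],
) -> Dict[str, Any]:
    """Build the put_targets request that points a rule at one ECS task."""
    return {
        "Rule": rule_name,
        "Targets": [
            {
                "Id": f"{rule_name}-target",
                "Arn": cluster_arn,  # the target ARN is the ECS cluster
                "RoleArn": role_arn,  # role the rule assumes to run the task
                "EcsParameters": {
                    "TaskDefinitionArn": task_definition_arn,
                    "TaskCount": 1,
                    "LaunchType": "FARGATE",  # assumption: no EC2 hosts
                    "NetworkConfiguration": {
                        "awsvpcConfiguration": {
                            "Subnets": subnets,
                            "AssignPublicIp": "ENABLED",
                        }
                    },
                },
            }
        ],
    }


def schedule_crawler(
    events_client: Any,
    rule_name: str,
    cron: str,
    cluster_arn: str,
    task_definition_arn: str,
    role_arn: str,
    subnets: List[str],
) -> None:
    """Create (or update) a schedule rule and attach the ECS task to it."""
    events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=f"cron({cron})",  # AWS cron syntax has six fields
        State="ENABLED",
    )
    events_client.put_targets(
        **build_ecs_target(
            rule_name, cluster_arn, task_definition_arn, role_arn, subnets
        )
    )
```

With real credentials, `events_client` would be `boto3.client("events")`, and a cron value such as `0 3 * * ? *` would start the task at 03:00 UTC every day. The container only bills while the task is running, provided the scraper process exits when it is done.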

  4. 1

    I use AWS Lambda when the job can be split up into 15-minute batches. I haven't looked into AWS Batch yet, but it seems suited for the task.
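
One way to split a crawl into time-limited batches, as suggested above, is to stop shortly before the deadline and hand back a cursor for the next invocation. The sketch below is an assumption about how that could be structured; `process_batch`, its parameters, and the 30-second safety margin are hypothetical and not part of any AWS API.

```python
import time
from typing import Callable, Optional, Sequence


def process_batch(
    urls: Sequence[str],
    start: int,
    process: Callable[[str], None],
    deadline: float,
    clock: Callable[[], float] = time.monotonic,
    safety_margin: float = 30.0,
) -> Optional[int]:
    """Process urls[start:] until done or until the deadline is close.

    Returns the index of the next unprocessed URL so a follow-up invocation
    can resume there, or None when the whole list has been processed.
    """
    i = start
    while i < len(urls):
        # Stop early if fewer than `safety_margin` seconds remain.
        if deadline - clock() < safety_margin:
            return i
        process(urls[i])
        i += 1
    return None
```

Inside a Lambda handler, the deadline could be derived from the context object, for example `time.monotonic() + context.get_remaining_time_in_millis() / 1000`, and a non-`None` return value could be passed along as the `start` of the next invocation.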

  5. 1

    Apologies if I’m being basic, but if you’re just trying to get an MVP done, why can’t you run this on your personal computer or a Raspberry Pi on your home network? It’ll certainly be cheaper than any cloud solution.

    If you do need to crawl from the web (or your at-home compute resources aren’t sufficient), check out AWS Fargate: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/scheduled_tasks.html

    1. 1

      Well, you are right that, as a proof of concept, it is probably enough to run the crawler from my home machine for a few days, dump the data to a server once, and treat it as stale data just to build the application.

      That's actually what I'm doing right now for development.

      The thing is that I'm crawling thousands of websites and millions of pages. So for a "working product" (a product that has updated data and not just stale data), the crawl would need to be done on a daily, weekly or monthly schedule.

      I think ECS is quite a good option, except that my experience with AWS is that things get expensive quite quickly. I'll look into it more. Thanks.

  6. 1

    I think there are multiple options, which also depend a bit on what you mean by "lot of data".

    1. Use a provider with an API and hourly billing (DigitalOcean, Linode, AWS, GCP) and just boot up machines, pull your code and run it, then kill the machine again. That's what the cloud was made for after all.

    2. Don't use a cloud provider but go with a bare-metal provider like OVH, Kimsufi, or Hetzner, where you get a lot of resources for very little money. If you don't need the flexibility of the cloud, go with that. I'm using a Kimsufi server that costs me less than 10 Euros / month and it has an i5, 2TB of disk space and 16GB of memory. That should be more than enough to run a very large scraper.
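
The boot, run, and kill cycle from option 1 can be sketched independently of any particular provider. Everything here is hypothetical: the `Provider` interface and its three methods stand in for whatever the chosen provider's API client offers, and are not the real API of DigitalOcean, Linode, AWS, or GCP.

```python
from typing import Protocol


class Provider(Protocol):
    """Hypothetical wrapper around a cloud provider's API client."""

    def create_server(self, size: str) -> str: ...
    def run(self, server_id: str, command: str) -> int: ...
    def destroy_server(self, server_id: str) -> None: ...


def run_on_ephemeral_server(provider: Provider, size: str, command: str) -> int:
    """Boot a server, run one command on it, and always tear it down.

    The `finally` block is the important part: the server is destroyed even
    if the crawl raises, so a failed job does not keep billing by the hour.
    """
    server_id = provider.create_server(size)
    try:
        return provider.run(server_id, command)
    finally:
        provider.destroy_server(server_id)
```

A small always-on machine or a scheduled CI job could call this once a day; only the short-lived crawl server then needs to be large.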

    1. 1

      I agree - I use Linode and use their API to at least reboot servers automatically. They're actually so cheap that I don't mind having them 24/7.

      But if it's really needed and 5 USD/month is too much, then you can indeed spin one up with the API and get rid of it again once you're done (also with the API).

