April 3, 2019

Data Engineer looking for another technical Co-Founder

I'm a longtime Data Engineer and crawling expert by trade who has amassed a massive amount of data over the last few years: products and pricing (2.3 billion Amazon products, with about 3 years of pricing history for at least 16 million of them), real estate (limited data, but on every property in the US), and business and social data.

I also have a pretty extensive background in SEO and understand the rules of sitemap indexing and tracking across millions of pages in order to grow organic traffic at scale. My business goal is to create a massively large network of sites built on all of this data, with little to no client-side functionality at first, that attracts Google traffic (and thus ad revenue and/or affiliate purchases) and generates steady revenue early on.

I've built APIs and web UIs for this kind of stuff for most of my career, but I'm looking for someone willing to go down the Static Site Generator (SSG) rabbit hole with me (Jekyll or Hugo) who can focus on the user experience and display of all of this data while I focus on updating and generating new data channels. This is not an easy road to travel: it requires a lot of thinking and pre-processing ahead of time, so changes to UX can be expensive at this scale, but the long-term cost benefits are incredible.

I'm not against having someone non-technical reach out too. My primary goal here is revenue generation, mainly from ads, so anyone with connections or a background in ad sales is welcome as well. The idea is that as each new data-heavy site launches, we can pay attention to the ones doing well and then start adding client-side features to increase user time on site, etc. I also welcome criticism of the approach. I know a few hedge funds that do bits and pieces of this sort of thing, but no one doing it at the scale I want to get to. Let me know if anyone is interested.

#looking-for-cofounder

  1. 1

    Hi,

    Nice project.

    I'm just curious what resources you need to run the server with a DB and cache behind it.

    1. 2

      Well, originally I built the whole thing to run on AWS using just S3 and ElasticSearch. The code for most of the collection, storage and API pieces is available at https://github.com/bastosmichael/skynet for review. I turned off the service last year because, while it was making money, it wasn't doing well enough to justify the time and energy I was putting into it. I still have the data for everything, though, so I decided to port it all to Postgres and run it locally. My main machine has 64GB of RAM and a 1 TB SSD. A rack server with 32GB of RAM and a 2TB drive is my DB machine, and a 16GB SFF machine with another 1 TB SSD is my production build and deploy machine. My laptop, where I write most of my code and do most of my testing, is a 32GB system with a 1 TB SSD. All in all you're looking at a network of roughly 140GB of RAM, 3 TB of SSD and roughly 12 TB of standard drives collectively. I also run a 10 TB Drobo for long-term data storage, but it's network-only and extremely slow to transfer data in and out of, so for now it's really just a backup utility.

      1. 1

        I think you made a big mistake by choosing to host your project on AWS.

        A server with an i7-8700, 64GB of RAM and 2x1 TB NVMe SSDs is around $60. A similar one with an i9-9900K is $73. You probably need 3 ($200 per month).

        10 TB of storage for backup is $50.

        Do you think the static website will decrease the cost by a lot? I expect to have servers with low CPU and RAM but with more SSD storage.

        I am using hexo.io as SSG. Maybe we will talk more in private.

        1. 1

          With my new implementation I don't need servers on AWS anymore. Most of the processing is done locally on my personal machines first and then uploaded to S3. Personally, I would take a look at Hugo for speed if you're doing millions of pages.

          1. 1

            S3 will be extremely expensive for millions of pages.

            Did you calculate the cost for a scenario?

            1. 1

              Yes, and it's under $100 per month based on the amount of traffic I was getting before I shut the project down, which is a huge cost savings by comparison. Billions of pages is obviously a different story and could be unsustainable at that same traffic level, so I'll have to deploy pages gradually and wait for traffic to rise before I deploy more. I'm starting with 40 million pages, working my way up to 100 million, and will then try to add pages in 100 million increments as time and resources permit.
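A scenario calculation like the one discussed here can be sketched as a small function. This is only an illustrative model, not the author's actual spreadsheet: the function name, the four cost terms, and every rate parameter are assumptions, and real rates have to come from the provider's current price list.

```python
def estimate_monthly_cost(
    stored_gb: float,
    uploads: int,
    page_views: int,
    avg_page_kb: float,
    *,
    storage_per_gb: float,
    per_1k_uploads: float,
    per_1k_reads: float,
    egress_per_gb: float,
) -> float:
    """Sum storage, upload-request, read-request and egress charges for one month.

    All rates are caller-supplied placeholders (hypothetical parameters);
    nothing here encodes a real provider's pricing.
    """
    storage = stored_gb * storage_per_gb
    upload_requests = uploads / 1000 * per_1k_uploads
    read_requests = page_views / 1000 * per_1k_reads
    egress_gb = page_views * avg_page_kb / 1_000_000  # KB -> GB (decimal units)
    return storage + upload_requests + read_requests + egress_gb * egress_per_gb
```

In this model the upload term scales with the number of pages pushed rather than with traffic, which is one way to see why rolling pages out in increments keeps the bill tied to what the traffic can pay for.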

  2. 1

    What's the datastore behind all of that? I ask because it's waaaaaaay better to do that from a web app connected to a datastore, and then just crawl + cache it, so that it's trivial to keep that data updated and relevant.

    1. 1

      It's mostly Postgres, with ElasticSearch as the search index, now running on physical hardware instead of AWS. I ran this whole thing as a web app for almost 3 years (thus the reason for the history) and it made plenty of money, but it also cost a fortune to run at this size and scale (cache can be expensive). So my reasons for going with an SSG, and for updating the data once a day instead of in real time through a web app, are very specific and calculated. What you gain in ease of development you lose long term in cost when you're talking about millions if not billions of pages and data sets. Maybe when the sites get enough traffic and revenue grows a bit more I can switch back to a web app. But I'm going for the biggest impact at the lowest cost, and caching gets called the "right" solution too often even though it's expensive at this scale. I've been doing sites at scale for a long time, and the cost differential of just storing an SSG site in S3 is the reason I'm going this route in the first place.

      1. 1

        hmmm... nope, you're thinking about this totally wrong :-)

        I'm NOT suggesting you cache this in a cache store, like memcache or redis, etc...

        Cache to disk.

        When a page is rendered for the first time, it gets written to an HTML file (in S3 or wherever you want).

        Hence my comment about Crawling + Caching.

        The Crawler just renders everything to a static site, except for maybe a few pages that significantly add to your bottom line by being dynamic.

        A Rails front-end (for example) backed by Postgres on AWS is NOT expensive at this level. If you're really tight on cash, just go to Linode and be done with it :)

        It's SUPER cheap and way easier to manage a LARGE amount of data than through Static Site Generators.

        SSG's sound like fuzzy warm unicorns prancing under rainbows...

        ...but they become slow, kludgy nightmares when you attempt anything of real size over a period of time.

        At least, this has been my direct experience, can't speak to anyone else's.

        1. 1

          It's interesting you say that. My first API iteration wasn't Postgres-based but rather a Rails app with data stored long term in S3 and indexed in ElasticSearch (https://github.com/bastosmichael/skynet). The S3 data was cheap; the ES and Rails servers are what got expensive over time with that much data. The crawler worked exactly like you described. I turned the whole app off because it was still too expensive even with all of the optimizations I made, and ES is just as good as any cache, whether used for fuzzy search or as a direct link to a record in order to generate a page super fast. What my research told me is that Rails was my bottleneck, and that if I stored not just the data but the pages themselves in S3 and did all of my site interaction client-side only, using React or something, I could cut my overall bill in half if not by two-thirds. When I started digging into SSGs they did feel fuzzy and warm like a unicorn, especially playing with Jekyll while coming from a Ruby- and Rails-centric world for the past couple of years, but the more data I threw at it, the more I saw the slow kludginess you spoke of. It was then that I discovered Hugo and started migrating my design and build to that instead. If you read https://forestry.io/blog/hugo-vs-jekyll-benchmark/ you'll understand what I mean, but to be honest the charts don't even do Hugo justice: it was not only faster than anything I've seen at generating my site, it could handle the millions of pages I mentioned like my machine was slicing butter. Granted, I was much more limited in features and what I could do compared to something like Jekyll, but I felt the pain of implementation was well worth the sheer speed improvements.
          I haven't done a billion pages yet, and the transfer costs on that are more than I'm willing to tackle right now, so I'm also working on an S3 syncing tool of my own design that compares ETags before pushing up new pages. I'm also trying to get over the hump of wanting to use Lunr.js as a search tool for all of this. It doesn't support gzipped JSON files, so any JSON I generate for client-side search really limits the kind of data I can load. That's okay, because my focus for now is page and sitemap generation, and I can focus on search and other more prominent features later. And this is just one site dealing with products; I have a slew of ideas for doing the same across real estate, local business listings and social data. If my implementation works out I'll be looking at a 1/20 cost ratio for the revenue generated by the original site.
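The ETag comparison described above could look roughly like the sketch below. It is not the author's tool: `plan_upload` and the `remote_etags` mapping are hypothetical, and fetching the ETags from the bucket is left out. It also assumes every object was uploaded in a single PUT without KMS encryption, since only then is the S3 ETag the plain MD5 of the content; multipart uploads produce a different ETag format.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def md5_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_upload(site_dir: Path, remote_etags: Dict[str, str]) -> List[str]:
    """List keys under site_dir that are new or differ from the remote copy.

    remote_etags maps object key -> ETag as reported by a bucket listing
    (hypothetical input for this sketch).
    """
    changed = []
    for path in sorted(site_dir.rglob("*")):
        if not path.is_file():
            continue
        key = path.relative_to(site_dir).as_posix()
        # S3 reports ETags wrapped in double quotes, so strip them first.
        remote = remote_etags.get(key, "").strip('"')
        if remote != md5_hex(path):
            changed.append(key)
    return changed
```

Only the keys returned by `plan_upload` would need to be pushed, so a daily rebuild that changes a small fraction of pages pays for a small fraction of the uploads.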

          1. 2

            Gosh, that sounds really, really complicated :-)

            I like simple approaches...

            Your Rails costs + bottleneck is because you had your traffic and data flow through it.

            Mistake.

            You have to use a web server (not the Rails application server) in front of it (standard setup, nothing unique here).

            Pick Nginx for this :-)

            It's essentially a 1 line config in Nginx to say:

            "look for the requested page 'real-estate-listings.html' (or whatever page is requested) in a directory first, before you ever send this request to the Rails stack"

            So again, you're not using Rails, ElasticSearch, or anything else except on those rare occasions when you need a specific page to be dynamic, or when someone's paid you a subscription fee and you provide an authenticated dashboard, etc...

            You're serving nothing more than a flat HTML file from a speedy Nginx web server that's been pre-rendered by some web app (Rails in this case).

            Okay, you don't want to crawl your entire site? No problem, let Googlebot (et al.) do it for you: the first visit enters the Rails stack, renders a static HTML page, and dumps it to disk (or S3). Just one of a handful of ways to get around "rendering" it all yourself if you desire :-)

            I have done and do this for Petabytes of data stored in s3. My AWS Rails server and DB server costs (and throughput) are trivial.

            I spend more on coffee every month ;)

            That said, I do have an ES server, and those costs are a totally different story ;-)
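The "look on disk first, render only on a miss" flow described in this comment can be sketched in a few lines. In Nginx the disk lookup is usually expressed with the `try_files` directive; the Python below only illustrates the same logic and is not a replacement for the web server. `serve_page` and the `render` callback (standing in for the Rails stack) are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def serve_page(cache_dir: Path, page_name: str, render: Callable[[str], str]) -> str:
    """Return the cached HTML if present; otherwise render once and persist it."""
    root = cache_dir.resolve()
    cached = (root / page_name).resolve()
    # Refuse names such as "../x.html" that would escape the cache directory.
    if root not in cached.parents:
        raise ValueError(f"page name escapes cache directory: {page_name!r}")
    if cached.is_file():
        # Fast path: the file is already on disk, so the app is never invoked.
        return cached.read_text(encoding="utf-8")
    html = render(page_name)  # slow path: first visit goes through the app
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_text(html, encoding="utf-8")
    return html
```

After the first request for a page, every later request takes the fast path, which is why a crawler's first pass can double as the pre-rendering step.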

            1. 1

              I don't see it as complicated if my goal is maximum optimization in order to improve revenue while decreasing overall server and data transfer costs long term. It's an interesting approach you described, and I've done something similar in the past with another site that was around 200 million pages. You still end up with the Nginx server being a bottleneck as traffic increases, even if you only hit the Rails server for the more "dynamic" stuff. It doesn't matter how fast Nginx is: you'll still need multiple servers sitting behind a load balancer of some kind once you get to a few million users constantly asking for even static page data, though I grant you that the approach is better than a Rails app building pages dynamically. What I'm doing is uploading the pre-built pages to S3, like in your example, but instead of an Nginx server I'm putting CloudFront on top of the S3 site hosting option, pointing my CNAME at the CloudFront address, and using Cloudflare to make sure everything is HTTPS right from the get-go. Because the pages are already built and persisted to S3, Google can churn through them much faster than before, and I can set my webmaster tools crawl setting to max without worrying about the crawler taking down my servers or having to spin up new machines dynamically to accommodate it. The heavy lifting happens beforehand as the pages are generated, and only changes go up to S3, so my data transfer costs remain relatively low. Also, because all of my page generation happens behind the scenes instead of in a live production setting, I can offload much of it to less expensive spot instances in the future. So I've thought about how to keep costs low while also scaling beyond my physical hardware here at home.
              I'll still have an API, with React on the client side instead of a Node server, for stuff like creating a session and/or requesting page updates in real time, but the point is that it all happens on a prerendered page on S3 instead of through a server that needs to render routes. Again, this is all mostly experimental for now, but I'm taking the lessons learned over the last 3 to 4 years of keeping costs really low and reducing costs even further with an architecture change. Thanks for the feedback, and if you have any other questions I'd love to keep the conversation going.
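For pre-built pages to be crawled at this volume, they have to be listed in sitemaps, and the sitemaps.org protocol caps each sitemap file at 50,000 URLs, with a sitemap index file tying the chunks together. The sketch below shows one way to split a URL list accordingly. It is an assumption about how this could be done, not the author's generator: `build_sitemaps`, the file naming scheme, and the `max_urls` override are all hypothetical, and the protocol's file-size limit is not checked.

```python
from typing import Dict, List
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit in the sitemaps.org protocol


def build_sitemaps(urls: List[str], base_url: str, max_urls: int = MAX_URLS) -> Dict[str, str]:
    """Return {filename: xml} for numbered sitemap files plus a sitemap index."""
    header = '<?xml version="1.0" encoding="UTF-8"?>\n'
    root = base_url.rstrip("/")
    files: Dict[str, str] = {}
    for start in range(0, len(urls), max_urls):
        name = f"sitemap-{start // max_urls + 1}.xml"
        entries = "".join(
            f"<url><loc>{escape(url)}</loc></url>"
            for url in urls[start:start + max_urls]
        )
        files[name] = f'{header}<urlset xmlns="{SITEMAP_NS}">{entries}</urlset>'
    # The index is built before it is added to `files`, so it never lists itself.
    index_entries = "".join(
        f"<sitemap><loc>{escape(root + '/' + name)}</loc></sitemap>" for name in files
    )
    files["sitemap_index.xml"] = (
        f'{header}<sitemapindex xmlns="{SITEMAP_NS}">{index_entries}</sitemapindex>'
    )
    return files
```

Because the output is just a mapping of file names to XML strings, the files can be written next to the generated pages and uploaded through the same sync step as everything else.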

  3. 1

    Wow, intense! :-)