tl;dr: I made the Steam Top 250 in my spare time. Check it out, I guess.

Three months ago I began investigating harvesting Steam data to figure out which games are the best. It occurred to me that the games I loved playing were usually rated Overwhelmingly Positive by the Steam community, but actually finding those games was difficult. What if I could just write a program to find good games for me?

Day one: I put together a basic data importer to fetch data from Steam's poorly documented API. One command fetched the list of "apps" and the other fetched the review data. Enumerating the list of apps and annotating each one with review data required a separate API call per app, a slow process that took roughly nine hours to complete, but it worked.
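In Python, a minimal sketch of that two-step import might look like the following; the GetAppList and appreviews endpoints here are public stand-ins I'm assuming, not necessarily the exact calls the importer made:

```python
import requests

# Step 1: fetch the full list of "apps" (games, DLC, software, videos...).
APP_LIST_URL = "https://api.steampowered.com/ISteamApps/GetAppList/v2/"
apps = requests.get(APP_LIST_URL).json()["applist"]["apps"]

# Step 2: annotate each app with its review summary. One HTTP request per
# app is exactly why the original import took roughly nine hours.
def review_summary(appid: int) -> dict:
    url = f"https://store.steampowered.com/appreviews/{appid}"
    response = requests.get(url, params={"json": 1, "num_per_page": 0})
    return response.json().get("query_summary", {})

for app in apps[:10]:  # just the first ten apps, for demonstration
    summary = review_summary(app["appid"])
    print(app["name"], summary.get("total_positive"), summary.get("total_negative"))
```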

Over the next two days I built a simple website to visualize the data and rank it using the fairly well-known Wilson score interval. Reddit uses this very same algorithm to sort comments based on up and downvotes. I was using it to sort games based on positive and negative reviews.
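For the curious, here's the lower bound of the Wilson score interval in Python. The 95% confidence level (z = 1.96) is the conventional choice, not necessarily the one the site used:

```python
from math import sqrt

def wilson_lower_bound(positive: int, negative: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the proportion of
    positive reviews. z = 1.96 corresponds to 95% confidence."""
    n = positive + negative
    if n == 0:
        return 0.0
    phat = positive / n
    return (phat + z * z / (2 * n)
            - z * sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)

# A game with 500 positive and 50 negative reviews outranks one with a
# perfect 9/0 record, because nine votes can't support a 100% rating.
print(wilson_lower_bound(500, 50))  # ~0.88
print(wilson_lower_bound(9, 0))     # ~0.70
```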

But there was a problem. The list included DLC, software and other non-game entries from Steam's "apps" list, and the imported data didn't contain enough information to filter them out. So once the list had been sorted, a final *decoration* stage scraped additional data, such as the app's type, from the Steam store page so the list could be reduced to just games.
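A sketch of that decoration step is below. Steam's unofficial appdetails endpoint reports each app's type directly, so I'm using it here as a stand-in for the actual store page scraper:

```python
import requests

def app_type(appid: int) -> str | None:
    """Look up an app's type ('game', 'dlc', 'software', ...) via Steam's
    unofficial appdetails endpoint. Returns None if the lookup fails."""
    url = "https://store.steampowered.com/api/appdetails"
    payload = requests.get(url, params={"appids": appid}).json()
    entry = payload.get(str(appid), {})
    return entry["data"]["type"] if entry.get("success") else None

print(app_type(570))  # Dota 2 -> 'game'

# Filtering the ranked list then becomes a one-liner:
# games = [app for app in ranked if app_type(app["appid"]) == "game"]
```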

The proof of concept was complete, but the nine-hour data import was too long to refresh the data every day; by the time it completed it was almost obsolete. For the next six days I investigated ways to harness the concurrent processing power of Travis to automate building the database in parallel chunks. The solution was build stages, a feature that's still in beta and wasn't available just a few months ago, which allows a series of jobs to be processed with a different level of concurrency at each stage.

Stage one downloads the list of apps. Stage two splits the data import into 15 separate chunks to speed it up. Stage three stitches all the chunks back together into a single database. Here's an example of one of the old builds. Stage two is misnamed *test* because the feature is in beta and that stage can't be renamed yet (or maybe I just couldn't figure out how). Anyway, it imports the data in chunks, but Travis only permits five concurrent jobs, so chunks 6-10 have to wait for 1-5 to finish before they can begin, and so on. Nevertheless, it compressed the build time down from nine hours to two or three, depending on how fast the Steam API was responding at the time.
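The .travis.yml for such a pipeline looks roughly like this. The stage names, scripts and chunk numbering are illustrative assumptions rather than the project's real configuration:

```yaml
jobs:
  include:
    # Stage 1: a single job downloads the full list of apps.
    - stage: prepare
      script: ./bin/fetch-app-list

    # Stage 2: 15 parallel jobs each import one chunk of review data.
    # This is the stage stuck with the default name "test".
    - stage: test
      script: ./bin/import-chunk 1
    - script: ./bin/import-chunk 2
    # ...chunks 3 through 14 elided...
    - script: ./bin/import-chunk 15

    # Stage 3: a single job stitches the chunks into one database.
    - stage: stitch
      script: ./bin/build-database
```

Jobs listed without an explicit `stage:` key join the previous stage, which is how the 15 chunk jobs end up running in parallel within stage two.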

Now, with the aid of Travis' cron feature, I had a Steam Top 250 games list automatically downloading live Steam data and using it to generate a static website on GitHub Pages once a day. I posted it to r/Steam, but the feedback was that the list wasn't very good: it contained a lot of games with a high approval rating but fairly few votes, propelling esoteric games into a list that should mostly showcase classics. I wanted to improve the Wilson algorithm, but I'm not a mathematician and had no idea how it actually worked.

However, with the aid of a couple of very smart people, one of whom volunteered because of that very thread, we came up with seven different ranking algorithms between us. It was almost too many! Not only did we have that many algorithms to choose from, but each one could be weighted differently to favor either the approval rating (positive vs. negative votes) or the number of votes, on a sliding scale, giving an almost unlimited number of variants. Nevertheless, we were able to find (at least) one that worked very well for ranking the top 250.
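I won't reproduce the seven algorithms here, but a simple Bayesian average gives a feel for that sliding scale: a single prior-weight parameter decides how many votes a game needs before its approval rating is trusted. This is purely illustrative and not the site's actual ranking:

```python
def bayesian_average(positive: int, negative: int,
                     prior_mean: float = 0.8, prior_weight: int = 500) -> float:
    """Blend a game's approval rating with a prior. A larger prior_weight
    favors vote count; a smaller one favors raw approval rating."""
    n = positive + negative
    return (prior_weight * prior_mean + positive) / (prior_weight + n)

# Sixty glowing reviews barely budge the prior, so an esoteric gem no
# longer leapfrogs a classic with tens of thousands of votes.
print(bayesian_average(60, 0))        # ~0.82
print(bayesian_average(90000, 5000))  # ~0.95
```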

Over the next couple of weeks many improvements were made and new features added, the most notable of which was throwing out all the API calls and just scraping the data from the Steam store pages. This slashed the build time in half, down to typically just one hour, because it turns out Valve uses caching to great effect for their HTML but not so much for their JSON API.

The entire project is open source, including all the data, so you can run your own custom queries. Not only that, but the whole system is built on free services: you need only a free GitHub account and a free Travis account to clone my project and run your very own ranking site. GitHub hosts the code and the static site whilst Travis imports the data and builds the database.

Since then we've set up crowdfunding on the popular Patreon platform to support future growth. Some features are still being developed and are due to launch in a couple of days, such as the supporter rankings, which feature recommendations from supporters directly on the site. Details of the product roadmap are also on our Patreon page.