
Crawling and hacking datasets - share your stories!

I have taken a few stabs at this over the years and concluded the following:

  • Building datasets is often challenging. It usually takes time to build one, and maintaining and verifying the data can be hard depending on the topic and how the data is collected.

  • The flip side of this is someone coming along and copying the dataset for their own purposes. However, if they do not have the same technology that was used to gather it in the first place, they face the issue of maintaining the data going forward.

  • When data is freely available, easy to replicate, verify, and collect, and does not need to be updated, it is not that valuable.

  • Having a high-quality dataset and a unique process for maintaining it can be a significant advantage and perhaps valuable enough to be monetized. A basic use case is, e.g., an app for recognizing various mushrooms while out in the woods, which I know has been done successfully and monetized.

Having dynamic content and users who update that data is possibly the best-case scenario for making it difficult to copy, but at the same time there are many examples of more static data being presented in valuable ways, e.g. IMDb, App Annie, Similarweb, SpyFu, Wikipedia, etc.

I would like to hear from indie hackers who are working with datasets in some way. What are some ways that help you tackle updates or verify data accuracy, and how are you presenting the data and monetizing it?

A few of my stories from over the years:

  • My friend and I wanted to build an app for food trucks. We were able to gather an initial dataset, but it was mostly outdated and maintaining the data would have been extremely difficult. We would have needed users to update their own data, but getting initial traction was hard so we abandoned this project and moved on.

  • I wanted to hack the Chrome Web Store, so I built a crawler that collects data about new Chrome extensions. I have been running that crawler for a few years now. Consequently, I know quite a bit about the store and I have a large dataset of items published there in the past few years. I haven't done anything with this beyond just collecting. A few other people also collect data on this same topic and share it online, so I do not look at it as particularly valuable, except maybe the email addresses that come with it.

  • I built a Twitter bot to collect tweets from a specific account. I was doing sentiment analysis on those tweets, mainly out of curiosity. I stopped doing the analysis but still have the crawler running because, why not.

  • I wanted to build something around ML - specifically NLP. I dug into the code of a major Finnish newspaper and figured out how to read their API. I have been collecting the headlines for about 2 years. I share this on GitHub but no one cares, oh well :D

All my crawlers are automated and work in some chronological fashion. I build them and leave them to run. I haven't had any success building anything that requires consistent maintenance, and I am curious to hear some successful strategies for how to do this OR how to get other people to do it for you. In fact, I was in the process of working on a dataset, which is what prompted me to write this in the first place.
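For readers who want something concrete, here is a minimal sketch of the "build it and leave it running" pattern: each run appends only items it has not stored before, with a collection timestamp. This is an illustration under assumptions, not the code behind any of the crawlers above. The function names, the `id` field, and the JSON Lines storage format are all hypothetical, and the fetching step is left out, so `items` stands in for whatever a site or API returns.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def load_seen_ids(path):
    """Return the set of item ids already stored in the JSON Lines file."""
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def collect_new(items, path):
    """Append items whose id is not stored yet; return how many were added.

    `items` is assumed to be an iterable of dicts that each carry a
    unique "id" key (a hypothetical schema chosen for this sketch).
    """
    seen = load_seen_ids(path)
    added = 0
    with path.open("a", encoding="utf-8") as f:
        for item in items:
            if item["id"] in seen:
                continue  # already collected on an earlier run
            # Stamp each record so the dataset stays in chronological order.
            record = dict(item, collected_at=datetime.now(timezone.utc).isoformat())
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            seen.add(item["id"])
            added += 1
    return added
```

Run from a scheduler such as cron, a script like this only ever grows the file, which is why it needs little maintenance. It does not handle items that change or disappear after they were collected, which is the harder part of the problem.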

What kind of stories do you all have?
