Product Development July 11, 2020

Crawling and hacking datasets - share your stories!

Neea @neea

I have taken a few stabs at this over the years and concluded the following:

  • Building datasets is often challenging. It usually takes time to build one, and maintaining and verifying the data can be difficult depending on the topic and how it is done.

  • The flip side of this is someone coming along and copying the dataset for their own purposes. However, if they do not have the same technology that was used to gather it in the first place, they face the problem of maintaining the data going forward.

  • Data that is freely available, easy to replicate, verify, and collect, and does not need to be updated is not that valuable.

  • Having a high-quality dataset and a unique process for maintaining it can be a significant advantage, and perhaps valuable enough to monetize. A basic use case would be e.g. an app for recognizing various mushrooms while out in the woods, which I know has been done and monetized successfully.

Having dynamic content and users who update that data is possibly the best-case scenario for making a dataset difficult to copy, but at the same time there are many examples of more static data being presented in valuable ways, e.g. IMDb, App Annie, SimilarWeb, SpyFu, Wikipedia, etc.

I would like to hear from indie hackers who are working with datasets in some way. What are some approaches that help you tackle updates or verify data accuracy, and how are you presenting and monetizing the data?

A few of my stories from over the years:

  • My friend and I wanted to build an app for food trucks. We were able to gather an initial dataset, but it was mostly outdated and maintaining the data would have been extremely difficult. We would have needed users to update their own data, but getting initial traction was hard so we abandoned this project and moved on.

  • I wanted to hack the Chrome Web Store, so I built a crawler that collects data about new Chrome extensions. I have been running that crawler for a few years now. Consequently, I know quite a bit about the store and have a large dataset of items published there over the past few years. I haven't done anything with this beyond just collecting. A few other people also collect data on this same topic and share it online, so I do not consider it particularly valuable, except maybe the email addresses that come with it.

  • I built a Twitter bot to collect tweets from a specific account and was doing sentiment analysis on them. This was mainly out of curiosity. I stopped doing the analysis but still have the crawler running because, why not.

  • I wanted to build something around ML, specifically NLP. I dug into the code of a major Finnish newspaper and figured out how to read their API. I have been collecting their headlines for about 2 years. I share this on GH, but no one cares, oh well :D
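The sentiment analysis from the Twitter-bot story above can be sketched with a simple lexicon approach. The word lists below are tiny illustrative samples, not a real sentiment lexicon (an actual pipeline would use a proper lexicon or model):

```python
# Minimal lexicon-based sentiment scoring for short texts such as tweets.
# POSITIVE and NEGATIVE are toy samples, not a real lexicon.

POSITIVE = {"great", "good", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def sentiment_score(text: str) -> int:
    """Return positive-minus-negative word count; > 0 leans positive."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    "I love this great new feature",
    "terrible update, I hate it",
]
scores = [sentiment_score(t) for t in tweets]
```

Scoring each tweet as it is collected keeps the analysis incremental, which suits a crawler that runs continuously.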

All my crawlers are automated and work in some chronological fashion. I build them and leave them to run. I haven't had any success building anything that requires consistent maintenance, and I am curious to hear some successful strategies for doing this OR for getting other people to do it for you. In fact, I was in the process of working on a dataset, which is what prompted me to write this in the first place.

What kind of stories do you all have?

  1.

    I've done a bunch of work with geographic datasets. Lots of year-over-year updates.

    Merging of new datasets with different metadata.

    Partial updates of prior datasets (sometimes re-introducing errors that had been previously scrubbed).

    Data normalization, oh goodness. oscar_flashback.jpg

    One of our best practices was keeping process metadata right there with the actual data. We added fields to track operations. That also forced us to work out decent process conventions.

    When Postgres added JSON support, that helped immensely :)
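The "process metadata right there with the actual data" pattern this comment describes can be sketched as below. The commenter used Postgres JSON columns; this sketch uses sqlite3 from the Python standard library only so it is self-contained, and the table and operation names are invented:

```python
import json
import sqlite3

# Each row carries its own operations log, so you can always see which
# processing steps touched a record (e.g. when a partial update risks
# re-introducing previously scrubbed errors).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE places (
        id INTEGER PRIMARY KEY,
        name TEXT,
        ops_log TEXT  -- JSON list of operations applied to this row
    )
""")

def record_op(row_id, op):
    """Append an operation tag (e.g. 'normalized') to a row's log."""
    cur = conn.execute("SELECT ops_log FROM places WHERE id = ?", (row_id,))
    log = json.loads(cur.fetchone()[0])
    log.append(op)
    conn.execute("UPDATE places SET ops_log = ? WHERE id = ?",
                 (json.dumps(log), row_id))

conn.execute("INSERT INTO places VALUES (1, 'Helsinki', '[]')")
record_op(1, "normalized")
record_op(1, "merged:2020-update")
```

With Postgres, the same log would live in a `jsonb` column and could be queried directly instead of round-tripping through the application.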

  2.

    What kind of stories... not very interesting ones... you crawl, they change or block, you catch up and update... over and over. The ultimate goal is to get them to want their data in your dataset.

    I've run into a couple of interesting traps that may or may not be legal.

    1.

      But when you crawl, are you attempting to add some value to the data you are gathering, or just collecting it as-is?

      For example you could add value by taking multiple sources and combining the data in some novel way.

      I tend to do things in a chronological way because it adds a timeline axis to the data, which may be absent in the original dataset. If you weren't able to capture something at the time it was available, you have effectively missed it.

      1.

        Yes, I always add value, though not at crawl time, and/or combine sources. Adding things like a custom taxonomy (including dates), enhancing the data, and scoring are great ways to add value. These steps usually happen in the processing pipeline, in real time during search, or sometimes in the UI.
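A post-crawl enrichment step like the one described (custom taxonomy plus scoring) might look roughly like this. The taxonomy entries and the toy completeness score are invented for illustration:

```python
# Sketch of a pipeline enrichment step: attach a taxonomy category and a
# simple score to each raw record. TAXONOMY and the score are toy examples.

TAXONOMY = {
    "food truck": "food",
    "extension": "software",
}

def enrich(record):
    """Return a copy of the record with 'category' and 'score' added."""
    text = record["title"].lower()
    category = next((cat for key, cat in TAXONOMY.items() if key in text),
                    "uncategorized")
    # Toy score: reward records with more filled-in fields.
    score = sum(1 for v in record.values() if v not in (None, ""))
    return {**record, "category": category, "score": score}

raw = {"title": "Best Food Truck Finder", "url": "https://example.com"}
enriched = enrich(raw)
```

Running this in the pipeline (rather than at crawl time) means the taxonomy and scoring rules can be changed and re-applied without re-crawling.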

        1.

          Interesting. I think this is similar to musical artists making remixes and then producing something even greater as output.

          Do you build apps or APIs, or what do you do with the data? I'm really interested in finding ways to offer the information I have collected in some meaningful way, but I've had zero success so far. Not sure what would be a good strategy?

          I am working on a dataset right now and it is taking me quite a bit of time to put together because it involves doing a lot of manual validation. I plan to use it to maybe build some apps, maybe some digital products. Either way I want to try to monetize it.

          I would love to hear if you have any successful experiences or strategies that have worked for you, and would like to share.

          1.

            I suppose that's an apt analogy; I look at it more as how a search engine does it. I'm working with automotive data, building a search engine for automotive classified listings.

            What you do with the data will depend on the market you're in and demand, so I can't really answer that for you. But if you build an app, you're going to have to build an API, so that's a decent way to validate how useful your dataset is.

            I recommend validating as much of the data as possible with machine learning, natural language processing, or a decision tree.
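The decision-tree option, the simplest of the three, can be sketched as a few hand-written rules that flag records for manual review. Since the commenter works with automotive listings, the example uses that domain, but the field names and thresholds are invented:

```python
# Sketch of rule-based validation for a crawled automotive listing.
# Field names and thresholds are illustrative, not from a real schema.

def validate_listing(listing):
    """Return a list of problems; an empty list means the listing looks plausible."""
    problems = []
    if not listing.get("make"):
        problems.append("missing make")
    price = listing.get("price")
    if price is None or not (100 <= price <= 500_000):
        problems.append("price out of plausible range")
    year = listing.get("year")
    if year is not None and not (1950 <= year <= 2021):
        problems.append("implausible model year")
    return problems

ok = validate_listing({"make": "Toyota", "price": 12_000, "year": 2015})
bad = validate_listing({"make": "", "price": 5})
```

Rules like these catch the bulk of obviously broken records cheaply; ML or NLP can then be reserved for the ambiguous remainder.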
