
Where to find: AI Training Data?🤔

Hey IH,

Where do you go to find training data and models for AI?

Should there be a marketplace for this?
  1. yes
  2. no
  3. not sure
on January 13, 2023
  1. 2

    Depending on my usage, I would find my dataset on Kaggle. If I need a formal dataset (the kind usually used in academic papers), I would just go to a relevant paper and see which dataset the authors used. Most of them are free.

    For models, again it depends on my usage. For NLP-related tasks, there is OpenAI; for image generation, there is Midjourney. For computer vision, there are built-in models in TensorFlow or PyTorch; otherwise I would find the GitHub repo of a related academic paper and get the model there.

    For me, having a marketplace of these is kinda redundant. I mean it would be nice, but I'm not sure if it would add that much value. Hope you find it helpful.

  2. 1

    Kaggle already exists and there is plenty of training data there.

  3. 1

    For data:

    • I'd begin with a Google search for survey papers on the task, sometimes filtering for results from the last year.
    • Then I look for the benchmark datasets referenced in the paper and check whether they are openly available.
    • Go to the dataset's site and request permission from the researchers if access is restricted. If possible, use academictorrents.com.
    • Sometimes you can use Scrapy (scrapy.org) to crawl domains and download images and structured data.
    • If you need video, try youtube-dl.
    • Fine-tuning a pretrained model for a custom image classification/detection task may be possible with only a few hundred labeled images; to collect those I use the Image Downloader Chrome extension.
    • Use Blender, Unity, or Unreal Engine to synthesize datasets like PeopleSansPeople.
    • Amazon, UC Irvine, and Kaggle keep public data repositories that may be useful.
    • Data labeling tools and/or services
    • Buying a license to a dataset
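
    Whichever route the images come from, a few hundred labeled files usually need to be split into training and validation sets before fine-tuning. Here is a minimal sketch using only the Python standard library. The "<label>/<filename>" layout, the function name split_by_label, and the 80/20 ratio are assumptions made for illustration; none of them come from this thread.

```python
import random
from collections import defaultdict
from pathlib import PurePosixPath


def split_by_label(paths, val_fraction=0.2, seed=0):
    """Split image paths into (train, val) lists, label by label.

    Assumes each path looks like "<label>/<filename>" (hypothetical layout).
    """
    by_label = defaultdict(list)
    for p in paths:
        by_label[PurePosixPath(p).parent.name].append(p)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for label in sorted(by_label):
        items = sorted(by_label[label])
        rng.shuffle(items)
        if len(items) > 1:
            # Hold out at least one image per label, but never all of them.
            n_val = min(len(items) - 1, max(1, round(len(items) * val_fraction)))
        else:
            n_val = 0  # a single image stays in the training set
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val


if __name__ == "__main__":
    # Placeholder file names, not a real dataset.
    demo = [f"cat/{i}.jpg" for i in range(10)] + [f"dog/{i}.jpg" for i in range(5)]
    train, val = split_by_label(demo)
    print(len(train), "training images,", len(val), "validation images")
```

    Splitting per label rather than over the whole pool keeps rare classes from vanishing from the validation set, which matters when the dataset is this small.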

    For models:

    • We often trade off speed against accuracy. What are your requirements?
    • Where do you want to run the model? If it's on a mobile/edge device, try a mobile-friendly backbone like the widely supported MobileNetV2 architecture.
    • What is your inference engine? If you run models on NVIDIA hardware, look into TensorRT. If you want to run models on Intel's VPU, you will want models supported by the ONNX/OpenVINO toolkit.
    • Check out the model zoos for these engines; common use cases like person detection should have high-quality public models. Are your target objects represented in the VOC dataset?
    • Use transfer learning to specialize a pretrained model to your prediction task
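
    The speed-versus-accuracy question above can be made concrete once you have measured a few candidate models on your own device and validation set. This is only a rough sketch of one way to choose: take the most accurate model that fits a latency budget. The Candidate class, the pick_model function, and all numbers in the demo are hypothetical placeholders, not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    latency_ms: float  # measured on your own target device
    accuracy: float    # measured on your own validation set


def pick_model(candidates, max_latency_ms):
    """Return the most accurate candidate within the latency budget, or None."""
    eligible = [c for c in candidates if c.latency_ms <= max_latency_ms]
    if not eligible:
        return None
    # Prefer higher accuracy; break ties in favor of lower latency.
    return max(eligible, key=lambda c: (c.accuracy, -c.latency_ms))


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    candidates = [
        Candidate("small_model", latency_ms=10.0, accuracy=0.80),
        Candidate("medium_model", latency_ms=30.0, accuracy=0.88),
        Candidate("large_model", latency_ms=90.0, accuracy=0.93),
    ]
    print(pick_model(candidates, max_latency_ms=50.0))
```

    If nothing fits the budget, the function returns None, which is a hint to try a lighter backbone or a faster inference engine before giving up accuracy elsewhere.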

    For some readers, the overhead of training a new model will justify using a tool like remyx.ai.
