Where to find: AI Training Data?🤔

by ericlamcrypto

Hey IH,

Where do you go to find training data and models for AI?

yes
no
not sure

on January 13, 2023

2

Depending on the use case, I'd look for a dataset on Kaggle. If I need a standard dataset (the kind used in academic papers), I'd go to a relevant paper and see which dataset the authors used. Most of them are free.

For models, it again depends on the use case. For NLP-related tasks there is OpenAI, and for image generation there is Midjourney. For computer vision, there are built-in pretrained models in TensorFlow and PyTorch; otherwise I'd find the GitHub repo of a related academic paper and get the model there.

For me, having a marketplace of these is kinda redundant. I mean it would be nice, but I'm not sure if it would add that much value. Hope you find it helpful.

codybryy · 3 years ago
1

Kaggle already exists and there is plenty of training data there.

RichardGao · 3 years ago
1. 1
  
  thanks! noted
  
  ericlamcrypto7
  
  ·
  3 years ago
  ·
  Reply
1
For data:
- I'd begin with a Google search for survey papers on the task, sometimes filtering for results from the last year
- Then I look for the benchmark datasets referenced in the paper and check whether they are openly available
- Go to the dataset's site and request permission from the researchers. If possible, use academictorrents.com
- Sometimes you can use scrapy.org to crawl domains, downloading images and structured data
- If you need video, try youtube-dl
- Fine-tuning a pretrained model for a custom image classification/detection task may be possible with only a few hundred labeled images; to collect those I use the image-downloader Chrome extension
- Use Blender, Unity, or Unreal Engine to synthesize datasets like PeopleSansPeople
- Amazon, UC Irvine, and Kaggle keep public data repositories that may be useful
- Data labeling tools and/or services
- Buying a license to a dataset
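As a rough illustration of the crawling bullet above, here is a minimal Python sketch that pulls image URLs out of a page's HTML using only the standard library. It is not Scrapy itself, just the core idea; the class `ImageLinkParser`, the helper `extract_image_urls`, and the example.com URLs are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return absolute image URLs from html, in document order, without duplicates."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    seen = set()
    unique = []
    for url in parser.image_urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


if __name__ == "__main__":
    # Placeholder page content; in practice this would be the fetched HTML.
    page = '<html><body><img src="/cats/1.jpg"><img src="https://example.com/dogs/2.png"></body></html>'
    for url in extract_image_urls(page, "https://example.com/gallery/"):
        print(url)
```

A real crawler would also need to fetch pages, follow links, and throttle requests, which is the part a framework like Scrapy handles for you.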
For models:
- There is usually a tradeoff between speed and accuracy. What are your requirements?
- Where do you want to run the model? If it's on a mobile/edge device, try a mobile-friendly backbone like the widely supported MobileNetV2 architecture
- What is your inference engine? If you run models on NVIDIA hardware, look into TensorRT. If you want to run models on Intel's VPU, you will want models supported by the ONNX/OpenVINO toolkit.
- Check out the model zoos for these engines; common use cases like person detection should have high-quality public models. Are your target objects represented in the VOC dataset?
- Use transfer learning to specialize a pretrained model to your prediction task
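The speed versus accuracy question above can be made concrete with a small helper: given candidates you have benchmarked yourself on your target device, pick the most accurate one that fits a latency budget. This is only a sketch; `Candidate`, `pick_model`, and the numbers in the usage example are hypothetical placeholders, not measurements of any real model.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float    # validation accuracy measured on your own data, 0..1
    latency_ms: float  # per-inference latency measured on your target device


def pick_model(candidates: Sequence[Candidate], max_latency_ms: float) -> Optional[Candidate]:
    """Return the most accurate candidate within the latency budget, or None."""
    eligible = [c for c in candidates if c.latency_ms <= max_latency_ms]
    if not eligible:
        return None
    # Prefer higher accuracy; break ties with lower latency.
    return max(eligible, key=lambda c: (c.accuracy, -c.latency_ms))


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    measured = [
        Candidate("small_backbone", accuracy=0.88, latency_ms=12.0),
        Candidate("medium_backbone", accuracy=0.92, latency_ms=35.0),
        Candidate("large_backbone", accuracy=0.95, latency_ms=140.0),
    ]
    choice = pick_model(measured, max_latency_ms=50.0)
    print(choice.name if choice else "no model fits the budget")
```

The point of writing it down is that the latency budget comes first: it is set by where the model runs, and accuracy is then maximized within it.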
For some readers, the overhead of training a new model will justify using a tool like remyx.ai.
terry_remyx

·
3 years ago
·
Reply
1. 1
  
  thank you
  
  ericlamcrypto7
  
  ·
  3 years ago
  ·
  Reply
