As a first-time founder dipping into AI data infrastructure, I'm quietly building a structured coding dataset aimed at bridging gaps in Code LLMs. It's graph-based analysis pulled from real-world repos, designed to help models reason better about code structure and reduce those frustrating hallucinations in generation tasks.
As far as I can tell, the concept is novel: there's no existing cross-language dataset that captures relational dependencies (function calls, variable flows, module interactions) in a clean, queryable graph format.
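To make the graph idea concrete, here's a minimal sketch of what one record might look like. Everything in it (the Node/Edge fields, the edge types, the repo paths) is illustrative, not the actual schema, since that's still being stabilized:

# Illustrative sketch only: the real schema is still in flux,
# so every name here (Node, Edge, edge_type, ...) is hypothetical.
from dataclasses import dataclass

@dataclass
class Node:
    id: str        # e.g. "repoX/utils.py::parse_config"
    kind: str      # "function" | "variable" | "module"
    language: str  # "python", "rust", ...

@dataclass
class Edge:
    src: str        # source node id
    dst: str        # target node id
    edge_type: str  # "calls" | "flows_into" | "imports"

# A toy slice of one repo's graph.
nodes = [
    Node("repoX/utils.py::parse_config", "function", "python"),
    Node("repoX/utils.py::load_yaml", "function", "python"),
    Node("repoX/main.py", "module", "python"),
]
edges = [
    Edge("repoX/utils.py::parse_config", "repoX/utils.py::load_yaml", "calls"),
    Edge("repoX/main.py", "repoX/utils.py::parse_config", "imports"),
]

# "Queryable" in practice: e.g., list everything a function directly calls.
def callees(fn_id: str) -> list[str]:
    return [e.dst for e in edges if e.src == fn_id and e.edge_type == "calls"]

print(callees("repoX/utils.py::parse_config"))  # ['repoX/utils.py::load_yaml']

The point is that dependencies become first-class, typed edges a model (or a benchmark) can query directly, instead of staying implicit in raw source text.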
Last week, I shared an early version with one researcher (LinYi, Assistant Professor at Simon Fraser University), and their feedback was encouraging: "This has real potential to improve LLM performance on code reasoning benchmarks." It's validating, but as a solo bootstrapper with no prior network, the path forward feels steep. The dataset isn't polished enough for GitHub or Hugging Face yet (still stabilizing the schema), and scaling it, say, to cover more languages or 10x the volume, requires high-end compute I don't have access to right now.
What should you expect from something like this? Iterations: messy data cleaning, small wins from tester insights, and the slow grind of validating without resources. But also expect impact: quality datasets are the unsung heroes in making Code LLMs more reliable for devs everywhere.
If you're a tester interested in early access, DM me. And if you're an investor or angel curious about AI data plays, I'd value a quick chat about the space.
#AI #AIDatasets #CodeLLMs #FounderJourney #OpenSourceAI #fundraising