Building a Code Knowledge Graph

As a first-time founder dipping into AI data infrastructure, I'm quietly building a structured coding dataset (a novel concept) aimed at bridging gaps in Code LLMs. It's a graph-based analysis of real-world repos, designed to help models reason better about code structure and reduce those frustrating hallucinations in generation tasks.

We found no cross-language dataset that captures relational dependencies (like function calls, variable flows, and module interactions) in a clean, queryable graph format.
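
To make "relational dependencies in a queryable graph" a bit more concrete, here is a minimal sketch of just one relation type (function calls) for one language (Python), using the standard library's ast module. This is an illustration only, not the dataset's actual schema or pipeline: the sample source and the helper names build_call_graph and callers_of are hypothetical, and it only tracks direct calls to plain names inside top-level functions.

```python
import ast

# Hypothetical sample input; a real pipeline would read files from a repo.
SOURCE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def main(path):
    return parse(load(path))
'''

def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function name to the plain names it calls."""
    tree = ast.parse(source)
    graph: dict[str, set[str]] = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            callees = graph.setdefault(node.name, set())
            for inner in ast.walk(node):
                # Only direct calls like f(x); method calls such as
                # obj.f(x) are attribute accesses and are skipped here.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    callees.add(inner.func.id)
    return graph

def callers_of(graph: dict[str, set[str]], target: str) -> list[str]:
    """Query the graph: which functions call `target`?"""
    return sorted(caller for caller, callees in graph.items() if target in callees)

graph = build_call_graph(SOURCE)
print(callers_of(graph, "load"))
```

Variable flows and module interactions would need more edge types and a per-language parser, which is where a shared cross-language schema gets hard.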

Last week, I shared an early version with one researcher (LinYi, Assistant Professor at Simon Fraser University), and their feedback was encouraging: "This has real potential to improve LLM performance on code reasoning benchmarks." It's validating, but as a solo bootstrapper with no prior network, the path forward feels steep. The dataset isn't polished enough for GitHub or Hugging Face yet (I'm still stabilizing the schema), and scaling it, say to cover more languages or 10x the volume, requires high-end compute, which I don't have access to right now.

What to expect from something like this? Expect iterations: messy data cleaning, small wins from tester insights, and the slow grind of validating without resources. But also expect impact: quality datasets are the unsung heroes in making Code LLMs more reliable for devs everywhere.

If you're a tester interested in early access, DM me. And if you're an investor or angel curious about AI data plays, I'd value a quick chat on the space.

#AI #AIDatasets #CodeLLMs #FounderJourney #OpenSourceAI #fundraising

on October 5, 2025