Six months ago, I watched an engineer spend 90 minutes debugging a production error at 2am.
The root cause?
That moment stuck with me.
Not because the bug was difficult. Not because the engineer wasn't good.
But because almost all of that time was spent finding the problem, not fixing it.
So I started building one.
Today it's live at debugcause.com.
DebugCause is a self-hosted AI debugging platform designed to go from production error → validated fix proposal.
The pipeline has six stages:
1. Ingest
Connects to Loki, Datadog, CloudWatch, Sentry, log files, or webhooks.
No new infrastructure required.
2. Detection & Deduplication
Parses stack traces, extracts errors, generates stable fingerprints, and collapses duplicates. If the same exception occurs 1,000 times, the system investigates it once. Cost scales with unique bugs rather than raw log volume.
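To make the fingerprinting idea concrete, here is a minimal sketch of one way it could work. It is not DebugCause's actual implementation. It assumes standard CPython traceback text, and the names `fingerprint` and `group_by_fingerprint` are hypothetical. The hash deliberately ignores line numbers and exception messages, so the same bug maps to the same key even when those vary.

```python
import hashlib
import re
from collections import defaultdict

# Matches the frame lines of a standard CPython traceback, e.g.
#   File "/app/db/client.py", line 10, in fetch
FRAME_RE = re.compile(
    r'^\s*File "(?P<path>[^"]+)", line \d+, in (?P<func>\S+)',
    re.MULTILINE,
)


def fingerprint(traceback_text: str) -> str:
    """Return a short, stable hash for a traceback (illustrative only)."""
    # Keep only file basename and function name per frame; drop line numbers.
    frames = [
        (m["path"].rsplit("/", 1)[-1], m["func"])
        for m in FRAME_RE.finditer(traceback_text)
    ]
    # The last non-empty line looks like "KeyError: 123"; keep only the type.
    lines = traceback_text.strip().splitlines()
    exc_type = lines[-1].split(":", 1)[0].strip() if lines else "unknown"
    key = exc_type + "|" + "|".join(f"{name}:{func}" for name, func in frames)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def group_by_fingerprint(tracebacks):
    """Collapse duplicates: map each fingerprint to its raw occurrences."""
    groups = defaultdict(list)
    for tb in tracebacks:
        groups[fingerprint(tb)].append(tb)
    return dict(groups)
```

With a grouping like this, an investigation would be started once per key in the result, not once per log line.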
3. Investigation
The agent works against the actual indexed codebase.
It can:
Instead of generating an answer immediately, it builds a root-cause theory iteratively, much closer to how a senior engineer would debug.
4. Fix Validation
The agent generates a diff, proposes a fix, creates a regression test, and validates everything before producing a recommendation.
A second independent AI pass reviews the diagnosis.
5. Notification
Results are routed by confidence.
High confidence can create a draft GitHub PR.
Lower confidence gets routed for review.
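As a rough illustration of confidence-based routing, the sketch below shows the decision as a single function. The threshold value, the `Investigation` fields, and the `Route` names are all assumptions made for the example and are not taken from the product.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    DRAFT_PR = "draft_pr"
    HUMAN_REVIEW = "human_review"


# Hypothetical cut-off; the real threshold is not stated in this post.
PR_THRESHOLD = 80


@dataclass(frozen=True)
class Investigation:
    fingerprint: str
    confidence: int          # score on a 0-100 scale
    has_validated_fix: bool  # True only if a fix was proposed and validated


def route(inv: Investigation, pr_threshold: int = PR_THRESHOLD) -> Route:
    """Pick a destination for an investigation result (illustrative only)."""
    if not 0 <= inv.confidence <= 100:
        raise ValueError(f"confidence out of range: {inv.confidence}")
    # A draft PR needs both a validated fix and a high enough score.
    if inv.has_validated_fix and inv.confidence >= pr_threshold:
        return Route.DRAFT_PR
    # Everything else, including refusals with no fix, goes to a person.
    return Route.HUMAN_REVIEW
```

Keeping the rule in one small function makes the threshold easy to tune per team without touching the rest of the pipeline.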
6. Dashboard
Every investigation, report, confidence score, diff, and regression test is visible from a single interface.
I recently stress-tested the system against httpie/cli, a real open-source Python project.
No synthetic examples.
No toy repositories.
Three runs taught me more than months of development.
Run 1
The agent found the root cause, generated a one-line fix, produced a working test, and completed the investigation in around three minutes.
Confidence: 70/100.
It wasn't perfect.
The diagnosis was correct, but one implementation detail differed from the eventual upstream approach.
That nuance showed up in the confidence breakdown before any human reviewed it.
Run 2
This became my favorite run.
The traceback referenced a file that didn't exist in the indexed codebase.
Many AI systems would confidently invent a solution.
DebugCause didn't.
Instead it:
A second validation pass reached the same conclusion.
No hallucinated code.
No fake certainty.
Just an honest answer.
Run 3
Same issue.
Same agent.
One change.
I re-indexed the correct release version of the repository.
The result was completely different.
The agent generated a clean one-line fix, produced a regression test, validated the change, and finished in under four minutes.
That experience reinforced something important:
Repository context matters more than model choice.
So far the system has been validated using:
The most important metric isn't whether the agent writes code.
It's whether the agent knows when not to.
I'm opening access to the first two engineering teams running Python backends.
What you'll get:
What I want in return:
No credit card.
No sales process.
No obligation.
Just trying to learn whether this solves a problem worth solving.
If you're interested, leave a comment or send me a DM.
I'd love to hear what I'm missing.
More details and the full technical write-up are available at debugcause.com.
Read the three-run proof section, and the 20/100 honest refusal is the most convincing thing on the page.
A tool that declines to invent a diff is exactly what makes me trust the ones it does produce. You also already closed the two boundaries most people get wrong: self-hosted means the code stays in my infra, and BYO-LLM plus no telemetry means I decide whether trace content ever leaves. Credit for that, it is the part teams actually ask about.
The one surface I did not see handled is what is inside the traces themselves.
Production stack traces and the surrounding log context routinely carry secrets and PII: auth tokens, session cookies, user emails, request bodies. Even with everything self-hosted, there are two places where that can leak. If someone points BYO-LLM at a hosted API for quality, the trace content goes with it.
And the GitHub PR output, which for most teams is a third-party SaaS outside the self-hosted boundary, can quote a trace line or a fixture that contains a live secret straight into a PR description or commit. A secret-scan or redaction pass on ingested traces before they reach the model or the PR would close it, and "we scrub secrets out of traces before they touch the LLM or your PR" is a strong line you could be making and currently are not.
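For what it is worth, a redaction pass like that can start very small. The sketch below is only an illustration of the idea, not a complete secret scanner: the patterns, the `redact` name, and the placeholder format are all my assumptions, and a real deployment would want a much broader rule set.

```python
import re

# A few illustrative patterns; real traces would need many more rules.
BEARER_RE = re.compile(r"\bbearer\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE)
KEY_VALUE_RE = re.compile(
    r"\b(password|passwd|secret|token|api[_-]?key)\b(\s*[=:]\s*)(\S+)",
    re.IGNORECASE,
)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Mask likely secrets and emails before text reaches a model or a PR."""
    # Replace whole bearer credentials, scheme included.
    text = BEARER_RE.sub("[REDACTED:bearer]", text)
    # Keep the key name and separator, mask only the value.
    text = KEY_VALUE_RE.sub(r"\1\2[REDACTED]", text)
    # Mask email addresses.
    text = EMAIL_RE.sub("[REDACTED:email]", text)
    return text
```

Running every ingested trace line and every generated PR body through a function like this gives one choke point to audit, which is easier to reason about than scrubbing at each integration.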
Smaller one: confidence-routed auto-PRs are great.
The case I would keep a human gate on is auth, payments, and migration code, where a regression test the pipeline wrote itself can pass while masking the real bug.
The second AI pass helps, but it is reviewing its own family's work. Solid project overall.
Honestly, what makes me hesitate isn't the debugging workflow itself.
It's that the same results could justify a few very different businesses.
One team might see this as an AI debugging tool.
Another might see it as an incident-response tool.
Another might see it as a confidence layer that helps engineers decide what deserves attention in the first place.
I've seen founders get strong early validation and still end up optimizing around the wrong interpretation of why people cared.
That's probably the question I'd spend the most time with while talking to design partners.
Run 2 seems quite interesting. Congrats on the launch!
This is genuinely impressive work. What stands out isn’t just the debugging automation; it’s the emphasis on uncertainty and validation. The fact that the system refuses to generate fixes when context is missing, assigns low confidence to weak conclusions, and validates its own recommendations addresses one of the biggest concerns with AI-assisted engineering today. The Run 2 example was particularly compelling because it demonstrated restraint rather than overconfidence. Building a tool that knows when not to act is often harder than building one that does. Looking forward to seeing how it performs in real production environments.
Thanks, I really appreciate that.
The next step is exactly what you mentioned: validating these assumptions on real production workloads and seeing where the system breaks. I'm expecting to learn a lot once it starts encountering issues that aren't part of controlled testing.
Thanks again for taking the time to read through it and for calling out that aspect specifically.