2 Comments

I built a multi-model code reviewer for $0.30 a run. Here's what it caught.

After shipping a few side projects this year I noticed a pattern. When I asked one model to review my code, it found the obvious stuff. When I asked a second model, it found different obvious stuff. When I asked a third, same problem. Each model has its own blind spots, and the blind spots correlate with model family.

So I built Crucible. It's a Claude Code skill that puts every file through a panel of four frontier models from different vendor families: DeepSeek, Gemini, Kimi, and MiniMax. Each model sees the previous one's findings and either validates them, contests them, or adds new ones. At the end, Claude reads the whole report back against the actual source and tells me which findings are real and which were hallucinated.

That last step is the one that matters. Three models converging on the same hallucination is still a hallucination. The verification phase is what makes the report something I can act on without re-reading every file myself.
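The chain described above can be sketched in a few lines. This is an illustrative outline, not Crucible's actual implementation: the model IDs, prompt format, and `call_model` signature are all assumptions. The model-call function is injected so the chain logic stays testable without a live OpenRouter key.

```python
# Hypothetical sketch of a sequential review panel: each model sees the
# prior findings and can validate, contest, or add new ones.
# Model IDs below are illustrative, not necessarily what Crucible uses.
PANEL = [
    "deepseek/deepseek-chat",
    "google/gemini-flash",
    "moonshotai/kimi",
    "minimax/minimax-01",
]

def review_chain(source, call_model):
    """Run one file through each panel model in sequence.

    call_model(model, prompt) -> str is injected (e.g. a thin wrapper
    around the OpenRouter API), keeping the chain logic network-free
    for testing.
    """
    findings = []
    for model in PANEL:
        prior = "\n".join(findings) or "(no prior findings)"
        prompt = (
            f"Review this file. Prior findings:\n{prior}\n\n"
            f"Source:\n{source}\n\n"
            "Validate, contest, or add findings."
        )
        # Tag each finding with its model so the final verification
        # pass can weigh agreement across vendor families.
        findings.append(f"[{model}] {call_model(model, prompt)}")
    return findings

# Usage with a stub in place of a real API call:
stub = lambda model, prompt: "no issues found"
report = review_chain("def add(a, b): return a + b", stub)
print(len(report))  # one entry per panel model
```

The final verification step would then hand `report` plus the original source to Claude in one more pass, which is where cross-model agreement gets separated from shared hallucination.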

Free, open source, MIT licensed. Drop the folder into ~/.claude/skills/, set your OpenRouter key, restart Claude Code.
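The install steps above look roughly like this. The folder name and the `OPENROUTER_API_KEY` variable are assumptions based on the post and OpenRouter's usual convention; check the repo README for the exact layout.

```shell
# Illustrative install sketch, assuming the cloned folder drops in as-is.
git clone https://github.com/Bambushu/crucible
mkdir -p ~/.claude/skills
cp -r crucible ~/.claude/skills/
export OPENROUTER_API_KEY="sk-or-..."   # your OpenRouter key
# then restart Claude Code so it picks up the new skill
```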

Repo: https://github.com/Bambushu/crucible

posted to Developers on May 5, 2026
  1.

    The verification layer concept is the key insight here. I use AI tools extensively in my marketing workflows (not for code review specifically, but for content research, competitive analysis, data extraction from messy sources). And the single biggest problem is confident hallucination.

    I've been running a similar multi-model approach for research tasks. Ask one model to pull competitor data, ask a different model family to verify it against source material, then have a third pass judge which findings are actually supported. The disagreement between models is where the signal lives. When Claude says X and Gemini says not-X, that's the exact spot I need to go verify manually.

    The $0.30 per run pricing is interesting because it reframes the cost question. Most people think about AI tool costs monthly. But per-run pricing makes it a no-brainer comparison: would you pay 30 cents to catch a bug that would take you 2 hours to find in production? Obviously yes. That framing alone could be your marketing angle.

    One question: how does the sequential chain handle the scenario where the first model's findings are so wrong that they poison the subsequent models' analysis? I've seen this in content research where one model hallucinates a data point and then the next model just accepts it as given context. Does the verification layer catch that reliably?

  2.

    Single-model review usually fails in one of two ways:

    It misses real issues because the model is overconfident in its own reasoning.
    Or it finds plausible nonsense and presents it with enough confidence to waste your time.

    The second pass helps.
    The verification layer is the real product.

    That’s the part most “AI code review” tools still miss.

    The value is not more model opinions.
    It’s forcing adversarial disagreement, then collapsing that into something a developer can trust enough to act on.

    That also makes Crucible a much stronger name than most AI-dev tooling here.
    Short, technical, and appropriately opinionated.

    If you push this beyond Claude Code users, Vroth.com is the strongest upgrade.
    It carries the same hard-edge / infra feel, but scales better as a standalone product than Crucible does.
