A bit of context: we make Ziva, an AI agent that writes Godot code. We've been heads-down for ~5 months, and the thing that has kept shaping the product more than anything else is a single number from this year's Sonarsource State of Code report: 60% of faults in AI-generated code are "silent failures." Code that compiles, looks right when you skim it, passes whatever surface tests it has, then does the wrong thing in production.
Why this matters more for game dev than other domains: in a webapp, a wrong type bombs out at the API boundary. You see a 500 in your dashboard. Sentry pages someone. The failure is loud.
In Godot, the equivalent of that 500 error often doesn't fire at all. The node.connect() call references a target that gets freed two frames later. The load("res://scenes/Player.tscn") returns null because the path is wrong. The autoload reference fails because the autoload was never registered in project settings. None of those throw. None of those leave a stack trace. The game just keeps running with a quietly broken behavior, and you don't find out until a player on Steam asks why the death screen never shows up.
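To make that concrete, here's a minimal sketch of what those failures look like in code, assuming Godot 4.x (the scene path, node names, and the custom "died" signal are all made up for illustration). Nothing in it throws, and without the explicit checks the game just carries on with the player missing:

```gdscript
# Minimal sketch of the silent-failure modes above (Godot 4.x). The scene
# path, node names, and the custom "died" signal are all hypothetical.
extends Node

func _ready() -> void:
    # Wrong path: load() pushes an error to the Output panel and returns null.
    # Nothing throws, so without the null check the player simply never spawns.
    var player_scene := load("res://scenes/Playr.tscn") as PackedScene  # typo'd path
    if player_scene == null:
        push_warning("Player scene failed to load; check the path")
        return
    var player := player_scene.instantiate()
    add_child(player)

    # Wire the death screen to the player's assumed "died" signal. If another
    # script frees the HUD node a couple of frames later, Godot drops the
    # connection silently and the death screen never shows up.
    var hud := get_node_or_null("HUD")
    if hud != null:
        player.connect("died", Callable(hud, "_on_player_died"))
```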
We saw this enough times in our own dev that we kept asking the same question: why are AI tools writing code without ever running it?
The standard AI coding tool model, as of mid-2026, is text-in / text-out. The model edits your files. It might run a CLI command if you give it shell access. But it does not open Godot. It does not press Play. It does not see the output panel. The dev/AI/dev loop is something like:

1. You describe the change you want and the AI writes the code.
2. You paste the code into your project.
3. You switch to Godot and press Play.
4. You watch the game and the output panel to figure out what broke.
5. You copy the error (or describe the broken behavior) back into the chat, and go again.
In our user logs the median number of round trips for a non-trivial Godot bug was around 5. That's 5 context switches per bug. We're indie devs, our context switching budget is limited, and most of us are juggling code, art, music, and marketing in the same week. Five context switches per bug is a productivity tax we don't want to pay.
So the question for the product was: what does it look like to remove steps 2 through 5?
Short version: we wired Ziva up to Godot directly. The agent has an MCP API into the editor. It can:
- open the project in the editor
- press Play and run the game
- read the output panel, errors included
- inspect live runtime state (e.g. what parameters/playback is currently set to in an AnimationTree)

The same agent that writes the GDScript runs the GDScript. If the change broke the signal wiring, the agent sees the missing connection in the output. If the change put the player reference in the wrong autoload, the agent sees the Nil reference. The fix happens inside the agent's loop, not yours.
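To make "sees the Nil reference" concrete, here's roughly what that failure looks like from the game's side. This is an illustration only, not Ziva's code; GameState, the node path, and the deaths field are hypothetical:

```gdscript
# Illustration only: GameState and its deaths field are hypothetical. If
# GameState was never registered under Project Settings > Autoload, this
# script still parses and the game still runs: get_node() comes back null,
# the member access fails with a Nil error in the output, the death counter
# never increments, and nothing crashes. That error line is exactly what the
# agent reads after it presses Play.
extends Node

func _on_player_died() -> void:
    get_node("/root/GameState").deaths += 1
```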
The user's first interaction with this is genuinely strange. You ask for "a save system that handles a custom resource." With most AI tools, you get a reply like "here's the code, paste it and let me know if it works." With Ziva, the reply is more like "I added save/load, ran a save and a reload, hit a parser error because the resource was missing class_name, added that, re-ran, save system is working." The reply takes longer to come back. The thing it produces actually works.
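For what that class_name fix looks like: a custom Resource has to be registered as a global class before other scripts can refer to it by type, which is what the parser choked on. A minimal sketch, assuming Godot 4.x (PlayerData and its fields are hypothetical, not Ziva's actual output):

```gdscript
# player_data.gd -- hypothetical custom resource for the save system.
# Without the class_name line, any script that mentions the PlayerData type
# (e.g. `load("user://save.tres") as PlayerData`) fails to parse, which is
# the error described above.
class_name PlayerData
extends Resource

@export var gold: int = 0
@export var checkpoint: String = ""
```

With that registered, the round trip is just ResourceSaver.save(data, "user://save.tres") followed by load("user://save.tres") as PlayerData, the kind of save-and-reload the reply describes.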
A few things, briefly, since the IH community appreciates the honest version:
We thought the wedge was code generation. It's not. It's code execution. Plenty of AI tools generate decent GDScript. None of them run it. The differentiated thing is the runtime loop, not the model.
We wasted ~6 weeks trying to be cross-engine. Early on we had hand-wavy plans to support Unity and Unreal too. The actual product surface for "AI agent runs your code in the engine" is so engine-specific that staying narrow on Godot was the only realistic path to shipping anything in 2026. Godot has 8% of the engine market, but 8% of the engine market is still tens of thousands of devs, and long-tail Godot growth (Slay the Spire 2 shipped on Godot, Godot 4.6 just landed with Jolt as the default 3D physics engine) makes the niche bigger every quarter.
Our first pricing was too cute. We had a multi-tiered subscription with a free trial, a hobbyist tier, and a pro tier. The hobbyist tier converted at near zero. We collapsed it into a single free trial + one paid tier. Conversions doubled. Other founders had been telling us to simplify pricing all along and we kept ignoring them; it took our own data to make us listen.
We underestimated the importance of testing the AI's testing. When you're shipping an AI tool that runs code in an engine, the failure modes are layered: model output can be wrong, the engine integration can be wrong, AND the thing it tested can be wrong. We invested heavily in our own E2E test suite for the agent itself. That has paid back more than any other infra investment.
Pick the smallest ecosystem where the silent-failure problem is most painful, and go all in on the runtime loop, not the model. Models are getting better fast, and that's not your moat. The runtime integration is the moat, and it's also the thing your users will tell their friends about: "the AI ran the code itself" is the experience that finally breaks them out of the dev/AI/dev loop they've been resigned to for two years.
For Godot specifically, that turned out to be enough of a wedge to build a real product around. If we were doing this in 2024 we would have been too early. If we waited until 2027 someone else would have done it. We seem to have hit the right window by accident; not because we read the market well, but because we were Godot devs first and got tired of pasting AI code into the editor and watching it almost work.
If you're building in this space (or thinking about it), we're on Twitter at @Ziva_Sh and the product is at ziva.sh. Happy to compare notes.