Built a tool to catch when AI agents quietly mess up their own context

Been working with AI agents and ran into something that kept bugging me — an agent runs fine for a while, then starts repeating the same failed action over and over, or context from earlier in the conversation contradicts what it's doing now, and you only find out after the output is already garbage or the bill is way higher than expected.

Looked around at the usual tools (Langfuse, LangSmith, etc) and they're all great at telling you how many tokens you used and what it cost. None of them tell you if the context itself is actually healthy or already poisoned.

So I built StreamCtx to fill that gap. It scores context health, lets you diff any two steps to see exactly what changed or contradicted, auto-checkpoints so you can resume from where things broke instead of restarting the whole session, and compresses repeated context to cut token usage.

End result — instead of digging through logs trying to figure out why your agent went off the rails, you get a number and a diff that tells you exactly where it broke and why.

It's MIT licensed, free, all 6 core features included.

github.com/streamctx/streamctx

pip install streamctx

Curious if others building agents have run into the same thing or have a different way they catch this.

Sneh R Joshi

on June 25, 2026

Say something nice to sneh51…

Post Comment

1

This maps to the failure mode I keep seeing too: context health is useful, but it becomes much more actionable when it is tied to a decision boundary.

The extra output I would want per step is: what changed, what the agent is now assuming, and which future tool call or product decision that assumption would authorize or block. That turns "context is poisoned" from a vague warning into "this specific assumption would have caused the wrong action."

The checkpoint part is also important. A checkpoint is only safe if you can explain why it is safe to resume from it, not just that it is shorter or cheaper. Provenance on the checkpoint itself might be the difference between rollback and another hidden drift point.

zaindanaharper

·
21 hours ago
·
Reply
1

The context poisoning problem is real. I've seen agents go from producing great output to repeating the same bad pattern because the context accumulated contradictions.

One thing I've found helpful that complements a tool like this: giving agents an explicit 'working spec' that gets refreshed every N steps. Basically a structured document that says 'this is what we're building, these are the constraints, here's what we've decided so far.' It resets the context baseline so contradictions don't compound.

StreamCtx catches when it breaks. A structured spec prevents it from breaking in the first place. Both together is probably the right answer.

paradox07

·
a day ago
·
Reply
1

This is sharp. One extra signal I’d track is whether cost/token spikes correlate with context decay, not just final failures.

I’m working on TokenBar from the visibility side, and weird usage jumps are often the first hint an agent is looping or losing the plot. Your diff + checkpoint layer feels like the missing next step after you notice the spike.

JohnMadison

·
a day ago
·
Reply
1

Context feels like one of those problems people don't notice until it breaks.

Curious, what was the moment that convinced you this was worth building?

FounderFlow_57

·
a day ago
·
Reply
1

This is actually a real problem — most tools focus on cost/logging, not whether the agent is still “thinking correctly.”

The context health idea is interesting, especially if the score actually correlates with failures. That’s the hard part — making it actionable, not just another metric.

Diff + checkpointing also feels very practical. Debugging agents right now is painful, so anything that shows where it broke is valuable.

Main question: how early does it catch issues vs just explaining them after the fact? That probably decides how useful it becomes.

quill_ai

·
2 days ago
·
Reply
1

The distinction between "the agent ran" and "the agent stayed on track" really stood out.

We tend to measure latency, cost, and tokens because they're easy to see.

Context quality feels much harder to observe, but probably just as important once agents run for longer.

aryan_sinh

·
2 days ago
·
Reply