I’ve been working on some AI workflows/agent-style setups, and I keep running into the same issue.
Everything looks stable at first.
But over time (or after interruptions), things start to drift.
What’s weird is:
there’s no clear failure point.
It’s not crashing—it just slowly degrades.
I’ve tried a few fixes.
Those help at first, but don’t seem to actually bring the system back once it drifts.
Curious if others building with LLMs have run into this—and if you’ve found anything that actually holds up over longer runs.
This usually isn’t “drift” in the way people think.
It’s the system slowly exposing that the task itself isn’t tightly defined.
Strong prompts can hold structure early on,
but if the expected output isn’t extremely well-bounded,
the model starts filling gaps differently over time.
So it looks like degradation…
but it’s actually ambiguity compounding.
You can tighten prompts, add constraints, monitor more —
but none of that fully fixes it if the system hasn’t clearly “decided” what it should produce every time.
In a lot of cases, the real fix isn’t prompt-level —
it’s restructuring the output into something more deterministic
(e.g. enforced formats, intermediate steps, or even moving parts of it out of text entirely).
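To make the “enforced format” part concrete, here’s a rough sketch. The schema keys and the call_model stub are placeholders (not any specific library); the point is just that anything outside the fixed shape gets rejected and retried instead of quietly accepted.

```python
# Minimal sketch of the "enforced format" idea. The schema keys and the
# call_model stub are hypothetical placeholders for whatever client you use.
import json

REQUIRED_KEYS = {"summary", "action", "confidence"}  # assumed schema for the example

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your actual LLM call here")

def run_bounded(prompt: str, max_retries: int = 2) -> dict:
    """Re-ask until the output matches the fixed shape; fail loudly instead of drifting."""
    for _ in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry rather than accept free-form text
        if isinstance(data, dict) and REQUIRED_KEYS.issubset(data):
            return data
    raise ValueError("model never produced a well-bounded output")
```

Failing loudly matters more than the retry here: once an out-of-shape output gets accepted even once, it tends to become the new implicit reference.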
Curious — are you expecting mostly free-form outputs,
or something that could realistically be reduced to a fixed structure?
Yeah both of those are valid points — I’ve run into both ambiguity and context effects depending on the setup.
What’s been interesting to me is that even when the task is tightly defined and you control for context (fresh runs, bounded inputs, etc.), there still seems to be a gradual shift in behavior over repeated executions.
It’s subtle — not enough to fail constraints or break structure immediately — but enough that consistency starts to degrade over time.
So it feels like those factors explain part of it, but not all of it.
Have you seen cases where everything is tightly scoped and isolated, but the outputs still slowly diverge anyway?
Yeah — I’ve seen that even with tightly scoped tasks.
Feels less like “drift” and more like the distribution of decisions shifting over runs — still valid, but less consistent.
Which makes me think prompts hit a ceiling — we’re expecting deterministic behavior from a probabilistic system.
Curious — have you tried breaking it into smaller fixed steps instead of a single pass?
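Something like this, very roughly (the step prompts, labels, and call_model stub are stand-ins, not a real client): each step gets one narrow job with an output you can check mechanically, so a shift shows up at a specific step instead of blending into one free-form pass.

```python
# Very rough sketch of "smaller fixed steps" instead of one free-form pass.
# The step prompts, labels, and call_model stub are placeholders.
ALLOWED_LABELS = {"low", "medium", "high"}

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your actual LLM call here")

def classify_risk(text: str) -> str:
    # Step 1: one narrow job with an enumerable output that can be checked mechanically.
    label = call_model(f"Classify the risk as low, medium, or high (one word):\n{text}")
    label = label.strip().lower()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"step 1 produced an out-of-bounds label: {label!r}")
    return label

def summarize(text: str, label: str) -> str:
    # Step 2: still free-form generation, but it starts from a checked intermediate.
    return call_model(f"Write a two-sentence summary of this {label}-risk item:\n{text}")

def pipeline(text: str) -> str:
    return summarize(text, classify_risk(text))  # fixed order, validated hand-off
```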
Yeah that’s a really good way to describe it — the distribution shifting while still staying “valid.”
That’s actually what made it interesting to me, because it doesn’t behave like pure randomness — it’s more like the system gradually moving away from an implicit reference over time.
Breaking it into smaller steps definitely helps constrain it, but it feels more like reducing the surface area than actually stabilizing the underlying behavior.
Makes me wonder whether it’s less about determinism vs probabilistic output, and more about whether there’s any mechanism keeping the system anchored as it evolves.
Have you seen anything that actually holds that consistency over longer runs without continually tightening constraints?
That “distribution shifting” framing is exactly what I’ve been seeing.
What’s interesting is — most fixes try to constrain the output harder, but that just masks it short-term.
The few setups I’ve seen hold longer usually introduce some form of anchoring outside the model itself (state, validation layers, or feedback loops).
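Roughly the shape I mean, as a sketch rather than a real implementation (the anchor cases, threshold, and call_model stub are all assumptions): keep a few reference cases outside the model, replay them, and measure divergence against stored outputs instead of eyeballing it.

```python
# Sketch of "anchoring outside the model": a small set of reference cases with
# previously accepted outputs, replayed and scored so divergence is measured.
# The anchor cases, threshold, and call_model stub are all assumptions.
from difflib import SequenceMatcher

REFERENCE_CASES = {  # hypothetical anchor set: input -> output you already accepted
    "refund request, item arrived damaged": "action: approve_refund",
    "billing question, duplicate charge": "action: escalate_to_billing",
}

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your actual LLM call here")

def drift_report(threshold: float = 0.85) -> dict[str, float]:
    """Return only the anchor cases whose current output has diverged past the threshold."""
    drifting = {}
    for case, reference in REFERENCE_CASES.items():
        score = SequenceMatcher(None, reference, call_model(case)).ratio()
        if score < threshold:
            drifting[case] = score
    return drifting
```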
Still exploring this — feels like there’s a deeper pattern here.
Are you mostly working on agent workflows or more controlled pipelines?
The slow degradation pattern you're describing is worth separating into two distinct failure modes: context window drift, where earlier instructions get deprioritized as the conversation grows, versus model version drift, where an API update silently changes behavior. Both feel the same from the outside but have different fixes. Have you ruled out context window pressure as the cause by testing with fresh sessions at fixed intervals?
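If it helps, a cheap way to run that test, as a sketch (the frozen prompt, interval, and call_model_fresh stub are placeholders): replay the same prompt in a brand-new, history-free session at fixed intervals and log the outputs. If they still shift with zero conversation history, context window pressure isn't the cause.

```python
# Sketch of the fresh-session probe: replay one frozen prompt in a brand-new,
# history-free session at fixed intervals and append the result to a log.
# FROZEN_PROMPT, the interval, and call_model_fresh are placeholders.
import json
import time
from datetime import datetime, timezone

FROZEN_PROMPT = "Summarize this ticket in exactly three bullet points:\n<fixed sample text>"

def call_model_fresh(prompt: str) -> str:
    raise NotImplementedError("stateless, single-message call with no prior context")

def probe(log_path: str = "drift_probe.jsonl", interval_hours: float = 6.0) -> None:
    while True:
        record = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "output": call_model_fresh(FROZEN_PROMPT),
        }
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
        time.sleep(interval_hours * 3600)
```

If the logged outputs stay stable here but your real pipeline still diverges, the cause is accumulated context or state; if they shift even in fresh sessions, look at model version changes or sampling settings instead.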