If you’re building with LLMs, you’ve probably seen this:
…but users quietly start saying: “responses feel worse.”
We hit this in production, and it took us longer than it should have to diagnose because we were watching uptime, not outcome quality.
This post is a practical breakdown of what changed for us, what worked, and what we now run every week to catch drift earlier.
Our stack looked healthy:
But product signals went in the wrong direction:
That was the key lesson: request success != user success.
Instead of starting from model internals, we started from user outcomes.
The 4 metrics that gave us the earliest warning:
When these move while infra is still “normal,” we treat it as a drift investigation immediately.
First question: did user distribution change?
If traffic is similar but outcomes degrade, suspect model path / retrieval / prompt chain.
We split performance by:
In our case, quality drop was concentrated in one segment. A global rollback would have made the rest worse.
Most common hidden causes we saw:
We run controlled A/B checks on the same input set until one factor clearly explains the gap.
Production data is noisy. So we maintain a “golden set” of representative prompts and score it daily.
Not perfect evaluation — just consistent evaluation.
That one habit alone cut our detection lag significantly.
After adopting this flow, we saw:
The biggest shift wasn’t technical, it was operational discipline.
You don’t need a massive platform on day one.
Start with:
This gets you 80% of the value with low overhead.
Silent quality drift is more dangerous than obvious downtime because you notice it late — through trust erosion.
If you catch it before user complaints, you protect both retention and your team’s iteration speed.
If you want to try the workflow we productized internally, you can install AiKey on macOS/Linux with:
curl -fsSL https://aikeylabs.com/zh/i/ih03 | sh
Happy to share our benchmark template in a follow-up if helpful.