
I stopped running benchmarks on my agents. Here's what replaced them in a 5-agent shop.

Field notes, week of April 21

Context: I'm a solo founder (Rapid Claw), my brother Brandon handles most of the infra, and we run about 5 agents in production on any given day. Small crew, small blast radius, and honestly that's the only reason we can get away with what I'm about to describe.

Last week there was a Hacker News post (and a real paper) showing researchers getting near-perfect scores on prominent AI agent benchmarks without solving a single task. That hit a nerve. We'd been quietly drifting away from benchmarks for months and this gave us the excuse to finally write down why.

Quick honesty check on numbers before I go further. We are at low-4-figure MRR, five agents live, and fewer than two dozen paying customers. I am not about to tell you what works at scale. I'm telling you what works at our scale, this month.

Here's the arc.

Phase 1: benchmarks made us feel smart

When we started, we cared a lot about how our default agent templates scored on public benchmarks. Pass@1 on SWE-bench Lite, tool-use accuracy, browser nav success, that whole menu. It felt rigorous. We'd swap a model, rerun a suite, and if the number went up, we'd ship it.

Problem: our customers never once complained about benchmark deltas. They complained about things like "the agent burned through my budget on a loop," "the agent silently stopped picking up jobs," and "the agent said it finished but my queue still had the task." None of those show up as a benchmark score.

Phase 2: we replaced the benchmark suite with four boring production numbers

These are the only four we look at now, per agent, per day:

  1. Time-to-first-useful-output. From task accepted to the first artifact a human would consider useful. Not "first token." Not "first tool call." First useful thing.
  2. Cost per completed task. Dollars spent divided by tasks that actually closed with a passing acceptance check. Open-ended tasks inflate this fast, which is exactly what we want to see.
  3. Loop rate. Percent of runs that hit our "rethink, rewrite, rethink" circuit breaker before completion. We treat anything above 4% as a design problem, not a prompt problem.
  4. Silent stall rate. Percent of runs where the worker stopped making progress but didn't crash. This is the one that used to eat entire nights before we had heartbeats.

That's it. Four numbers. Per agent. Every day.
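
For the curious, the daily rollup is just arithmetic over run records. Here's a minimal sketch in Python; `RunRecord` and its fields are illustrative stand-ins, not our actual schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class RunRecord:
    accepted_at: datetime
    first_useful_at: Optional[datetime]  # first artifact a human would call useful
    cost_usd: float
    completed: bool          # closed with a passing acceptance check
    hit_loop_breaker: bool   # tripped the "rethink, rewrite, rethink" circuit breaker
    silent_stall: bool       # stopped progressing without crashing

def daily_rollup(runs: list[RunRecord]) -> dict:
    completed = [r for r in runs if r.completed]
    useful = sorted(
        (r.first_useful_at - r.accepted_at).total_seconds()
        for r in runs if r.first_useful_at is not None
    )
    return {
        # 1. time-to-first-useful-output (median seconds, not first token)
        "ttfu_s": useful[len(useful) // 2] if useful else None,
        # 2. all dollars spent / tasks that actually closed
        "cost_per_completed": sum(r.cost_usd for r in runs) / max(len(completed), 1),
        # 3. loop rate; above 4% we treat it as a design problem
        "loop_rate": sum(r.hit_loop_breaker for r in runs) / max(len(runs), 1),
        # 4. silent stall rate
        "stall_rate": sum(r.silent_stall for r in runs) / max(len(runs), 1),
    }
```

The `max(..., 1)` guards just keep a zero-task day from dividing by zero; note that cost per completed task includes spend on runs that never closed, which is the point.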

Phase 3: traces are the thing

The numbers point at the agent. The traces tell you why. We log every tool call, every model call, every retry, with inputs, outputs, and cost, pinned to a run ID. When a number moves we don't guess; we open the worst trace of the day and read it end to end.
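
If you want the flavor, here's a minimal sketch of the event shape, assuming plain JSONL on disk; `trace_event` and its field names are illustrative, not our actual stack:

```python
import json
import time
import uuid

def trace_event(run_id: str, kind: str, name: str,
                inputs: dict, outputs: dict, cost_usd: float = 0.0) -> None:
    """Append one tool/model/retry event, pinned to its run ID."""
    record = {
        "run_id": run_id,
        "ts": time.time(),
        "kind": kind,          # "tool_call" | "model_call" | "retry"
        "name": name,
        "inputs": inputs,
        "outputs": outputs,
        "cost_usd": cost_usd,
    }
    # One JSON line per event; filter by run_id to read a run end to end.
    with open("traces.jsonl", "a") as f:
        f.write(json.dumps(record, default=str) + "\n")

run_id = str(uuid.uuid4())
trace_event(run_id, "model_call", "plan_step",
            inputs={"prompt_tokens": 812}, outputs={"text": "..."}, cost_usd=0.004)
```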

I wrote up our stack for this over here: AI agent observability. It's the boring load-bearing part of running agents unattended. If I could go back, I would have built this before I built the second agent.

What actually moved since we switched

  • Loop rate on our research agent dropped from about 9% to 3% once we forced every long task to declare a concrete "done looks like X" check. Same model, same prompts, just a tighter acceptance contract.
  • Cost per completed task on the cleanup agent dropped roughly 40% after we capped tool-call depth and killed an unbounded "reflect and try again" step.
  • Silent stalls went from "several per week" to "maybe one a month" once Brandon wired up a heartbeat plus a "no progress in N minutes" alert (rough sketch after this list). I wrote more about that in building AI agents in production.
  • Benchmark scores on the replaced configs are actually slightly worse on paper. Customers are happier anyway. Make of that what you will.
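
The heartbeat from the third bullet, sketched under assumptions: each worker calls `beat()` on real progress, and `alert` is whatever pages you. Names and the 15-minute threshold are illustrative:

```python
import threading
import time

STALL_AFTER_S = 15 * 60  # "no progress in N minutes"; pick your own N

class Heartbeat:
    def __init__(self, run_id: str, alert):
        self.run_id = run_id
        self.alert = alert  # user-supplied callback, e.g. post to Slack or a pager
        self.last_beat = time.time()
        threading.Thread(target=self._watch, daemon=True).start()

    def beat(self) -> None:
        # Call on every unit of real progress (tool result, artifact written),
        # not on every token, or stalls hide behind chatty output.
        self.last_beat = time.time()

    def _watch(self) -> None:
        while True:
            time.sleep(30)
            if time.time() - self.last_beat > STALL_AFTER_S:
                self.alert(f"run {self.run_id}: no progress in "
                           f"{STALL_AFTER_S // 60} min, likely silent stall")
                return

hb = Heartbeat("run-123", alert=print)  # swap print for a real pager
hb.beat()
```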

The honest caveats

  • This only works because we have five agents, not fifty. At fifty, I'd probably need a whole separate layer for aggregating trace anomalies.
  • We still run benchmarks quarterly as a sanity check. They are fine as a sniff test. They are a bad daily signal.
  • None of this is novel. Observability people have been shouting this at us for years. We just weren't listening while the leaderboard felt fun.

If you're running agents in production and you're still staring at benchmark scores to decide what to ship, I'd gently suggest switching to whatever four numbers your customers would actually pay to improve. Different for everyone. Mine are above.

Curious what broke first for folks here and what signal replaced it. If you're weighing hosting choices for this kind of setup, our take is at managed AI agents.

Tijo

posted to AI Tools on April 21, 2026
  1. Thanks for sharing. Looks very interesting.

  2. Benchmarks optimize for bragging rights; production metrics optimize for reality.
     The moment you charge money, leaderboard scores matter less than whether the job gets done reliably.
