
Context: I'm a solo founder (Rapid Claw), my brother Brandon handles most of the infra, and we run about five agents in production on any given day. Small crew, small blast radius, and honestly that's the only reason we can get away with what I'm about to describe.
Last week there was a Hacker News post (and a real paper) showing researchers getting near-perfect scores on prominent AI agent benchmarks without solving a single task. That hit a nerve. We'd been quietly drifting away from benchmarks for months and this gave us the excuse to finally write down why.
Quick honesty check on numbers before I go further. We are at low-4-figure MRR, five agents live, and fewer than two dozen paying customers. I am not about to tell you what works at scale. I'm telling you what works at our scale, this month.
Here's the arc.
Phase 1: benchmarks made us feel smart
When we started, we cared a lot about how our default agent templates scored on public benchmarks. Pass@1 on SWE-Bench Lite, tool-use accuracy, browser nav success, that whole menu. It felt rigorous. We'd swap a model, rerun a suite, and if the number went up, we'd ship it.
Problem: our customers never once complained about benchmark deltas. They complained about things like "the agent burned through my budget on a loop," "the agent silently stopped picking up jobs," and "the agent said it finished but my queue still had the task." None of those show up as a benchmark score.
Phase 2: we replaced the benchmark suite with four boring production numbers
These are the only four we look at now, per agent, per day:
That's it. Four numbers. Per agent. Every day.
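To make the shape concrete, here's a rough sketch of the kind of daily per-agent rollup I mean. The metric names in it are placeholders for illustration only, loosely mapped to the complaints above; the right four are whatever your customers would actually pay to improve.

```python
# Rough sketch of a daily per-agent rollup. The metric names are placeholders,
# loosely mapped to the complaints above (budget burn, silent stalls,
# "done" claims that weren't); pick whatever your customers actually pay for.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunRecord:
    agent_id: str
    run_id: str
    cost_usd: float       # total spend for the run
    claimed_done: bool    # the agent reported success
    verified_done: bool   # the task actually left the queue
    retries: int          # model/tool retries inside the run


def daily_rollup(runs: list[RunRecord]) -> dict[str, dict[str, float]]:
    """Collapse one day of runs into a handful of numbers per agent."""
    by_agent: dict[str, list[RunRecord]] = defaultdict(list)
    for r in runs:
        by_agent[r.agent_id].append(r)

    rollup: dict[str, dict[str, float]] = {}
    for agent_id, agent_runs in by_agent.items():
        verified = [r for r in agent_runs if r.verified_done]
        total_cost = sum(r.cost_usd for r in agent_runs)
        rollup[agent_id] = {
            "runs": len(agent_runs),
            "cost_per_verified_task": total_cost / len(verified) if verified else float("inf"),
            "claimed_but_not_done": sum(
                1 for r in agent_runs if r.claimed_done and not r.verified_done
            ),
            "retries_per_run": sum(r.retries for r in agent_runs) / len(agent_runs),
        }
    return rollup


if __name__ == "__main__":
    day = [
        RunRecord("invoice-bot", "r1", 0.42, True, True, 0),
        RunRecord("invoice-bot", "r2", 3.10, True, False, 7),  # the "said it finished" case
    ]
    print(daily_rollup(day))
```

The whole point is that a day of runs collapses into a handful of numbers you can eyeball every morning.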
Phase 3: traces are the thing
The numbers point at the agent. The traces tell you why. We log every tool call, every model call, and every retry, with inputs, outputs, and cost, pinned to a run ID. When a number moves, we don't guess; we open the worst trace of the day and read it end to end.
I wrote up our stack for this over here: AI agent observability. It's the boring load-bearing part of running agents unattended. If I could go back, I would have built this before I built the second agent.
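For anyone who wants the shape without reading that post, here's a bare-bones sketch of what "pinned to a run ID" looks like. The class and method names are invented for this example, and it's nowhere near a full observability stack: one JSON line per event, plus a helper to pull the day's costliest run.

```python
# Not our actual stack (that's in the write-up linked above), just the shape:
# one JSON line per tool call, model call, or retry, pinned to a run_id, plus
# a helper to pull the costliest run so you can read it end to end.
# Class and method names are invented for this example.
import json
import time
from collections import defaultdict
from pathlib import Path


class TraceLog:
    def __init__(self, path: str = "traces.jsonl"):
        self.path = Path(path)

    def record(self, run_id: str, kind: str, name: str,
               inputs: dict, outputs: dict, cost_usd: float) -> None:
        """Append one event (kind is "tool", "model", or "retry") for a run."""
        event = {
            "ts": time.time(),
            "run_id": run_id,
            "kind": kind,
            "name": name,        # tool name or model name
            "inputs": inputs,
            "outputs": outputs,
            "cost_usd": cost_usd,
        }
        with self.path.open("a") as f:
            f.write(json.dumps(event) + "\n")

    def worst_run(self, since_ts: float = 0.0) -> tuple[str, list[dict]]:
        """Return the costliest run since `since_ts`, with its events in order."""
        runs: dict[str, list[dict]] = defaultdict(list)
        for line in self.path.read_text().splitlines():
            event = json.loads(line)
            if event["ts"] >= since_ts:
                runs[event["run_id"]].append(event)
        if not runs:
            raise ValueError("no runs recorded in this window")
        worst = max(runs, key=lambda rid: sum(e["cost_usd"] for e in runs[rid]))
        return worst, runs[worst]
```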
What actually moved since we switched
The honest caveats
If you're running agents in production and you're still staring at benchmark scores to decide what to ship, I'd gently suggest switching to whatever four numbers your customers would actually pay to improve. Different for everyone. Mine are above.
Curious what broke first for folks here and what signal replaced it. If you're weighing hosting choices for this kind of setup, our take is at managed AI agents.
Tijo
Thanks for sharing, looks very interesting.
Benchmarks optimize for bragging rights, production metrics optimize for reality.
The moment you charge money, leaderboard scores matter less than whether the job gets done reliably.