
I Thought AI Made Me Faster. My Metrics Disagreed.

Friday, 4:47 PM.
A PR lands in the repo with a clean summary, tidy diff, and an AI review comment that might as well read:

“Ship it.”

I skim. I nod. I merge.

Monday, 10:12 AM.
A teammate pings: “Why are we making three API calls per page view now?”

It worked. Tests passed. It looked correct.

It also quietly doubled latency and introduced a failure mode that only showed up under real traffic.

That’s when I stopped asking:
“Does AI make me faster?”

…and started asking the only question that matters:
“Does AI reduce time from idea → safely in production?”

Because “faster” is easy to feel.
“Productive” is something you have to measure.


The AI productivity mirage (and the hidden tax)

AI makes code appear instantly, so your brain says: we’re flying.
But in real codebases, the work often shifts from writing → verifying.
That verification tax looks like:
• rereading more carefully because you don’t fully trust the output
• extra prompts to “make it match our patterns”
• more test runs because something feels off
• cleanup commits because the diff is bigger than it needed to be

So yes, you typed less.
But you didn’t necessarily ship sooner.


What I measure now (so I stop lying to myself)

If you only measure “how quickly I produced code,” AI wins every time.
Shipping isn’t typing. Shipping is finishing.
Here’s the tiny metrics set that tells the truth:

  1. Lead time: ticket start → deployed
  2. Rework time: time spent fixing AI output after the “first draft”
  3. Defect escape rate: bugs found after merge (especially within 7 days)
  4. Review burden: how many human minutes it took to verify the change
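
None of these metrics needs fancy tooling to start. Here is a minimal sketch, assuming you log a few timestamps and minutes per PR (the field names are made up for illustration, not from any real tool):

```python
from datetime import datetime

# Hypothetical per-PR log; in practice these come from your tracker + deploys.
prs = [
    {"started": "2026-03-02T09:00", "deployed": "2026-03-03T15:00",
     "rework_minutes": 90, "review_minutes": 40, "escaped_defects": 1},
    {"started": "2026-03-03T10:00", "deployed": "2026-03-03T17:30",
     "rework_minutes": 10, "review_minutes": 15, "escaped_defects": 0},
]

def lead_time_hours(pr):
    """Lead time: ticket start to deployed, in hours."""
    start = datetime.fromisoformat(pr["started"])
    done = datetime.fromisoformat(pr["deployed"])
    return (done - start).total_seconds() / 3600

lead = [lead_time_hours(p) for p in prs]
print(f"avg lead time: {sum(lead) / len(lead):.1f}h")
print(f"avg rework: {sum(p['rework_minutes'] for p in prs) / len(prs):.0f}m")
print(f"avg review burden: {sum(p['review_minutes'] for p in prs) / len(prs):.0f}m")
print(f"defect escape rate: {sum(p['escaped_defects'] for p in prs) / len(prs):.2f}/PR")
```

Even a spreadsheet-grade version of this beats "it felt fast."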

One hard lesson:
AI can make one dev feel faster by making everyone else slower.


AI agents in code review: useful, but don’t give them authority

I used to treat AI review like a senior engineer.
That was my mistake.

Think of a review agent as a junior dev with:
• infinite confidence
• great pattern-matching
• occasional invented assumptions

Where review agents shine
• pointing out missing null checks / edge cases
• spotting inconsistent patterns
• suggesting tests you forgot
• surfacing obvious security foot-guns
• summarizing the diff (this alone saves time)

Where they’re dangerous
• deep domain logic (“is this billing rule correct?”)
• performance reality (N+1s, caching, query behavior)
• security boundaries (authz, tokens, tenant isolation)
• architecture (“does this belong here?”)

My rule now:
Agents don’t approve PRs. Agents do chores.


“Vibe coding” is fine. Shipping vibe code is not.

Vibe coding is great for exploration: “move fast, let the model fill gaps.”
It becomes risky when you treat “looks good” as “is good.”

Here are the guardrails that let me ship fast without shipping chaos:

1) Keep diffs painfully small
If the AI needs 800 lines to solve it, you don’t understand the problem yet.
Small diffs force clarity. Clarity prevents surprise architecture.
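
You can even make the budget mechanical. A sketch that sums the output of `git diff --numstat` (wiring it into CI and picking the budget are up to you; 200 lines here is arbitrary):

```python
def changed_lines(numstat_output: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat_output.strip().splitlines():
        added, deleted, _path = line.split("\t", 2)
        # Binary files report "-" instead of a count; treat them as 0 here.
        if added != "-":
            total += int(added)
        if deleted != "-":
            total += int(deleted)
    return total

DIFF_BUDGET = 200  # arbitrary; tune per repo

sample = "12\t3\tsrc/app.py\n40\t0\ttests/test_app.py"
total = changed_lines(sample)
print(total, "OK" if total <= DIFF_BUDGET else "split this PR")  # 55 OK
```

A hard limit feels bureaucratic until the first time it stops an 800-line "quick fix."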

2) Require tests that would fail without the change
AI loves happy-path tests that only validate its own assumptions.
Minimum bar:
• at least one test that fails before the change
• at least one edge-case test
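
As a sketch of that minimum bar in plain Python (`parse_page_size` is a hypothetical helper standing in for your change, and the asserts stand in for real test cases):

```python
def parse_page_size(raw, default=20, maximum=100):
    """Hypothetical change under review: validate and clamp a query param."""
    if raw is None or raw == "":
        return default
    value = int(raw)
    if value < 1:
        raise ValueError("page size must be positive")
    return min(value, maximum)

# Would fail BEFORE the change: the old code returned the raw value unclamped.
assert parse_page_size("500") == 100

# Edge cases the model's happy-path tests tend to skip.
assert parse_page_size("") == 20
try:
    parse_page_size("-1")
    raise AssertionError("expected ValueError for negative input")
except ValueError:
    pass
```

If you revert the change and no test goes red, the tests are decoration.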

3) Force invariants into words
Not “what did you do?” — what must always remain true?
Examples:
• “authz must be checked server-side”
• “billing events must be idempotent”
• “cache keys must include tenant id”
If you can’t state invariants clearly, you’re not ready to merge.
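
Once stated, an invariant can live in code rather than only in the PR description. A sketch of the last two examples (all names are illustrative):

```python
def cache_key(tenant_id: str, resource: str) -> str:
    # Invariant: cache keys must include the tenant id. Enforce it, don't assume it.
    if not tenant_id:
        raise ValueError("cache key requires a tenant id")
    return f"{tenant_id}:{resource}"

def record_billing_event(processed_ids: set, event_id: str, apply_charge) -> bool:
    # Invariant: billing events must be idempotent. A replayed event is a no-op.
    if event_id in processed_ids:
        return False
    apply_charge()
    processed_ids.add(event_id)
    return True

charges, seen = [], set()
record_billing_event(seen, "evt-1", lambda: charges.append(1))
record_billing_event(seen, "evt-1", lambda: charges.append(1))  # replayed event
assert len(charges) == 1  # charged exactly once
assert cache_key("acme", "users") == "acme:users"
```

An invariant the code enforces is one the AI (and your future self) cannot quietly break.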

4) Use feature flags when uncertainty exists
Flags are honesty. They buy you learning time without burning trust.
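
A flag doesn't need a platform on day one. A minimal env-var sketch (real systems graduate to a proper flag service, but the shape is the same):

```python
import os

def flag_enabled(name: str, default: bool = False) -> bool:
    """Minimal feature flag backed by an environment variable."""
    raw = os.environ.get(f"FLAG_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "on", "yes")

# The uncertain path stays dark until you flip FLAG_BATCHED_QUERIES=true,
# and "rollback" is unsetting a variable instead of reverting a merge.
use_new_path = flag_enabled("batched_queries")
print("batched" if use_new_path else "legacy")
```

The point isn't the mechanism; it's that shipping dark and flipping later turns "I think it's fine" into a reversible decision.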


Copy/paste prompts I actually use

Prompt: “Strict review, no approvals”
You are a strict code reviewer. Do NOT approve this PR.
Review the diff and output:

  1. Correctness risks (edge cases, undefined behavior)
  2. Security risks (authz, secrets, injections, data exposure)
  3. Performance risks (N+1, caching, queries)
  4. Maintainability (complexity, naming, structure)
  5. Test gaps (what should exist but doesn’t)

Rules:

  • If unsure, say "UNCERTAIN" and why.
  • Reference specific files/functions.
  • Suggest minimal fixes and minimal tests.

Prompt: “Smallest possible diff”
Make the smallest possible change to implement the requirement.

Output:

  • Unified diff patch only (no commentary).
  • Include/modify tests so the change is covered.

Constraints:

  • Preserve existing architecture.
  • No new dependencies.
  • Prefer existing utilities/patterns.

These two prompts did more for my workflow than any “agent autopilot.”

The merge checklist (my last line of defense)

Before I merge AI-assisted code, I ask:
• Can I explain the change without “reading the code aloud”?
• What invariant does this rely on?
• What happens on bad input / retries / timeouts?
• Is there a test that would fail if the change didn’t exist?
• Did we widen permissions or expose data?
• What’s the rollback story?
• Would I be happy owning this code in 6 months?
If any answer is “not sure,” it’s not ready.


The point isn’t “AI everywhere”

The point is predictable shipping.
AI is incredible at drafts, scaffolds, summaries, and catching the obvious.
But the moment you give it trust by default, you’re not moving fast.
You’re just moving uncertainty into production.
And production has a way of collecting interest.


Your turn

  1. Where has AI genuinely reduced end-to-end shipping time for you?
  2. What’s your “never let AI touch this” zone (auth, billing, infra…)?
  3. If you use review agents: what’s the one prompt/checklist that made them useful?
Posted to Developers on March 4, 2026
  1. 2

    Great writing as always!

    1. 1

      Thanks a lot, I really appreciate that. 🙌
      I’m trying to keep these posts practical (less hype, more “what actually works in a real repo”), so hearing this helps.

  2. 2

    The verification tax is what got me too. Code-per-hour went up but I was re-reading everything twice because I couldn't tell which part was mine vs the model's. Felt faster. Wasn't. The "agents do chores" framing is the reframe I needed.

    1. 1

      Yes, exactly this. “Code-per-hour” goes up but “confidence-per-hour” drops, so you pay it back in rereads + second-guessing. Also love your point about not knowing what’s “yours vs the model’s”; that alone increases review friction.

      Curious: what chores would you want an agent to own in your flow (diff summary, test suggestions, edge-case hunt, lint cleanup)?

  3. 2

    "Shipping isn’t typing. Shipping is finishing"; this is a masterclass in the 'Verification Tax' of 2026. 🎯

    To answer your questions:

    The "Never Touch" Zone: For me, it’s Agentic Logic. If you let AI "vibe-code" the decision-making brain of an autonomous NPC, you don’t just get a bug; you get an expensive, hallucinating runaway process that can break your entire game economy.

    The Solution: I've moved to a "Logic-First" framework. I treat the AI as the Orchestrator, not the Author. I hardcode the invariants (the "what must stay true") and let the AI only manage the transitions between those states. It keeps the 'Review Burden' low because I’m only verifying the logic gate, not 800 lines of boilerplate.

    I've actually just added my full technical roadmap for this approach to my Product section here on IH for anyone trying to bridge the gap between AI-scaffolding and production-ready systems.

    Adding your "Strict Review" prompt to my stack today; thanks for the reality check, Sophia!

    1. 1

      This is a strong framing: AI as orchestrator, not author, especially for agentic logic where a small mistake becomes a runaway system. Your “hardcode invariants + AI manages transitions” is exactly the trust model I’m advocating: humans define the rails, AI moves faster inside them.
      If you’re open to it, drop your roadmap link here, I think a lot of people would benefit. And yep, the “strict review” prompt is my “keep the agent in critic mode” hack 😄

  4. 2

    This resonates a lot.

    What I noticed in my own workflow is that AI doesn’t reduce coding time — it compresses it. The real bottleneck moves to verification and understanding the diff.

    Sometimes the code appears faster than the mental model of the change.

    One thing that helped me was forcing AI to output the reasoning and invariants first, and only then the patch. If I can’t understand the invariants in plain language, the code isn’t ready.

    Curious about one thing: have you tried measuring review time per PR before and after AI? That metric surprised me the most.

    1. 1

      Love the line: “code appears faster than the mental model.”
      That’s the trap in one sentence. And your “invariants first, patch second” workflow is exactly how to keep diffs explainable.
      On review-time-per-PR: yes, even a lightweight measure (time-to-approval + number of review comments) can be eye-opening, and it often goes up when AI makes diffs bigger/less clear.

      How are you tracking it - manual notes, or GitHub/GitLab analytics?

      1. 1

        Right now it's pretty lightweight.

        I mostly look at time-to-merge and number of review comments per PR in GitHub. Not perfect, but it already shows patterns when diffs get too big.

        What surprised me was how often AI-generated changes increase review time even when coding time drops.

        Have you noticed any correlation between AI usage and average diff size in your team?

  5. 2

    Relying on AI to review code always scared me. Good list, maybe also add: "No shipping code on Fridays"

    1. 2

      😂 “No shipping on Fridays” is honestly elite advice.
      I treat Fridays as “merge only if low-risk or behind a flag.” Anything pager-worthy goes earlier in the week with a rollback plan.
      What’s the one exception you’d allow on a Friday (docs/tests-only, tiny refactor, config)?

      1. 1

        For big projects I only deploy hot-fixes on Fridays, something that cannot wait until Monday. Otherwise, I prefer to keep my sanity on the weekends, I learned the hard way :D

  6. 2

    This is a useful list. Are you on LinkedIn?

    1. 1

      Yep, I’m on LinkedIn. DM me here on IH and I’ll send my profile link (I try not to spray external links across threads).

  7. 1

    Thanks for sharing your experience. Working with AI really requires having the right metrics in place to evaluate its impact on development efficiency. At Seedium, we integrate tools like Cursor and Copilot into the development workflow with project-specific configurations and always keep senior engineers accountable for the final output. In our experience, this approach can speed up productivity and delivery by 2–3×, depending on the project.

    The key is planning. You need to clearly understand which tasks can be automated and which still require deeper engineering expertise, and that balance may differ from project to project.

  8. 1

    the "verification tax" framing is exactly right — i stopped measuring how fast i write code and started noticing how long i spend second-guessing it afterward. the shift from writing to verifying is real and nobody talks about it enough

  9. 1

    This really resonates. I run a portfolio of small apps solo and AI has been incredible for velocity, but I've definitely shipped things that "worked" in dev and then broke in weird ways under real usage.

    The rework metric is the one that opened my eyes. I was cranking out features faster than ever, but then spending just as much time going back to fix subtle issues the AI introduced. Net gain was basically zero for a while.

    What changed for me was treating AI output like code from a contractor. You wouldn't merge a contractor's PR without reviewing it carefully, so why would you do that with AI? Now I review every diff line by line, especially around data handling and auth. It's slower upfront but the rework dropped dramatically.

    The "agents don't approve PRs, agents do chores" framing is perfect. That's exactly the mental model that works.

  10. 1

    The "it worked, tests passed, looks correct" failure mode is one of the most dangerous patterns in any fast-moving codebase — especially because it survives code review.

    The deeper point here is about which metrics you're actually watching. Output metrics (velocity, PRs merged, features shipped) feel good. Outcome metrics (latency, error rate, revenue impact) tell you if the output was worth anything.

    Same thing happens in SaaS with revenue metrics. Most founders track MRR obsessively but ignore their Stripe payment failure rate — which quietly drains 5-10% of revenue via involuntary churn. Customers who failed to pay didn't choose to leave, their card just expired. But if nobody's measuring it, nobody fixes it.

    The lesson from both cases: if it's not measured, it's not managed. The uncomfortable metrics are usually the most important ones.

  11. 1

    Great insight. In a data-heavy organization, the hidden cost of AI-generated 'vibe code' can be catastrophic when it hits the production pipeline. You mentioned that AI can make one dev faster while making everyone else slower. How do you suggest we bake these guardrails into the team culture without creating a bottleneck that defeats the purpose of using AI in the first place?
    Love the smallest possible diff prompt, by the way.

  12. 1

    Nice article, it really resonates with the idea of the verification tax. From my experience, it also depends on whether the task can be easily delegated.

    Building new features that require changes to existing abstractions (e.g. naming, interfaces, files, or folder structures) usually requires much more effort to verify. It also takes significant effort to guide the AI to fix issues, because AI often struggles to manage that level of complexity.

    On the other hand, tasks like bug fixes, handling package upgrades with breaking changes, and most frontend work generally don’t require the same level of effort to verify the resulting code changes.

  13. 1

    This matches my experience almost exactly. I run 6 apps solo and leaned hard into AI for shipping speed. The trap I kept falling into was accepting code that "looked right" because it was syntactically clean and passed the obvious tests. But AI-generated code has this weird property where the bugs are subtle and evenly distributed. Human bugs tend to cluster around the parts you rushed through. AI bugs hide in the parts you assumed were fine because the code reads so well.

    The small diffs point is underrated. I started forcing myself to review AI output in chunks of ~50 lines max instead of letting it generate entire features. Painful at first because it feels slower, but the rework dropped massively.

    One thing I'd add: AI is incredible for the boring stuff that doesn't need creative judgment. Boilerplate, test scaffolding, migration scripts. The danger zone is when you let it make architectural decisions by default because you didn't specify constraints tightly enough.

  14. 1

    This really hits home. I'm running an experiment right now building an AI-powered business almost entirely with AI agents, and the "verification tax" is the single biggest lesson from the first week.

    Early on I was using the most expensive model available for everything — burning through $10/day just on the AI "thinking" about what to do next. The dashboard looked busy, felt productive. But when I measured actual outcomes, most of that spend was the system re-analyzing problems it had already solved. Switched the bulk of operations to a cheaper model (97% cost reduction) and kept expensive models only for tasks needing deep reasoning. Output quality barely changed.

    Your "agents do chores, not approvals" maps perfectly to what I've found works. I use a 3-layer architecture: deterministic scripts handle actual execution (API calls, deployments, data processing), while AI only handles decisions and orchestration. Pushing complexity into reliable, testable code instead of trusting the LLM to get it right every time was the biggest reliability win. 90% accuracy per step sounds great until you chain 5 steps and realize you're at 59%.

    The "smallest possible diff" philosophy applies beyond code too — every time I let AI make sweeping changes across multiple systems at once, something breaks silently. Small, verifiable moves win every time.
