
I Thought AI Made Me Faster. My Metrics Disagreed.

Friday, 4:47 PM.
A PR lands in the repo with a clean summary, tidy diff, and an AI review comment that might as well read:

“Ship it.”

I skim. I nod. I merge.

Monday, 10:12 AM.
A teammate pings: “Why are we making three API calls per page view now?”

It worked. Tests passed. It looked correct.

It also quietly doubled latency and introduced a failure mode that only showed up under real traffic.

That’s when I stopped asking:
“Does AI make me faster?”

…and started asking the only question that matters:
“Does AI reduce time from idea → safely in production?”

Because “faster” is easy to feel.
“Productive” is something you have to measure.


The AI productivity mirage (and the hidden tax)

AI makes code appear instantly, so your brain says: we’re flying.
But in real codebases, the work often shifts from writing → verifying.
That verification tax looks like:
• rereading more carefully because you don’t fully trust the output
• extra prompts to “make it match our patterns”
• more test runs because something feels off
• cleanup commits because the diff is bigger than it needed to be

So yes, you typed less.
But you didn’t necessarily ship sooner.


What I measure now (so I stop lying to myself)

If you only measure “how quickly I produced code,” AI wins every time.
Shipping isn’t typing. Shipping is finishing.
Here’s the tiny metrics set that tells the truth:

  1. Lead time: ticket start → deployed
  2. Rework time: time spent fixing AI output after the “first draft”
  3. Defect escape rate: bugs found after merge (especially within 7 days)
  4. Review burden: how many human minutes it took to verify the change
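
All four numbers fall out of timestamps you already have in your tracker and CI. A minimal sketch of the arithmetic, with made-up field names (your tooling's event names will differ):

```python
from datetime import datetime, timedelta

def lead_time(ticket_started: datetime, deployed: datetime) -> timedelta:
    """Metric 1 — lead time: ticket start -> deployed."""
    return deployed - ticket_started

def rework_time(commits: list[dict]) -> timedelta:
    """Metric 2 — time spent on fix-up commits after the AI 'first draft'.
    Each commit dict carries a 'kind' ('draft' or 'rework') and a 'duration'."""
    return sum((c["duration"] for c in commits if c["kind"] == "rework"),
               timedelta())

def defect_escape_rate(bugs_after_merge: int, merged_prs: int) -> float:
    """Metric 3 — bugs found after merge (e.g. within 7 days), per merged PR."""
    return bugs_after_merge / merged_prs if merged_prs else 0.0

# A change that "felt fast" on Friday but shipped Monday's surprise:
start = datetime(2026, 3, 2, 9, 0)
ship = datetime(2026, 3, 4, 17, 0)
print(lead_time(start, ship))        # 2 days, 8:00:00
print(defect_escape_rate(3, 20))     # 0.15
```

Review burden (metric 4) is the one you can't script: it's the human minutes, and you have to ask people.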

One hard lesson:
AI can make one dev feel faster by making everyone else slower.


AI agents in code review: useful, but don’t give them authority

I used to treat AI review like a senior engineer.
That was my mistake.

Think of a review agent as a junior dev with:
• infinite confidence
• great pattern-matching
• occasional invented assumptions

Where review agents shine
• pointing out missing null checks / edge cases
• spotting inconsistent patterns
• suggesting tests you forgot
• surfacing obvious security foot-guns
• summarizing the diff (this alone saves time)

Where they’re dangerous
• deep domain logic (“is this billing rule correct?”)
• performance reality (N+1s, caching, query behavior)
• security boundaries (authz, tokens, tenant isolation)
• architecture (“does this belong here?”)

My rule now:
Agents don’t approve PRs. Agents do chores.


“Vibe coding” is fine. Shipping vibe code is not.

Vibe coding is great for exploration: “move fast, let the model fill gaps.”
It becomes risky when you treat “looks good” as “is good.”

Here are the guardrails that let me ship fast without shipping chaos:

1) Keep diffs painfully small
If the AI needs 800 lines to solve it, you don’t understand the problem yet.
Small diffs force clarity. Clarity prevents surprise architecture.
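
"Painfully small" is enforceable, not just aspirational. A sketch of a CI gate fed by `git diff --numstat` output; the 400-line cap is an arbitrary example, pick your own pain threshold:

```python
def diff_too_big(added: int, removed: int, cap: int = 400) -> bool:
    """Return True if a diff exceeds the agreed size cap."""
    return (added + removed) > cap

# `git diff --numstat` emits tab-separated added/removed/path per file:
numstat = "310\t95\tsrc/cart.py"
added, removed, _ = numstat.split("\t")
if diff_too_big(int(added), int(removed)):
    print("Diff too large: split the change, or explain why it can't be split.")
```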

2) Require tests that would fail without the change
AI loves happy-path tests that only validate its own assumptions.
Minimum bar:
• at least one test that fails before the change
• at least one edge-case test
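
Here's what that bar looks like on a hypothetical function (the names are made up). The first test is the kind AI writes for you; the second is the one that actually fails before the guard exists:

```python
def parse_call_budget(raw: str) -> int:
    """Parse a positive per-page API call budget.
    The <= 0 guard is the change under review."""
    value = int(raw.strip())
    if value <= 0:
        raise ValueError("call budget must be positive")
    return value

def test_happy_path():
    # What AI tends to write: validates its own assumptions.
    assert parse_call_budget("3") == 3

def test_edge_case_rejects_zero():
    # Fails without the guard -> proves the change exists.
    try:
        parse_call_budget("0")
    except ValueError:
        return
    raise AssertionError("zero budget should be rejected")

test_happy_path()
test_edge_case_rejects_zero()
```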

3) Force invariants into words
Not “what did you do?” — what must always remain true?
Examples:
• “authz must be checked server-side”
• “billing events must be idempotent”
• “cache keys must include tenant id”
If you can’t state invariants clearly, you’re not ready to merge.
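
An invariant you can state in words can usually be enforced in code too. A sketch of the cache-key example (the helper is hypothetical, not from any real codebase):

```python
def cache_key(tenant_id: str, resource: str) -> str:
    """Invariant: cache keys must include the tenant id, so one
    tenant's cached data can never be served to another."""
    if not tenant_id:
        raise ValueError("invariant violated: cache key requires tenant id")
    return f"{tenant_id}:{resource}"

print(cache_key("acme", "orders:list"))   # acme:orders:list
```

A one-line guard like this turns a silent cross-tenant leak into a loud failure in the first test run.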

4) Use feature flags when uncertainty exists
Flags are honesty. They buy you learning time without burning trust.
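
The flag itself can be one function and an environment variable; a minimal sketch, with a made-up flag name, where unset means the old path:

```python
import os

def flag_enabled(name: str) -> bool:
    """Feature flag read from the environment: '1' means on,
    anything else (including unset) means off."""
    return os.environ.get(f"FLAG_{name}", "0") == "1"

# The uncertain AI-generated path ships dark until the metrics earn trust.
if flag_enabled("NEW_CHECKOUT_BATCHING"):
    pass  # new, unproven code path
else:
    pass  # existing behavior
```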


Copy/paste prompts I actually use

Prompt: “Strict review, no approvals”
You are a strict code reviewer. Do NOT approve this PR.
Review the diff and output:

  1. Correctness risks (edge cases, undefined behavior)
  2. Security risks (authz, secrets, injections, data exposure)
  3. Performance risks (N+1, caching, queries)
  4. Maintainability (complexity, naming, structure)
  5. Test gaps (what should exist but doesn’t)

Rules:

  • If unsure, say "UNCERTAIN" and why.
  • Reference specific files/functions.
  • Suggest minimal fixes and minimal tests.

Prompt: “Smallest possible diff”
Make the smallest possible change to implement the requirement.

Output:

  • Unified diff patch only (no commentary).
  • Include/modify tests so the change is covered.

Constraints:

  • Preserve existing architecture.
  • No new dependencies.
  • Prefer existing utilities/patterns.

These two prompts did more for my workflow than any “agent autopilot.”

The merge checklist (my last line of defense)

Before I merge AI-assisted code, I ask:
• Can I explain the change without “reading the code aloud”?
• What invariant does this rely on?
• What happens on bad input / retries / timeouts?
• Is there a test that would fail if the change didn’t exist?
• Did we widen permissions or expose data?
• What’s the rollback story?
• Would I be happy owning this code in 6 months?
If any answer is “not sure,” it’s not ready.


The point isn’t “AI everywhere”

The point is predictable shipping.
AI is incredible at drafts, scaffolds, summaries, and catching the obvious.
But the moment you give it trust by default, you’re not moving fast.
You’re just moving uncertainty into production.
And production has a way of collecting interest.


Your turn

  1. Where has AI genuinely reduced end-to-end shipping time for you?
  2. What’s your “never let AI touch this” zone (auth, billing, infra…)?
  3. If you use review agents: what’s the one prompt/checklist that made them useful?
Posted to Developers on March 4, 2026