Sipcode v1.6.15: the 24-hour bug that taught me my own tool was lying to me

by Anuj18

I shipped a fix for my own tool last week. The bug was that my tool had been quietly lying to me for a week, and I only caught it because I was using it on a real 4-hour session of building my other project.

Quick context. I build Sipcode. It is an MIT-licensed CLI for Claude Code, Anthropic's terminal coding agent. It runs as a PreToolUse hook and keeps Claude's context clean across long sessions: it caps verbose tool output (git log, npm install, grep, tsc) and dedups same-session re-reads of unchanged files. Anthropic's own research shows cleaner context gives a 29% quality lift and 40% fewer agent errors, so this is mostly a reliability play. Tokens are the proof point, not the pitch.

Here is what happened.

I shipped v1.6.14 on June 14 with a path-normalization fix. Felt good. The next day I sat down to do real work on Answerable (my SaaS), 4 hours of dogfooding. After the session I ran sipcode drift, which is my "how much did Claude waste re-reading the same file" view. It said 624,940 tokens wasted.

That number looked wrong. So I ran sipcode proxy --stats, which is my "how much did the dedup hook catch" view. It said it had saved about 7,553 tokens.

Same session. Two views. 83x apart.
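
For anyone checking the arithmetic, the gap between the two views is just the ratio of the two reported totals. The variable names here are illustrative, not Sipcode output fields:

```python
# Totals quoted in the post for the same 4-hour session.
drift_wasted = 624_940  # tokens reported as wasted by `sipcode drift`
proxy_saved = 7_553     # tokens reported as saved by `sipcode proxy --stats`

ratio = drift_wasted / proxy_saved
print(f"{ratio:.1f}x")  # → 82.7x, which rounds to the 83x quoted above
```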

I sat with that for a while. The honest version: my tool was reporting two contradictory numbers about the same 4 hours of work, and I had been telling people on Twitter that it dedups re-reads. Both could not be right.

The bug, once I found it, was almost embarrassing in how obvious it was in hindsight. When a new user installs Sipcode mid-session (which is the common case, because nobody installs a hook before they start working), the dedup cache knows nothing about the part of the session that came before install. Reads before install are never cached. So the first time Claude re-reads one of those files after install, the cache has no prior copy to compare against, and the dedup does not fire. The session shows huge "wasted tokens" in drift, and almost nothing saved in stats. Both views were correct. The tool was just blind to the pre-install part of every session it was ever installed into.
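
The blind spot can be reproduced with a toy model. This is a minimal sketch, not Sipcode's code: the `Read` record, the two functions, and the token counts are all hypothetical. It only shows how a whole-session view and a post-install cache can both be correct and still disagree.

```python
from dataclasses import dataclass


@dataclass
class Read:
    path: str
    content: str
    tokens: int


def drift_wasted(reads):
    """Whole-session view: tokens spent re-reading content already seen."""
    seen = {}
    wasted = 0
    for r in reads:
        if seen.get(r.path) == r.content:
            wasted += r.tokens
        seen[r.path] = r.content
    return wasted


def proxy_saved(reads, install_index):
    """Hook view: the cache only learns about reads at or after install."""
    cache = {}
    saved = 0
    for r in reads[install_index:]:
        if cache.get(r.path) == r.content:
            saved += r.tokens  # cache hit: this re-read is deduplicated
        cache[r.path] = r.content
    return saved


session = [
    Read("a.ts", "v1", 1000),
    Read("b.ts", "v1", 500),
    # hypothetical install point: the hook starts seeing reads from index 2
    Read("a.ts", "v1", 1000),  # re-read, but the cache is empty: miss
    Read("b.ts", "v1", 500),   # same: miss
    Read("a.ts", "v1", 1000),  # now cached: hit
]

print(drift_wasted(session))    # → 2500
print(proxy_saved(session, 2))  # → 1000
print(proxy_saved(session, 0))  # → 2500
```

With the hook present from the first read, the two numbers agree. Installed mid-session, the hook view can only ever account for a fraction of what the whole-session view counts.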

The fix is called Verified Warm-Fill. The research that unlocked it: Claude Code's transcript JSONL has an undocumented field, toolUseResult.file.content, that holds the raw file bytes from every prior Read tool call in the session. Nobody had written about it. Without that field this design is not possible. With it, the fix is small: on the first hook fire per session, walk the transcript, back-fill the dedup cache from those historical reads, but only when the transcript bytes match the current disk bytes after LF and BOM canonicalization. If they drift, drop the candidate. Zero false-dedup, by construction, not by test coverage.
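
The steps in that paragraph can be sketched as follows. This is a sketch under stated assumptions, not the shipped implementation. The `toolUseResult.file.content` field comes from the description above, but the `filePath` key, the function names, and the cache shape (path mapped to canonical text) are assumptions made for illustration.

```python
import json
from pathlib import Path

BOM = "\ufeff"


def canonicalize(text: str) -> str:
    """Strip a leading BOM and normalize CRLF line endings to LF."""
    if text.startswith(BOM):
        text = text[len(BOM):]
    return text.replace("\r\n", "\n")


def warm_fill(transcript_path: Path) -> dict[str, str]:
    """Back-fill a path -> content cache from historical Read results.

    A candidate is kept only if it still matches the file on disk after
    canonicalization; anything that drifted or cannot be read is dropped.
    """
    cache: dict[str, str] = {}
    with open(transcript_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed transcript lines
            result = entry.get("toolUseResult") if isinstance(entry, dict) else None
            file_info = result.get("file") if isinstance(result, dict) else None
            if not isinstance(file_info, dict):
                continue
            path = file_info.get("filePath")  # ASSUMED key name for the path
            content = file_info.get("content")
            if not isinstance(path, str) or not isinstance(content, str):
                continue
            try:
                disk = Path(path).read_bytes().decode("utf-8")
            except (OSError, UnicodeDecodeError):
                cache.pop(path, None)  # unreadable now: never trust it
                continue
            if canonicalize(content) == canonicalize(disk):
                cache[path] = canonicalize(disk)
            else:
                cache.pop(path, None)  # drifted since the read: drop it
    return cache
```

The design choice worth noting is that a mismatch removes any earlier entry for the same path instead of leaving it in place, so a stale transcript copy can never cause a dedup against content that has since changed on disk.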

I shipped v1.6.15 24 hours later, on June 15. Same release also extended sipcode init so it installs the proxy hook, sets the impact baseline, and verifies the MCP tool count in one command. Test count went from 1,266 to 1,317.

Then I ran the same kind of 4-hour session on Answerable again as an acceptance test. Drift went from 624,940 wasted to "no drift detected." Proxy stats: dedup fired 2.2x more often, saved 3.9x more tokens. The two views agreed for the first time.

The locked benchmark numbers on my 20-task corpus, if you care: 62.6% median tool-output savings (range 37.4% to 80.6%), 3,567,170 tokens saved, $67.43 at current Sonnet pricing.

Three things I took from the week.

1. Dogfooding on a real session caught a bug that no unit test would ever catch. The cache-empty-at-install case was structurally invisible until I sat down to actually use the thing for 4 hours.
2. The two-view check (drift vs proxy --stats) was the bug detector. If I had only built one view I would still be shipping the lie. Internal disagreement between your own metrics is a feature.
3. The fix took 24 hours because the research (finding that JSONL field) was 80% of the work. Once I had it, the code was a few hundred lines.

Sipcode is here if you want to look: https://anuj7411.github.io/sipcode. MIT, on npm, no telemetry, no network calls in normal use.

Question for the IH crowd: what is the dumbest bug you only found because you actually used your own product for real work? I am collecting these. The "I tested it for 5 minutes and shipped" stories are the ones I learn the most from.

Anuj
solo dev, India
https://anuj7411.github.io/sipcode

Anuj18 posted to Building in Public on June 19, 2026

That kind of bug is extra scary because the UI can look calm while the truth underneath is wrong. Good reminder that users do not only need features. They need confidence the tool is telling them the truth.

OffBeatDev · 9 hours ago

The "tool lying quietly" pattern is the hardest class of bug to catch — no error thrown, just wrong numbers silently accumulating. Dogfooding on a real 4-hour session is exactly the right pressure test; synthetic benchmarks would never have surfaced this.

Building AI context tools myself (we do something similar with ContextForge — managing what the agent sees before a session), I've found the drift between what you think the agent read vs. what it actually processed is a constant source of subtle failures. The path-normalization edge case is a classic example: file paths that look identical to a human are distinct keys to a hash map.

Good catch, and good on you for shipping the fix and writing it up.

machinatools · 11 hours ago

The part that caught my attention wasn't the bug.

It was the fact that two contradictory numbers turned out to be pointing at the same reality from different directions.

Those situations are interesting because they often look like measurement problems at first.

Sometimes they're really interpretation problems.

aryan_sinh · 2 days ago
  
  You nailed the actual lesson. In my case it was both. proxy --stats only counted dedups that fired AFTER install (the literal scope of what the proxy could see). drift measures the counterfactual (what could have been saved given perfect coverage). Different metrics, same session, both correct. The fix (warm-fill) extended the install scope backwards into the transcript so the two views could converge. It was closer to an interpretation fix than a measurement one, even though it shipped as code.
  
  Have you hit this in your own projects? The cases where the data is honest but you were asking the wrong question.
  
  Anuj18 · a day ago
    
    A few times, yes.
    
    What made them interesting wasn't that the numbers disagreed.
    
    It was that resolving the disagreement didn't necessarily resolve the decision sitting underneath it.
    
    That's probably more than I'd try to unpack properly in a thread though.
    
    Happy to continue by email if useful.
    
    aryan_sinh · a day ago