I build with coding agents every day, mostly Claude Code, with a bit of Cursor and Codex. I had a vague feeling I was productive, but no real measure of it. So for 31 days I tracked everything: prompts, tool calls, sessions, commits, and PRs. That came to about 2,300 prompts, 6.3B tokens, and roughly $6K in pay-per-token-equivalent spend.
A few things genuinely surprised me.
Most of my prompts do not directly ship code. Only about 5% led to a commit or PR. I would have guessed something closer to 20 to 30%. The rest was research, planning, debugging, or just using the wrong tool. It made me realise that a lot of real work happens before anything gets committed.
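The 5% figure is just a ratio over logged prompts. A minimal sketch of how I think about it, assuming a log where each prompt record notes whether it ended in a git event (the field names here are illustrative, not an actual Claude Code export format):

```python
# Hypothetical log: 20 prompt records, one of which ended in a commit.
# Real agent logs will differ; this only shows the shape of the calculation.
prompts = (
    [{"id": i, "git_event": None} for i in range(19)]
    + [{"id": 19, "git_event": "commit"}]
)

def ship_rate(records):
    """Fraction of prompts that led directly to a commit or PR."""
    shipped = sum(1 for r in records if r["git_event"] in ("commit", "pr"))
    return shipped / len(records)

print(ship_rate(prompts))  # 0.05 — the ~5% I saw over the month
```

The denominator matters: counting only "coding" prompts instead of all prompts would make the number look much better, which is exactly the kind of self-flattery I was trying to avoid.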
Process also matters a lot. Prompts that sat inside some kind of workflow, like planning, TDD, or subagent-driven development, were much more likely to lead to shipped code than direct prompts. Brainstorming (the superpowers:brainstorming skill) was heavily used but rarely shipped anything directly, which makes sense, but I had never measured how much time went into that phase.
My code velocity also ramped up a lot over the month. I do not think I suddenly got smarter. I think I just got better at working with the agent: when to plan first, when to spawn a subagent, when to inspect state, and when to let it run.
I also realised that a single prompt is rarely a single action. On average, a prompt turned into a long multi step run with many tool calls. The real unit of work is not the prompt. It is the session.
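To see this in the data, I group events by session rather than by prompt. A rough sketch, assuming a flat event stream with hypothetical session ids and event kinds:

```python
from collections import Counter

# Hypothetical event stream: one user prompt fans out into many tool calls.
# Session ids and event kinds are illustrative, not a real log schema.
events = [
    {"session": "s1", "kind": "prompt"},
    {"session": "s1", "kind": "tool_call"},
    {"session": "s1", "kind": "tool_call"},
    {"session": "s1", "kind": "tool_call"},
    {"session": "s2", "kind": "prompt"},
    {"session": "s2", "kind": "tool_call"},
]

tool_calls = Counter(e["session"] for e in events if e["kind"] == "tool_call")
user_prompts = Counter(e["session"] for e in events if e["kind"] == "prompt")

# Tool calls per prompt, per session: the fan-out is what makes the
# session, not the prompt, the real unit of work.
for sid in sorted(user_prompts):
    print(sid, tool_calls[sid] / user_prompts[sid])
```

Once you aggregate this way, per-prompt metrics like cost or latency stop being meaningful on their own; a cheap prompt that spawns a long run is not cheap.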
The last surprise was the difference between personal and work projects. On my own repos, prompts often led to commits and PRs. On work repos, token usage was high but commits were rare. Same agent, same person, different mode. At work I mostly use the agent to read, explain, and debug existing systems. At home I use it to build.
The main thing this changed for me is awareness. I now catch myself asking whether I am actually shipping or just exploring. Both are useful, but knowing which mode I am in has already changed how I work.
The obvious limitation is that this is still just me. I do not know how much of this is personal habit versus something broader about AI assisted coding.
What I would love to hear from others:
- Have you ever measured your own AI coding workflow? What surprised you?
- Which of these patterns feels familiar?
- If you use multiple models, what actually helped you compare them in practice?
- If you had a dashboard for your own agent use, what is the one metric you would want that nobody shows yet?
This is super interesting, especially the 5% of prompts leading to commits. I would’ve also guessed way higher.
The “session vs prompt” idea really clicked for me. It feels like most of the value is in the iteration loop, not the individual prompt.
Curious: did you notice certain types of sessions (like debugging vs planning) being more "expensive" in tokens?