We Surveyed 50 Teams About AI Spend. Only 3 Could Answer. Here's What We Built.

by AiKey Labs

A few months ago, three data points landed in the same week.

Jensen Huang said a $500K engineer should burn at least $250K on tokens. Uber disclosed they'd burned their full-year AI budget in four months — with "zero correlation" between spend and output. Microsoft confirmed they're revoking AI coding tool licenses for thousands of engineers because the bills were unsustainable.

I sat down and asked our own team: what did we spend on AI last month?

Nobody knew. Not engineering. Not finance. Nobody.

That moment pushed us past the tipping point. We had been feeling the pain for months — watching AI bills creep up, wondering if the money was well spent, juggling four provider dashboards. But that week made it clear this wasn't a "nice to have" problem anymore. It was a "stop bleeding or bleed out" problem.

We started building AiKey the following week. Here's what we learned along the way.

The numbers that made us panic

Before we wrote a line of code, we wanted to understand if this was just us, or if the whole industry was drowning in the same problem. So we did something unscientific but telling: we asked around. Informal conversations with 50+ engineering leads and founders — Slack DMs, Twitter replies, coffee chats. One question: "Can you tell me your team's total AI spend last month within five minutes?"

Three could. Three out of fifty-plus.

The rest were in one of two camps. Camp A: "Let me check… actually, we use four different providers, give me a few days." Camp B: "I don't want to look. If I look, I'll have to do something about it."

That second camp was the bigger one.

At the same time, the public data kept piling up:

Uber, Q1 2025: Full-year AI budget consumed in four months. The operations lead followed up with a damning line: there was "zero direct correlation" between token consumption and actual product output. They burned millions and couldn't trace a single dollar to a feature.
Microsoft, June 2025: The Experiences & Devices division announced on June 30 it would revoke Claude Code licenses for thousands of engineers, replacing them with an internal tool. The reason wasn't quality — Claude Code is arguably best-in-class. The reason was pure economics.
Wiz, 2025 State of AI Security Report: 65% of Forbes AI 50 companies had verified sensitive API key leaks on GitHub. Not "might have leaked." Verified. Scanned. Confirmed.
Cursor power user incident: A single dev burned nearly $4 in minutes generating one config file with AI. A config file. Four dollars. Multiply that by 200 engineers, hundreds of AI interactions per day, zero caps — and you're looking at tens of thousands evaporating monthly.

The pattern was unmistakable. AI bills were ballooning everywhere, and almost nobody had a handle on it.

Why this is different from every other cost problem

I've been through the "unexpected cloud bill" panic. Every engineer over a certain age has. You spin up an EC2 instance for testing, forget about it, and a month later AWS charges you an extra $800. Classic.

This is worse. Here's why.

Cloud resources leave footprints. Instance IDs. VPC assignments. Tags. You can usually trace a rogue EC2 to a specific person and project within minutes using the AWS console.

Token consumption doesn't work that way. The unit of spend is the individual API call, orders of magnitude finer-grained than an instance, and the entry points are scattered everywhere: IDE plugins, terminal CLIs, CI pipelines, Slack bots, homegrown agents. If someone in your organization wrote a script that calls GPT-5 in a loop at 3 AM, there's no "console" that shows you that. The first notification is the bill at month-end.

Then there's the multi-provider problem. Cloud spend is mostly one vendor — AWS, GCP, or Azure. AI spend is distributed by default. A typical team uses OpenAI, Anthropic, DeepSeek, Google, and maybe one or two regional providers. Each has its own dashboard, its own pricing model, its own export format. To answer "what did we spend total this month," you're manually opening four or five admin consoles and stitching CSVs. Most teams give up at step one.

And the final layer: incentive misalignment. Cloud costs are at least recognized as an engineering responsibility — FinOps exists, tools exist, the CFO asks the CTO who asks the team lead. AI costs are still treated as "experimentation budget." Nobody owns them. Nobody reports on them. They live in a gray zone between R&D, infrastructure, and "we'll figure it out later."

Later is now.

What we built: the short version

The idea is simple. Put a governance layer between your team and every AI provider. Developers never touch real API keys. They get virtual keys with per-person quotas, rate limits, and model whitelists. Every call is logged and attributed — not "the department spent $50K" but "Alice spent $280 on Claude for this project, Bob spent $420 on DeepSeek for that project."
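
To make the attribution idea concrete, here is a minimal sketch of rolling per-call records up into "who spent what, on which provider, for which project." It is an illustration, not AiKey's actual implementation: the `CallRecord` fields, the project names, and the dollar figures are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    # Hypothetical log entry written once per proxied call.
    person: str
    project: str
    provider: str
    cost_usd: float


def spend_by_person(records):
    """Sum cost per (person, provider, project) instead of per department."""
    totals = defaultdict(float)
    for r in records:
        totals[(r.person, r.provider, r.project)] += r.cost_usd
    return dict(totals)


# Made-up sample data for illustration only.
records = [
    CallRecord("alice", "search-revamp", "anthropic", 200.0),
    CallRecord("alice", "search-revamp", "anthropic", 80.0),
    CallRecord("bob", "billing-bot", "deepseek", 420.0),
]
totals = spend_by_person(records)
for (person, provider, project), usd in sorted(totals.items()):
    print(f"{person} spent ${usd:.2f} on {provider} for {project}")
```

The only design point here is the grouping key: once every call carries a person and a project, the department-level number is just a sum over this table.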

When a teammate leaves, you revoke their virtual key and the call chain breaks within minutes. No org-wide key rotation. No panic. And if someone accidentally hardcodes their key and pushes it to GitHub — which happens constantly — the damage is contained. You revoke one virtual key, not your entire infrastructure secret.

The proxy transparently intercepts standard API calls, injects credentials, checks quotas, and logs usage. Developers write the same code they always have — it just passes through a governance layer first. Zero code changes on the application side.
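
The request path described above (check the virtual key, check the quota, inject the real credential, forward, log) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not our production proxy: the in-memory dictionaries, the `handle` function, the `Rejected` exception, and the Bearer-style header are all hypothetical stand-ins.

```python
import time

# Hypothetical in-memory state; a real proxy would keep this in a database.
REAL_KEYS = {"openai": "sk-real-provider-key"}  # never leaves the proxy
VIRTUAL_KEYS = {
    "vk_alice": {"owner": "alice", "monthly_cap_usd": 300.0, "revoked": False},
}
SPENT_USD = {"vk_alice": 0.0}
USAGE_LOG = []


class Rejected(Exception):
    """Raised when the proxy refuses to forward a call."""


def handle(virtual_key, provider, payload, send, estimate_cost):
    """Validate the virtual key, inject the real credential, forward, then log."""
    meta = VIRTUAL_KEYS.get(virtual_key)
    if meta is None or meta["revoked"]:
        raise Rejected("unknown or revoked virtual key")
    if SPENT_USD[virtual_key] >= meta["monthly_cap_usd"]:
        raise Rejected("monthly cap reached")
    # Assumed header shape; each provider defines its own auth header.
    headers = {"Authorization": f"Bearer {REAL_KEYS[provider]}"}
    response = send(provider, headers, payload)
    cost = estimate_cost(response)
    SPENT_USD[virtual_key] += cost
    USAGE_LOG.append(
        {"ts": time.time(), "owner": meta["owner"], "provider": provider, "cost_usd": cost}
    )
    return response


def fake_send(provider, headers, payload):
    # Stand-in for the real HTTP call to the provider.
    return {"seen_auth": headers["Authorization"], "usage": {"total_tokens": 1000}}


response = handle("vk_alice", "openai", {"prompt": "hi"}, fake_send, lambda r: 0.5)
```

Revocation in this sketch is a single flag flip on one virtual key: setting `VIRTUAL_KEYS["vk_alice"]["revoked"] = True` makes every later call from that key fail, while the real provider key stays untouched.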

On top of that, anomaly detection watches for patterns that break the baseline. If your team normally burns 500K tokens per hour and suddenly spikes to 3 million at 3 AM, alerts fire within minutes — not 30 days later when the invoice lands.
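
A spike check like the one above can be as simple as comparing the current hour against a trailing baseline. The sketch below assumes a median baseline and a 3x multiplier; both are illustrative choices, not a description of how AiKey actually models normal traffic, and `is_spike` is a hypothetical name.

```python
from statistics import median


def is_spike(history, current, multiplier=3.0):
    """True when `current` exceeds `multiplier` times the median of `history`."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    return current > multiplier * median(history)


# Trailing hourly token counts hovering around 500K (made-up numbers).
hourly_tokens = [480_000, 510_000, 500_000, 495_000, 520_000]

normal_hour = is_spike(hourly_tokens, 600_000)
runaway_hour = is_spike(hourly_tokens, 3_000_000)
```

A median is used here because one earlier bad hour in the history would drag a mean upward and hide the next spike.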

We call this approach TokenOps: the FinOps of the AI era. Metering. Attribution. Budgets. Same playbook, new resource.

What we got wrong (and when we course-corrected)

This is the part I wish more founders talked about openly.

Mistake #1: Starting with dashboards, not alerts.

Our first prototype was a beautiful dashboard. Real-time charts, per-model breakdowns, cost-per-project sparklines. We showed it to early testers and they said "cool" and never opened it again.

What they actually needed: an alert when someone was about to blow through a budget. Not a report they had to remember to check. We pivoted the entire notification system to push-first — the dashboard became secondary, the alerts became primary. Usage dropped 30% in one beta team the week after we shipped budget threshold alerts, simply because people became aware they were approaching limits.

Lesson: people don't check dashboards. They check notifications. Build for the notification, not the dashboard.
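
For the push-first alerts described above, the core logic is detecting the moment spend crosses a threshold, so each alert fires once rather than on every call. A minimal sketch, assuming thresholds at 50%, 80%, and 100% of budget (hypothetical values, as is the `crossed_thresholds` name):

```python
def crossed_thresholds(previous_spend, current_spend, budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget fractions crossed when spend moves from previous to current."""
    return [t for t in thresholds if previous_spend < t * budget <= current_spend]


# Spend moves from $700 to $850 against a $1,000 budget: only the 80% alert fires.
alerts = crossed_thresholds(700, 850, 1000)
```

Comparing the spend before and after each call is what keeps this from spamming: a team sitting at 85% gets no further 80% alerts until the next threshold is crossed.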

Mistake #2: Underestimating the multi-provider normalization problem.

We assumed "token counting is token counting." It's not. OpenAI counts tokens one way. Anthropic counts them differently. DeepSeek has its own scheme. Google bills by characters. Some providers include prompt tokens in cost, others only bill completion tokens. Normalizing all of this into one internal "cost" metric turned out to be a significantly harder engineering problem than the proxy layer itself.

We ended up building a normalization engine that maps every provider's billing model to a standard internal unit — not just token counting, but dollar cost equivalence across models with different per-token pricing. This took three times longer than we budgeted.
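
The shape of that normalization can be sketched as a price table keyed by provider and model, with each entry declaring its own billing unit. Everything below is a simplified assumption for illustration: the provider and model names are made up, the prices are placeholders rather than real rates, and the actual engine handles many more cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    unit: str                  # "token" or "char", whatever the provider bills in
    input_per_million: float   # USD per million input units
    output_per_million: float  # USD per million output units


# Placeholder prices for made-up models, NOT real provider rates.
PRICE_TABLE = {
    ("provider_a", "model-x"): Pricing("token", 3.0, 15.0),
    ("provider_b", "model-y"): Pricing("char", 0.5, 1.5),
}


def cost_usd(provider, model, input_units, output_units):
    """Convert provider-native usage into one comparable dollar figure."""
    p = PRICE_TABLE[(provider, model)]
    total = input_units * p.input_per_million + output_units * p.output_per_million
    return total / 1_000_000


token_billed = cost_usd("provider_a", "model-x", 1_000_000, 100_000)
char_billed = cost_usd("provider_b", "model-y", 2_000_000, 0)
```

Dollars are the common unit because raw token counts are not comparable across providers that tokenize differently or bill in characters.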

Mistake #3: Assuming teams wanted fine-grained controls immediately.

We launched with per-project, per-model, per-person budget controls. Every early tester said "that's too much, just let me set a team cap and see who's spending." We pulled back to two tiers: team-level caps (what most teams want on day one) and granular per-person controls (what they adopt after 2-4 weeks of visibility into who's actually spending). The onboarding friction dropped dramatically.
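
The two-tier model can be sketched as a team cap that always applies plus per-person caps that are optional and added later. The `Budget` class and its field names are hypothetical, and the numbers are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Budget:
    team_cap_usd: float
    person_caps_usd: dict = field(default_factory=dict)  # optional second tier
    spent_by_person: dict = field(default_factory=dict)

    def team_spent(self):
        return sum(self.spent_by_person.values())

    def can_spend(self, person, amount):
        # Tier 1: the team-wide cap always applies.
        if self.team_spent() + amount > self.team_cap_usd:
            return False
        # Tier 2: a per-person cap applies only if one has been configured.
        cap = self.person_caps_usd.get(person)
        if cap is not None and self.spent_by_person.get(person, 0.0) + amount > cap:
            return False
        return True

    def record(self, person, amount):
        self.spent_by_person[person] = self.spent_by_person.get(person, 0.0) + amount


budget = Budget(team_cap_usd=1000.0)   # day one: team cap only
budget.record("alice", 600.0)
bob_small = budget.can_spend("bob", 300.0)
bob_large = budget.can_spend("bob", 500.0)

budget.person_caps_usd["alice"] = 650.0  # weeks later: add a per-person cap
alice_over = budget.can_spend("alice", 100.0)
```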

Mistake #4: Not making the "why" obvious enough.

Our landing page talked about "token governance" and "cost attribution." Crickets. We rewrote it to "stop shipping raw API keys and hoping nothing breaks." Conversion rate tripled. Engineers don't care about governance frameworks. They care about not getting blamed when a $4,000 bill shows up because someone's script went haywire. Lead with the fear, then offer the framework.

The technical architecture (for the curious)

I'll keep this high-level since the details are implementation-specific, but here's the shape:

Local proxy layer. Runs on-prem or in your VPC. Sits between your application code and every AI provider's API. All requests flow through it transparently. No SDK changes. No code changes. Just an environment variable pointing to the proxy instead of the provider.
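
As an illustration of the environment-variable switch, the sketch below builds the same request against either the provider or a local proxy depending on what the environment says. The variable names `AI_BASE_URL` and `AI_API_KEY` and the proxy address `http://localhost:8080/v1` are hypothetical; real SDKs expose their own override (the official OpenAI Python client, for example, accepts a `base_url` argument).

```python
import os
import urllib.request

DEFAULT_BASE_URL = "https://api.openai.com/v1"


def build_chat_request(payload: bytes, env=os.environ) -> urllib.request.Request:
    """Build a chat request; only the environment decides where it goes."""
    # AI_BASE_URL and AI_API_KEY are hypothetical names used for this sketch.
    base_url = env.get("AI_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    api_key = env.get("AI_API_KEY", "")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=payload,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


# Same application code, two environments. No request is actually sent here.
direct = build_chat_request(b"{}", env={})
proxied = build_chat_request(
    b"{}", env={"AI_BASE_URL": "http://localhost:8080/v1", "AI_API_KEY": "vk_alice"}
)
```

Note that in the proxied case the application only ever holds the virtual key; the real credential is added on the proxy side.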

Virtual key system. Each developer gets a unique virtual key. The proxy maps virtual keys to real provider credentials. Virtual keys have metadata attached — who owns it, what project it belongs to, what models it's authorized for, what the daily/monthly caps are.
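
A virtual key record along these lines might look like the sketch below. The field names, the model names, and the `resolve` helper are hypothetical; the point is only that the metadata travels with the key and the real credential never leaves the proxy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualKey:
    key_id: str
    owner: str
    project: str
    allowed_models: frozenset
    daily_cap_usd: float
    monthly_cap_usd: float


# Real credentials live only inside the proxy, keyed by provider (placeholder value).
PROVIDER_CREDENTIALS = {"anthropic": "real-anthropic-secret"}


def resolve(vkey, provider, model):
    """Return the real credential if this virtual key is authorized for `model`."""
    if model not in vkey.allowed_models:
        raise PermissionError(f"{vkey.owner} is not authorized for {model}")
    return PROVIDER_CREDENTIALS[provider]


alice_key = VirtualKey(
    key_id="vk_alice",
    owner="alice",
    project="search-revamp",
    allowed_models=frozenset({"model-small"}),
    daily_cap_usd=25.0,
    monthly_cap_usd=300.0,
)
credential = resolve(alice_key, "anthropic", "model-small")
```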

Unified metering pipeline. Every call — regardless of which provider it targets — flows through the same metering pipeline. Token count, latency, cost estimate, caller identity, project tag, model, success/failure. All of it lands in a time-series database with per-second granularity.
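
To show what a provider-agnostic metering record could look like, here is a sketch of one event type plus a per-second rollup. The `UsageEvent` fields mirror the list above, but the exact schema, the names, and the in-memory bucketing are assumptions; a real deployment would write to a time-series database instead.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    timestamp: float   # Unix seconds
    caller: str
    project: str
    provider: str
    model: str
    tokens: int
    latency_ms: float
    cost_usd: float
    ok: bool           # success or failure of the upstream call


def tokens_per_second(events):
    """Bucket token counts by whole second, regardless of provider."""
    buckets = Counter()
    for e in events:
        buckets[int(e.timestamp)] += e.tokens
    return dict(buckets)


# Made-up events from two different providers landing in the same pipeline.
events = [
    UsageEvent(100.2, "alice", "search", "provider_a", "model-x", 200, 310.0, 0.004, True),
    UsageEvent(100.9, "bob", "billing", "provider_b", "model-y", 100, 250.0, 0.001, True),
    UsageEvent(101.1, "alice", "search", "provider_a", "model-x", 50, 120.0, 0.001, False),
]
series = tokens_per_second(events)
```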

Anomaly engine. We model normal call patterns per team, per project, per time window. Deviations trigger alerts. Not just "volume up" — we look at model mix changes (suddenly everyone switches to the most expensive model?), geographic anomalies (calls originating from a new region?), and failure rate spikes (looping retries burning tokens?).
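
One way to quantify a model mix change is to compare the share of calls per model between a baseline window and a recent window. The sketch below uses total variation distance for that comparison; this metric is my choice for illustration, not necessarily what the anomaly engine uses, and the function names and model labels are hypothetical.

```python
from collections import Counter


def model_share(calls):
    """Fraction of calls going to each model."""
    counts = Counter(calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {model: n / total for model, n in counts.items()}


def mix_shift(baseline_calls, recent_calls):
    """Total variation distance between two model mixes, from 0 (same) to 1 (disjoint)."""
    base = model_share(baseline_calls)
    recent = model_share(recent_calls)
    models = set(base) | set(recent)
    return 0.5 * sum(abs(base.get(m, 0.0) - recent.get(m, 0.0)) for m in models)


# Baseline: 90% of calls hit the cheap model. Recent: 90% hit the expensive one.
baseline = ["small"] * 9 + ["large"]
recent = ["small"] + ["large"] * 9
shift = mix_shift(baseline, recent)
```

A score near 0 means the team is using models in the usual proportions; a score near 1 means the mix has changed almost completely, which is worth an alert even if total volume looks flat.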

Performance. The proxy adds ~5-15ms latency per call depending on the provider and network topology. For interactive use cases (IDE autocomplete, chat), this is imperceptible. For high-throughput batch processing, we batch credential injection and use async logging to keep the overhead minimal.
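
The async logging idea can be sketched with a queue and a background worker: the request path only enqueues, and the slow write happens off to the side. This is a minimal standard-library illustration with a hypothetical `AsyncUsageLogger` class, not the proxy's actual logging code.

```python
import queue
import threading


class AsyncUsageLogger:
    """Accept usage events on the request path and write them in the background."""

    def __init__(self, sink):
        self._queue = queue.Queue()
        self._sink = sink  # any callable that persists one event
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, event):
        # Called while serving a request: enqueue and return immediately.
        self._queue.put(event)

    def _drain(self):
        while True:
            event = self._queue.get()
            if event is None:  # sentinel: stop the worker
                break
            self._sink(event)

    def close(self):
        # Flush everything queued before the sentinel, then stop.
        self._queue.put(None)
        self._worker.join()


written = []
logger = AsyncUsageLogger(written.append)
logger.log({"caller": "alice", "tokens": 10})
logger.log({"caller": "bob", "tokens": 20})
logger.close()
```

The trade-off is durability: events still in the queue are lost if the process dies, so a real system would need to decide how much of that risk is acceptable for billing data.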

What we haven't solved yet

Honest assessment of where TokenOps still falls short:

The "should we be spending this" question. We can tell you who spent what, on which model, with surgical precision. We cannot tell you whether that spend was worth it. That's a harder problem — it requires tying token consumption to business outcomes, which is as much an organizational challenge as a technical one. Some teams are experimenting with requiring engineers to tag calls with "purpose" (bug fix, feature development, research, etc.), but adoption is spotty.

Open-source model costs. If your team runs local models on its own hardware, the proxy approach doesn't help. You're paying for GPU time, not tokens. Different problem, different solution. We're watching the LLMOps space closely here.

The model that costs zero today. Some providers offer free tiers or aggressively low pricing to gain adoption. But pricing changes. If you've integrated a model whose cost was "effectively zero" into your critical path, and then pricing shifts, you have no governance layer to catch it. This is a time bomb in a lot of orgs and nobody's talking about it.

Building in public: where we are now

AiKey has been in active development for about a year. We're a small team. We're not a massive platform — we're a focused tool doing one thing and trying to do it well.

If your team is experiencing the same "wait, how much did we spend?" moments, AiKey is at https://aikeylabs.com/zh/i/ih13. Enterprise inquiries: [email protected].

I'd genuinely love to hear how other IH builders are handling this. Are you tracking AI spend? Do you even know the number? If not, is it because it's not a problem yet, or because you're too scared to look?

Drop a comment. No judgment. We're all figuring this out together.

AiKey Labs, posted to AI Tools on June 25, 2026


The strongest part of this is the thing you listed as unsolved: "we can tell you who spent what, we can't tell you if it was worth it." That's the real wedge, and I'd argue it's the whole game long-term.

Cost attribution (virtual keys, per-person metering, anomaly detection) is genuinely useful but it's also where the category gets commoditized fastest. Every provider will eventually ship native spend dashboards, and proxy-layer governance becomes table stakes. The defensible position is the layer above it: tying spend to outcomes. "Alice spent $280 on Claude" is plumbing. "This $280 shipped a feature that would've taken 3 engineer-days manually" is the insight a CFO actually acts on, and nobody does it well yet.

The "tag calls with purpose, adoption is spotty" note is the right instinct hitting the wrong mechanism. Engineers won't manually tag calls — it's friction with no personal payoff. The signal probably has to be inferred, not self-reported: correlate token bursts with commits, PRs, deploys, ticket closures. Messy, but it's the only version that doesn't depend on humans doing extra work they have no incentive to do.

Separately — "lead with the fear, then offer the framework" tripling your conversion is the most portable lesson in this post. It's true across basically every B2B category. Engineers don't buy "governance," they buy "don't get blamed for the $4K bill." Outcome-over-mechanism, fear-over-feature. Most technical founders learn that one too late.

What's the early signal on which buyer feels this most acutely — fast-scaling startups burning runway, or larger orgs where it's a compliance/control issue? Those are different products.

Hire_Hivemind · 2 days ago
  
  Appreciate the depth here — this is exactly the kind of pushback I was hoping to get out of the IH post.
  
  You're right that cost attribution alone is plumbing, and plumbing gets commoditized. The "was it worth it" layer is where the real wedge lives, and I don't think anyone has cracked it yet — us included. The tagging approach was our first swing at it and it failed for exactly the reason you laid out: zero personal incentive for the person doing the tagging. Friction with no payoff.
  
  Inferred signal is the direction we're headed. Token bursts correlated with PR merges, deploy timestamps, ticket closures — it's noisy, but noise is fixable with enough volume. The harder problem isn't the correlation, it's defining "value" in a way that works across teams. A support team's definition of ROI is completely different from an engineering team's, and the same model call can be high-value in one context and waste in another. That's the real unsolved layer.
  
  On buyers: we're seeing two distinct pain patterns. Seed-to-Series A startups feel it as a burn-rate panic — they can't afford a surprise bill but also can't afford to slow down. Larger orgs (200+) feel it as a control and compliance gap, especially in industries where audit trails matter. The trigger moment is different — one is "we might run out of money," the other is "we might fail an audit" — but both lead to the same question: "can you prove this spend was justified?" Right now we're seeing faster conversion with the compliance crowd, but the startup segment has stronger word-of-mouth once they're in.
  
  The "fear leads, framework follows" lesson was one of those things that feels obvious in hindsight but took us way too long to learn. Engineers don't buy governance. They buy cover.
  
  aikeylabs · a day ago
    
    The cross-team value-definition problem you named is the real wall, and it's worth being honest that it might be unsolvable as a single universal metric. A support team's ROI (tickets resolved, deflection rate) and an engineering team's ROI (features shipped, hours saved) aren't reconcilable into one "value" number. The trap would be trying to build a universal ROI engine. The escape is probably that you don't define value, you let each team define it and you supply the correlation layer underneath.
    
    Concretely: you give engineering the "token bursts → PR merges → cycle time" view and support the "token spend → tickets resolved → deflection rate" view. Same inferred-correlation infrastructure, different outcome metric plugged in per team type. You're not solving "was it worth it" globally. You're giving each team the wiring to answer it in their own terms. That's a configurable framework, not a universal answer, which is both more achievable and more defensible.
    
    On the two-buyer split, the data you have is actually a clear signal: faster conversion with compliance, stronger word-of-mouth with startups. That's a classic tension and the answer usually isn't "pick one," it's "sequence them." Compliance converts faster, so that's your revenue engine now — it funds the company. But word-of-mouth is the cheaper long-term growth channel, so the startup segment is your future. The mistake would be over-indexing on whichever converts fastest this quarter and letting the compounding channel starve.
    
    The positioning question that decides it: the compliance buyer wants "prove the spend was justified" (audit, defensibility, control). The startup buyer wants "don't let the spend surprise me" (caps, alerts, survival). Those are different value props, different landing pages, almost different products. Right now "stop shipping raw API keys and hoping nothing breaks" leans startup-fear. If compliance is converting faster, the page might be underserving the buyer who's actually paying. Worth testing a compliance-led message for that segment specifically and seeing if conversion climbs further.
    
    Which segment do you want to be the company known for in two years? That's the one to build the core narrative around, even while the other pays the bills.
    
    Hire_Hivemind · a day ago