We could see our AI bill, but not explain it — so I built AiKey

Hey IH — sharing a real problem we kept hitting in production.

By 2026, most teams I work with can see monthly AI totals, but still can’t answer basic questions like:

Which workflow caused the spike?
Was it real usage or retry noise?
Did higher cost actually improve outcomes?

The biggest token drains we keep seeing:

Duplicate calls across tools/agents
Context bloat (too much history per request)
Retry storms during partial failures

The issue isn’t just “high AI cost.”
It’s low visibility + weak controls.

So I built AiKey as a runtime credential + governance layer:

unified access across accounts/keys
request-level attribution by project/workflow/model
policy guardrails (budget alerts, routing, permissions)

What changed for us after implementing this:

cost discussions moved from opinions to evidence
spikes became diagnosable within minutes, not month-end
optimization focused on cost-per-outcome, not just “cheaper calls”

I’m sharing this to compare notes with other builders operating AI in production.

If useful, I can share the exact setup I’m using for this (macOS/Linux):

curl -fsSL https://aikeylabs.com/zh/i/ih01 | sh

Happy to hear feedback on setup friction / missing docs.

Posted to AI Tools on May 20, 2026
  1. 1

    This pattern is the same one cloud spend hit around 2014. Everyone could see the AWS bill, but nobody could tell you what a specific product line was costing until Cloudability and Apptio tied spend to unit economics. The wedge is real and the timing on AI is right. Two things worth thinking about. First, cost-per-successful-outcome is the metric finance and product will actually want, not cost-per-request, so how you define a 'workload' matters more than the attribution itself. Second, the curl install works for early adopters, but enterprise buyers will need a hosted dashboard before they standardize. Curious whether you're seeing pull from devs or finance first, since that usually predicts the GTM motion.

  2. 1

    This is a real pain point. I run multiple LLM APIs across different projects and tracking which calls actually drive value vs which are waste is nearly impossible with just provider dashboards. The "cost-per-outcome" framing is key — most teams only look at total spend, not whether that spend produced better results. Would be interested in seeing the attribution schema if you share it.

  3. 1

    “Cost-per-outcome” is the important part here. Feels like a lot of teams are still optimizing token cost in isolation without knowing whether the workflow itself actually got better.

    Also retry storms + context bloat are painfully real in multi-agent setups 😅

  4. 1

    This is a genuinely useful direction, especially as more teams move from “experimenting with AI” to actually managing AI costs at scale.
    The part about request-level attribution really stands out. Most teams still have one shared API key and a monthly invoice with zero clarity on which workflow, agent, or retry loop caused the spike. That becomes painful fast once multiple tools and automations are involved.
    Also like the focus on virtual keys instead of exposing raw provider credentials everywhere, feels much closer to how mature cloud infrastructure evolved.

    1. 1

      Totally agree — once usage scales, the unit of control has to be the request, not the monthly invoice.
      In AiKey we attribute cost per call using metadata like virtual_key_id, workflow_id, agent_id, run_id, and retry sequence, so spikes can be traced to a specific path instead of “someone used the shared key.”
      We also keep provider credentials isolated behind virtual keys, then enforce policy at the key layer (model allowlist, budget caps, retry limits, and route-level controls).
      That shift — from credential sharing to governed access — is exactly the maturity step AI infra needs.
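To make the key-layer policy idea concrete, here is a minimal sketch of a per-virtual-key check covering a model allowlist, a budget cap, and a retry limit. This is not AiKey's actual API: `KeyPolicy`, `check_request`, the reason strings, and the model names are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class KeyPolicy:
    """Hypothetical policy attached to one virtual key (illustrative fields)."""
    allowed_models: set[str]
    monthly_budget_usd: float
    max_retries: int
    spent_usd: float = 0.0


def check_request(policy: KeyPolicy, model: str, est_cost_usd: float,
                  retry_seq: int) -> tuple[bool, str]:
    """Return (allowed, reason) for one request made with a virtual key."""
    if model not in policy.allowed_models:
        return False, "model_not_allowed"
    if retry_seq > policy.max_retries:
        return False, "retry_limit_exceeded"
    if policy.spent_usd + est_cost_usd > policy.monthly_budget_usd:
        return False, "budget_exceeded"
    return True, "ok"


# Example: a key that has already spent most of its budget.
policy = KeyPolicy(allowed_models={"model-small"}, monthly_budget_usd=100.0,
                   max_retries=2, spent_usd=99.5)
print(check_request(policy, "model-small", 0.10, retry_seq=0))
print(check_request(policy, "model-large", 0.10, retry_seq=0))
print(check_request(policy, "model-small", 0.10, retry_seq=3))
print(check_request(policy, "model-small", 1.00, retry_seq=0))
```

The point of the sketch is only that the decision is made per request at the key, before any provider credential is touched.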

  5. 1

    This resonates. Once an AI feature has more than one step, monthly totals stop being useful pretty quickly.

    The thing I’ve found most important is separating “model cost” from “workflow cost.” A single expensive call might be fine if it produces the final user-facing result, while a bunch of cheap retries or duplicated intermediate calls can be much worse.

    I’d definitely be interested in the minimal attribution schema. Especially how you label workflow steps and distinguish real user demand from retry / failure noise.

    1. 1

      Exactly — that separation is crucial.
      In AiKey we treat model cost as per-inference spend, and workflow cost as total spend to complete one business outcome (run_id scoped), including retries, tool hops, and intermediate steps.

      Our minimal attribution schema is:

      timestamp, virtual_key_id, workflow_id, run_id, step_id, step_type, model, provider, input_tokens, output_tokens, latency_ms, status, error_code, retry_of, cache_hit, user_trigger_id

      Two practical rules that help:

      Step labeling: stable step_type taxonomy (e.g. plan, retrieve, tool_call, synthesize, final_answer) + optional step_name for team-specific detail.
      Demand vs noise: first attempt linked to user_trigger_id is demand; records with retry_of != null, timeout/error chains, or duplicate payload hashes are operational noise.

      This makes it easy to compute both:

      cost per successful outcome (run_id), and
      waste ratio (retry/duplicate/intermediate cost ÷ total cost).
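As a rough illustration of those two metrics, the sketch below uses a small subset of the schema fields. Two things are assumptions rather than part of the schema above: the `cost_usd` field (the schema lists tokens, not dollars) and the choice to count only records with `retry_of` set as noise. Duplicate-payload and intermediate-step waste would need extra fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallRecord:
    run_id: str
    step_id: str
    status: str                # assumed values: "ok" or "error"
    retry_of: Optional[str]    # step_id of the attempt being retried, else None
    cost_usd: float            # assumed field, derived from tokens elsewhere


def cost_per_successful_outcome(records, successful_run_ids):
    """Total spend (including failed attempts) divided by successful runs."""
    if not successful_run_ids:
        raise ValueError("no successful runs to divide by")
    return sum(r.cost_usd for r in records) / len(successful_run_ids)


def waste_ratio(records):
    """Share of spend that went to retries (retry_of is not None)."""
    total = sum(r.cost_usd for r in records)
    if total == 0:
        return 0.0
    noise = sum(r.cost_usd for r in records if r.retry_of is not None)
    return noise / total


records = [
    CallRecord("r1", "s1", "ok", None, 0.50),
    CallRecord("r2", "s1", "error", None, 0.25),
    CallRecord("r2", "s2", "ok", "s1", 0.25),   # retry of the failed step
]
print(cost_per_successful_outcome(records, {"r1", "r2"}))  # 0.5
print(waste_ratio(records))                                # 0.25
```

Which runs count as successful is decided outside this sketch and passed in as a set of run ids.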

  6. 1

    Request-level cost attribution is something every team running AI in production needs and almost nobody has. We run multiple AI agents daily (content generation, security auditing, SEO analysis, competitive intelligence) and our biggest cost surprise was discovering that retry storms during API timeouts were burning 3x the tokens we expected. A single flaky connection would trigger 5 retries, each sending the full conversation history.

    The three cost drivers you identified — duplicate calls, context bloat, and retry storms — are exactly right. We solved retry storms with a circuit breaker pattern (stop retrying after N consecutive failures) and context bloat by aggressively pruning conversation history. But we built all of this manually because nothing off-the-shelf existed.
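For readers who have not built these two mitigations, here is a minimal sketch of a consecutive-failure circuit breaker and a simple history pruner. It is not the commenter's implementation: `CircuitBreaker`, `prune_history`, the threshold of 3, and the "keep system messages plus the last few turns" rule are all illustrative assumptions.

```python
class CircuitBreaker:
    """Refuses further calls after `max_failures` consecutive failures."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.max_failures

    def call(self, fn, *args, **kwargs):
        if self.is_open:
            raise RuntimeError("circuit open: skipping call")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0  # any success resets the counter
        return result


def prune_history(messages, keep_last=4):
    """Keep system messages plus the last `keep_last` non-system messages."""
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    return system + (rest[-keep_last:] if keep_last > 0 else [])


# Example: a call that always times out is attempted only 3 times, not 5.
attempts = []

def flaky_call():
    attempts.append(1)
    raise TimeoutError("upstream timeout")

breaker = CircuitBreaker(max_failures=3)
for _ in range(5):
    try:
        breaker.call(flaky_call)
    except (TimeoutError, RuntimeError):
        pass
print(len(attempts))  # 3
```

A production breaker would usually also reopen after a cool-down period. That part is left out here to keep the sketch short.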

    The branding question in the comments is interesting. "Key management" undersells what you're building. If you can show a team that Agent X is costing $400/month because it's sending 12KB of context per request when 2KB would suffice, that's not cost monitoring — that's AI architecture consulting delivered as a dashboard.

    1. 1

      This is a great field report — especially the retry storm pattern with full-history replays.
      We now model that explicitly as failure-amplified token burn: retry_count × avg_context_tokens × timeout_window, and alert when the amplification ratio crosses a threshold.

      Your circuit-breaker + aggressive pruning approach is exactly the right baseline.
      In AiKey we pair that with policy controls at execution time: max retry depth, retry backoff ceilings, context budget per step, and duplicate-payload suppression keyed by run_id + step_id + payload_hash.

      And fully agree on positioning: “key management” is too narrow.
      The real value is exposing architectural waste paths (which agent, which step, which payload pattern), so teams can fix design, not just watch spend.
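A minimal sketch of duplicate-payload suppression keyed by run_id + step_id + payload_hash, plus one possible amplification ratio. Both are assumptions for illustration: the `DuplicateSuppressor` class is hypothetical, and the ratio used here (all tokens sent divided by first-attempt tokens) is a simpler stand-in, not the retry_count × avg_context_tokens × timeout_window formula from the comment.

```python
import hashlib
import json


def payload_hash(payload: dict) -> str:
    """Hash a canonical JSON form so key order does not change the hash."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DuplicateSuppressor:
    """Remembers (run_id, step_id, payload_hash) keys already sent."""

    def __init__(self):
        self._seen = set()

    def should_send(self, run_id: str, step_id: str, payload: dict) -> bool:
        key = (run_id, step_id, payload_hash(payload))
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def amplification_ratio(first_attempt_tokens: int, retry_tokens: int) -> float:
    """Tokens actually sent divided by tokens the first attempts needed."""
    if first_attempt_tokens <= 0:
        raise ValueError("first_attempt_tokens must be positive")
    return (first_attempt_tokens + retry_tokens) / first_attempt_tokens


suppressor = DuplicateSuppressor()
print(suppressor.should_send("run-1", "plan", {"prompt": "hi", "k": 1}))  # True
print(suppressor.should_send("run-1", "plan", {"k": 1, "prompt": "hi"}))  # False
# One 3000-token request retried 5 times with full history each time.
print(amplification_ratio(3000, 5 * 3000))  # 6.0
```

How intentional retries of an identical payload should pass through such a filter is a design question this sketch does not answer.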

  7. 1

    The attribution schema piece is the right place to invest. Pattern that works well: treat each AI call as an event fact row with dimensions for model, workflow/agent, project/user, is_retry (boolean), and outcome_status. With that grain, "cost per successful completion" vs "cost per retry storm" is just a GROUP BY — no additional instrumentation needed downstream.

    The spike diagnosis problem is usually a time granularity issue. Most teams aggregate by day, which flattens the shape of a retry storm entirely. Keeping request_timestamp at minute granularity and flagging the first call in a sequence separately makes those patterns visible in any BI tool without building custom anomaly logic.
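A small sketch of the "one row per call, then GROUP BY" idea at minute granularity, using Python's built-in sqlite3 module. The table name `ai_calls`, the `total_tokens` column, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE ai_calls (
           request_timestamp TEXT,
           workflow TEXT,
           is_retry INTEGER,
           outcome_status TEXT,
           total_tokens INTEGER
       )"""
)
rows = [
    ("2026-05-20 10:00:05", "summarize", 0, "ok", 1200),
    ("2026-05-20 10:01:10", "summarize", 0, "error", 1200),
    ("2026-05-20 10:01:20", "summarize", 1, "error", 1200),
    ("2026-05-20 10:01:35", "summarize", 1, "error", 1200),
    ("2026-05-20 10:01:50", "summarize", 1, "ok", 1200),
]
conn.executemany("INSERT INTO ai_calls VALUES (?, ?, ?, ?, ?)", rows)

# Bucket by minute; a daily bucket would hide the burst at 10:01.
query = """
    SELECT strftime('%Y-%m-%d %H:%M', request_timestamp) AS minute,
           COUNT(*)          AS calls,
           SUM(is_retry)     AS retries,
           SUM(total_tokens) AS tokens
    FROM ai_calls
    GROUP BY minute
    ORDER BY minute
"""
per_minute = conn.execute(query).fetchall()
for row in per_minute:
    print(row)
```

In this sample the 10:01 bucket holds four calls, three of them retries, which is the shape a daily rollup would flatten.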

    Would definitely read a follow-up on your anomaly rules — particularly curious how you're handling partial failure attribution across multi-step agent chains.

    1. 1

      Great points — especially the event-fact grain and minute-level timestamping.
      That’s exactly how we model it in AiKey: one row per call, with run_id and step_id linking calls into a chain, plus is_retry and outcome_status for cost-quality separation.

      For partial failures in multi-step chains, we use two layers:

      Step-level attribution: each step owns its direct token/latency/error cost.
      Outcome-level rollup: final business outcome is evaluated at run_id scope, then failed or degraded runs inherit upstream waste via weighted allocation (by token share + dependency depth).

      So we can report both:

      “where cost was spent” (step truth), and
      “why outcome failed” (run truth),
      without conflating the two.
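The comment does not give the weighting formula, so the following is only one way such an allocation could look. It assumes a weight of tokens × (depth + 1) per step, normalized across the run. The function name `allocate_waste`, the weight formula, and the direction of the depth effect are all assumptions.

```python
def allocate_waste(steps, wasted_cost_usd):
    """Split a failed run's wasted cost across its steps.

    steps: list of (step_id, tokens, dependency_depth) tuples.
    Weight per step is tokens * (depth + 1); this formula is an assumption.
    Returns {step_id: allocated_cost_usd}.
    """
    weights = {step_id: tokens * (depth + 1) for step_id, tokens, depth in steps}
    total = sum(weights.values())
    if total == 0:
        return {step_id: 0.0 for step_id in weights}
    return {step_id: wasted_cost_usd * w / total for step_id, w in weights.items()}


steps = [
    ("plan", 100, 0),
    ("retrieve", 300, 1),
    ("synthesize", 100, 2),
]
allocation = allocate_waste(steps, wasted_cost_usd=1.0)
for step_id, cost in allocation.items():
    print(step_id, round(cost, 2))
```

With these sample numbers the weights are 100, 600, and 300, so the allocated shares are 10%, 60%, and 30% of the wasted cost.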

      1. 1

        The step/run distinction maps cleanly to what I'd call fact-at-event vs fact-at-outcome separation in dimensional modeling — step truth lives in the transaction fact table, run truth rolls up to a session/outcome fact at a different grain. Keeping those separate means your BI layer can answer both questions without forcing aggregation artifacts or losing step-level detail.

        The weighted allocation by token share + dependency depth is the interesting design choice. Are you storing the weights at attribution time (denormalized into each step row) or computing them on read? Storing at write time keeps historical snapshots stable when you later change your weighting logic — but recomputing on read gives more flexibility to backfill corrected attributions. Curious which path you went with and whether you've hit any reprocessing pain yet.

  8. 1

    This problem hits hard once you cross a certain MRR. Running SocialPost.ai, the moment we passed the threshold where AI spend became a meaningful percent of COGS, the lack of per-customer attribution went from annoying to existential. Most cost tools treat AI like infrastructure. It is closer to variable cost of goods sold. The companies that figure out attribution per workflow and per customer first are the ones who can confidently price tier upgrades. Curious if AiKey supports tagging at the request level by end-user, not just project.

    1. 1

      Absolutely — and we agree AI spend should be treated as variable COGS, not just infra overhead.
      AiKey supports request-level attribution by end-user, not only project/workflow.

      At ingestion time, each call can carry tags like: customer_id, end_user_id, tenant_id, workflow_id, agent_id, run_id, feature_flag, plus optional billing labels (e.g. plan_tier, region).

      That enables direct metrics such as:

      cost per customer / per workflow / per successful completion
      margin by plan tier
      upgrade trigger signals (high-value users with rising successful AI usage vs retry-heavy waste)

      So yes, per-end-user tagging is first-class, and it’s designed for pricing and COGS decisions, not just ops dashboards.
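To show how request-level tags could turn into those metrics, here is a minimal sketch that rolls tagged calls up by any tag and then computes margin by plan tier. The `cost_usd` field, the revenue figures, and the helper names `cost_by` and `margin_by_tier` are hypothetical; only the tag names `customer_id` and `plan_tier` come from the comment.

```python
from collections import defaultdict


def cost_by(calls, tag):
    """Sum cost_usd grouped by one tag, e.g. "customer_id" or "plan_tier"."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[tag]] += call["cost_usd"]
    return dict(totals)


def margin_by_tier(calls, revenue_by_customer):
    """(revenue - AI cost) / revenue per plan tier, treating AI spend as COGS."""
    cost = cost_by(calls, "plan_tier")
    tier_of = {c["customer_id"]: c["plan_tier"] for c in calls}
    revenue = defaultdict(float)
    for customer_id, tier in tier_of.items():
        revenue[tier] += revenue_by_customer.get(customer_id, 0.0)
    return {
        tier: (revenue[tier] - cost[tier]) / revenue[tier]
        for tier in cost
        if revenue[tier] > 0
    }


calls = [
    {"customer_id": "c1", "plan_tier": "pro", "cost_usd": 2.0},
    {"customer_id": "c1", "plan_tier": "pro", "cost_usd": 3.0},
    {"customer_id": "c2", "plan_tier": "starter", "cost_usd": 4.0},
]
revenue_by_customer = {"c1": 50.0, "c2": 20.0}  # made-up monthly revenue

per_customer = cost_by(calls, "customer_id")
margins = margin_by_tier(calls, revenue_by_customer)
print(per_customer)
print(margins)
```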

  9. 1

    Cost without explainability is the new 'we have logs but no traces' problem for AI. Are you attributing spend by prompt/route, by user/session, or by feature? The first surfaces obvious wins, the third is what PMs actually want to act on.

    1. 1

      Exactly — we see the same pattern: cost data without attribution context is basically “logs without traces.”
      In AiKey, we don’t force one lens; we keep all three as first-class dimensions on each request: route/prompt_template, user/session, and feature_id.

      A practical way to use them:

      Prompt/route: fastest for immediate efficiency wins (duplicate calls, context over-send, retry-heavy routes).
      User/session: explains demand quality and behavior-driven variance.
      Feature: decision layer for PMs (unit economics, ROI, tiering, roadmap tradeoffs).

      So the workflow is: optimize at route level, validate at session level, decide at feature level.

  10. 1

    The 'cost-per-outcome' framing is exactly what's missing from most AI infrastructure discussions right now. As we move from simple chatbot calls to complex, multi-step agentic workflows, the ability to attribute spend to a specific feature or customer outcome is the only way to keep unit economics healthy. Moving from opinion-based optimization to request-level evidence is a game changer for teams scaling their production AI and trying to justify the infra costs at the end of the month.

    1. 1

      Well said — this is exactly the shift we’re seeing.
      Once workflows become multi-step, total token spend is not a useful control metric unless it is tied to outcomes.

      In AiKey, we model this explicitly as:

      cost-per-request (execution truth),
      cost-per-run (workflow truth),
      cost-per-outcome (business truth).

      That gives teams a clean path from infra telemetry to unit economics: optimize request waste, stabilize workflow completion, then evaluate feature/customer margin with evidence instead of intuition.

  11. 1

    This hits a real pain point — we had the exact same problem at my company. Azure OpenAI costs would spike and we'd spend hours cross-referencing logs trying to figure out which pipeline or feature was the culprit. The "see it but can't explain it" feeling is exactly right.

    Quick question: does AiKey break down costs at the prompt/feature level, or is it more at the model/API key level? That granularity question was always our sticking point — knowing we spent $400 on GPT-4 is useless; knowing which endpoint burned $400 is actionable.

    1. 1

      Great question — and we agree that model-level totals are not actionable by themselves.
      AiKey supports attribution below model/API-key level, including endpoint/route, feature_id, workflow_id, agent_id, and prompt template/version (when provided), all at request granularity.

      So instead of “$400 on GPT-4,” you can answer:

      which endpoint burned it,
      whether it was user demand vs retries/duplicates,
      and which workflow step caused the spike.

      Model and key are still preserved as dimensions, but they’re just one slice.
      The operational layer is feature/route/workflow-level attribution.

  12. 1

    This is a strong infra problem because the pain is not the AI bill itself. It is that teams are running production AI workflows without request-level accountability. Once agents, retries, routing, and context history are involved, monthly spend becomes too blunt to explain what is actually happening.

    The “cost-per-outcome” framing is the sharpest part. That moves AiKey away from being only a key-management layer and closer to AI runtime governance: attribution, policy, routing, anomaly detection, and control at the workflow level.

    The naming is worth taking seriously too. AiKey explains the credential layer, but it may become too narrow if the product grows into broader AI cost governance and runtime control. For that direction, Exirra .com would feel more like infrastructure software, not just an AI key utility.

    1. 1

      Great take — really appreciate this.

      You captured the core problem exactly: the bill isn’t the hardest part, the lack of request-level accountability is. Once agents, retries, routing, and long context chains enter production, monthly totals stop being operationally useful.

      Also +1 on your read of our direction. We started from credential orchestration, but the product is clearly moving toward runtime governance: attribution, policy, routing, anomaly detection, and workflow-level control.

      And thanks for the naming feedback — that’s a very thoughtful point. We’re actively evaluating brand architecture as the scope expands beyond key management.
      Really appreciate you taking the time to write this.

      1. 1

        That makes sense, especially if you are already evaluating brand architecture.

        I would treat that as a near-term decision, not a later polish item.

        AiKey is clear for the credential/key-management wedge, but if the product is moving toward runtime governance, cost attribution, policy, routing, anomaly detection, and workflow-level control, the category becomes much bigger than “AI keys.”

        The risk with waiting is that users, docs, integrations, and early customers start remembering the product through the key-management frame. Then when the product becomes broader infrastructure, the market still describes it by the original narrow wedge.

        Exirra feels stronger for the broader direction because it can carry AI infrastructure, governance, and runtime intelligence without locking you to credentials.

        If Exirra is genuinely close to the kind of brand architecture you are evaluating, message me here or on LinkedIn. I may be able to help with that side privately without turning this thread into a public pricing discussion.

        https://www.linkedin.com/in/aryan-y-0163b0278/

        1. 1

          Appreciate the clarity here — especially the “timing risk” framing.
          Our product surface is already broader than credential orchestration, so naming and category language now affect docs, integration narratives, and buyer expectations later.
          We’re actively evaluating that transition path, and your perspective is very relevant.

          1. 1

            That’s exactly the point I’d act on before the current frame gets too baked in.

            If the product surface is already moving beyond credential orchestration, then the name is not just cosmetic anymore. It will shape how buyers understand the platform, how integrations are positioned, and whether the market sees AiKey as a key utility or a broader runtime governance layer.

            I would not overthink this publicly, but if Exirra is genuinely close to the direction you are evaluating, it is worth pressure-testing privately before more docs, customers, and product language lock around AiKey.

            I sent my LinkedIn above. Happy to discuss fit and ownership there if it is a serious candidate.
