We were using AI from 4 different providers and could not reconcile the bill, so we built EvoLink

by evan dong

Hey IH — sharing a problem we kept running into while using AI from multiple providers.

Using different models was easy enough at the beginning.

One workflow used OpenAI.

Another used Claude.

A few experiments used Gemini or smaller models.

Some tools had their own keys. Some scripts had separate env vars. Some calls were made through agents. Some were just quick tests that never got cleaned up.

The model calls worked.

The bill did not.

The hard part was not only “which model is cheaper?”

It was answering basic questions across providers:

which project used which model?
which provider was responsible for the spike?
was this real user usage, testing, retry noise, or fallback traffic?
did switching to a stronger model actually improve the result?
how much did one workflow cost end to end, not just one API call?

Provider dashboards were useful, but they were all separate.

Each one had its own pricing, usage view, API key structure, and reporting format.

So we kept ending up with a messy spreadsheet just to understand what happened.

That is why we started building EvoLink.

The idea is simple:

one API layer for calling different AI providers,

and one usage layer for understanding where the money went.

The minimal thing we wanted to track per request:

workflow_id
run_id
project_id
provider
model
input_tokens / output_tokens
cost
fallback_of
retry_of
outcome_status
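
The fields above could be captured in a single record type. This is a minimal sketch, not EvoLink's actual schema: the `UsageRecord` name, the extra `request_id` field, the types, the USD assumption for `cost`, and the example `outcome_status` values are all assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UsageRecord:
    """One row per model call, using the field names listed above."""

    request_id: str  # hypothetical: gives retry_of / fallback_of something to point at
    workflow_id: str
    run_id: str
    project_id: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    cost: float  # assumed: estimated USD at call time, not the invoiced amount
    outcome_status: str  # assumed values such as "success" or "error"
    fallback_of: Optional[str] = None  # assumed: request_id of the call this one replaced
    retry_of: Optional[str] = None  # assumed: request_id of the call this one retried


# Illustrative values only; the provider, model, and numbers are made up.
example = UsageRecord(
    request_id="req-2",
    workflow_id="summarize-tickets",
    run_id="run-17",
    project_id="support-bot",
    provider="provider-a",
    model="model-large",
    input_tokens=1200,
    output_tokens=300,
    cost=0.012,
    outcome_status="success",
    retry_of="req-1",
)
```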

This changed how we looked at AI spend.

Instead of asking:

“Which provider is cheaper?”

we started asking:

“Which workflow needs which model, and what did that workflow cost across all providers?”
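
Answering that second question is mostly a group-by over the per-request rows. A rough sketch, assuming each row is a plain dict with `workflow_id`, `provider`, and `cost` keys (the function name and sample rows are hypothetical):

```python
from collections import defaultdict


def cost_by_workflow(records):
    """Sum cost per workflow, split by provider, with a cross-provider total."""
    by_provider = defaultdict(lambda: defaultdict(float))
    for r in records:
        by_provider[r["workflow_id"]][r["provider"]] += r["cost"]
    return {
        wf: {"by_provider": dict(p), "total": sum(p.values())}
        for wf, p in by_provider.items()
    }


# Made-up rows for illustration; costs are assumed to be in USD.
records = [
    {"workflow_id": "onboarding-email", "provider": "provider-a", "cost": 0.50},
    {"workflow_id": "onboarding-email", "provider": "provider-b", "cost": 0.25},
    {"workflow_id": "ticket-triage", "provider": "provider-a", "cost": 1.00},
]
summary = cost_by_workflow(records)
```

Here `summary["onboarding-email"]["total"]` covers both providers, which is the number a single provider dashboard cannot show.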

Still early, but this has already made routing and cost discussions much clearer.

Curious how other builders are handling this.

If you use more than one AI provider, are you reconciling usage manually, relying on each provider dashboard, or already routing everything through one layer?

Here is our website: https://evolink.ai/?utm_source=indiehackers&utm_medium=community_post&utm_campaign=building_in_public_cross_provider_billing_202606&utm_content=multi_provider_bill_post

evan dong

posted to

Building in Public

on June 1, 2026

1

This is the exact reason I think AI gateways need a run receipt, not just a provider dashboard.

The part that gets missed is the balance bucket / settlement bucket. A request can look like one workflow from the user side, but internally it might start on a cheaper route, retry, fall back to an official route, or use a stronger upstream model for one step. If the ledger only says provider + model + tokens, support still has to reconstruct why the balance moved.

For Tokens Forge I ended up separating the receipt into requested model, upstream model, primary/backup route, retry/fallback state, token usage, and which balance paid. That makes the conversation less about "which provider is cheapest" and more about "which workflow created this spend, and was the fallback worth it."

tokensforge

·
24 days ago
·
Reply
1

the hidden problem is not the bill, it is the lack of one place to see what was spent and why.

sweeeeeft

·
2 months ago
·
Reply
1

We hit the exact same thing. Ended up with a spreadsheet comparing Claude and Gemini costs per workflow which felt ridiculous. Interested to see where this goes.

hamco1303

·
2 months ago
·
Reply
1

The bill reconciliation pain is real — I'm running Claude + OpenAI fallback on my WhatsApp agent and even with 2 providers the cost attribution per customer conversation is a manual mess. Per-tenant token tracking + per-call cost annotation in the DB is the first thing I'd build into any LLM app from day one. Curious what your abstraction layer looks like.

worvi26

·
2 months ago
·
Reply
1

I hit a smaller version of this with OpenAI, Claude, and a couple of one-off scripts; the spreadsheet cleanup was worse than the model swap. OpenRouter helps with the gateway piece and Portkey is decent for observability. I built PrivacyForge for the trust-doc side, because once Stripe, analytics, or another provider gets added, the policy layer drifts too. The wedge I'd pay for here is workflow-level cost plus fallback attribution in one place; that's the bit the native dashboards still miss.

mer

·
2 months ago
·
Reply
1

This is the AI version of the cloud-bill problem from a decade ago, except it shows up faster because tokens scale with usage in real time. Aggregating spend across providers is table stakes. The piece that actually changes decisions is attribution: tying spend to a specific workflow or customer so you can answer "is this feature even profitable per call," not just "what did we spend." We hit the same wall running AI inside SocialPost, and the moment we could see cost per workflow we cut two features that looked fine on the dashboard and were underwater per use. One thing I'd pressure-test: the gateway and router proxies people put in front of multiple providers are the natural home for this data. Is EvoLink a layer on top of those, or are you betting they won't move into cost attribution themselves? I'd want that answered before building deep here.

GregoryScottHenson

·
2 months ago
·
Reply
1

The best products often come from solving your own problems first. If a pain point is frustrating enough for your team, chances are other businesses are facing the same challenge. Building a solution around a real operational problem usually leads to much stronger product-market fit.

UnitMorph

·
2 months ago
·
Reply
1
This brought back some painful memories.

We were in the exact same spot: four providers, four dashboards, one spreadsheet nobody trusted. But our breaking point wasn't the bill. It was realizing we had no idea if any of that spend was actually working.

Tokens out. Nothing measurable back.

That's what led us to build AmpPilot, helping teams create organic content that actually compounds automatically. Content that gets cited by AI, shared in communities, builds pipeline without paying for every eyeball.

You're solving problems that every startup founder wants to solve:
1. "Where did the money go?"
2. "Did the output earn anything back?"

Same chaos, but with a different angle.

Amppilot_founder12

·
2 months ago
·
Reply
1

I ran into exactly this last quarter. We had 3 providers (OpenAI, Claude, Gemini) and I was tracking costs in a Notion table that got outdated within a week. The "which workflow needs which model" question is the right framing — we wasted weeks optimizing per-call cost when the real issue was that one workflow was silently calling Claude 3.5 Sonnet for a task that GPT-4o-mini handled equally well at 1/10th the cost. Having that workflow-level visibility would have caught it in a day. Are you planning to expose cost-per-workflow as a first-class metric in EvoLink?

Loviz

·
2 months ago
·
Reply
1

We ran into the same mess: multiple providers, multiple dashboards, and one giant spreadsheet nobody wanted to maintain. The workflow-centric view you’re building makes way more sense than chasing per-call costs. Curious to see how EvoLink handles retries/fallbacks, since that’s where our hidden spend piled up.

curious_builder

·
2 months ago
·
Reply
1
This is a real pain point that's way more common than people talk about. Managing multiple AI providers is easy when you're prototyping, but the billing abstraction layer breaks down fast in production.

A few things I've hit with similar setups:
1. Token counting inconsistency — different providers count tokens differently (especially with images, system prompts, and tool calls). So your internal estimates drift from their bills.
2. Mid-month model deprecations — you build cost projections around a model, they deprecate it, you switch, and suddenly your unit economics look completely different.
3. Retroactive pricing changes — some providers update pricing on cached tokens, batch tiers, etc. and don't make it obvious.
The observability problem you're solving (tying a "workflow" to its actual cost across multiple calls + retries + fallbacks) is the real unlock. That's where current tooling is genuinely weak.

Two questions: how are you handling model versioning in cost attribution (e.g. gpt-4o vs gpt-4o-2024-08-06 having different pricing), and do you support team-level cost attribution or is it purely per-key?

ayush523

·
2 months ago
·
Reply
1

ok this is exactly the problem i went down the rabbit hole on (i build in the llm gateway/cost space too). your per-request fields are right, but the thing that bit me hardest: one logical "outcome" almost never maps to one call. a single user action fans out into retries, a fallback to a stronger model, plus agent sub-calls. so per-call rows don't actually answer "what did this workflow cost" until you give them a parent trace id and roll them up. your fallback_of / retry_of fields are the right instinct, you basically need the full tree not just the edge.

second thing nobody warns you about: your metered token count will drift from the provider's invoice. caching, batch discounts, committed-use pricing all mean tokens != dollars cleanly. i had to reconcile against the actual billed amount per provider or the numbers slowly lied to me.

honestly the gateway part is the easy 20%. the attribution semantics is the real product. curious how you're handling agent fan-out right now, flat or as a trace?
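
One way to picture the roll-up described above: follow each call's `retry_of` / `fallback_of` link back to the original request and sum cost at that root. This is only a sketch under assumptions. The `request_id` key is hypothetical, each call is assumed to have at most one parent link, and agent sub-calls would still need an explicit parent or trace id, which this does not model.

```python
def root_of(request_id, parents):
    """Follow retry_of / fallback_of links up to the original call."""
    seen = set()
    while parents.get(request_id) is not None:
        if request_id in seen:
            raise ValueError(f"cycle in retry/fallback links at {request_id}")
        seen.add(request_id)
        request_id = parents[request_id]
    return request_id


def rollup_cost(calls):
    """Total cost per root request, including its retries and fallbacks."""
    parents = {
        c["request_id"]: c.get("retry_of") or c.get("fallback_of") for c in calls
    }
    totals = {}
    for c in calls:
        root = root_of(c["request_id"], parents)
        totals[root] = totals.get(root, 0.0) + c["cost"]
    return totals


# Made-up calls: "a" is retried as "b", which then falls back to "c".
calls = [
    {"request_id": "a", "cost": 0.25},
    {"request_id": "b", "retry_of": "a", "cost": 0.50},
    {"request_id": "c", "fallback_of": "b", "cost": 1.00},
    {"request_id": "d", "cost": 0.25},
]
totals = rollup_cost(calls)
```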

ravirdp

·
2 months ago
·
Reply
1

Hit this exact problem building EarningsScores. I use Claude for scoring earnings reports and the Anthropic dashboard is fine for totals, but once you're running parallel calls across multiple tickers in a serverless function, you lose any sense of which workflow drove which cost. I ended up just adding a custom logging middleware to track tokens per function call, but it's fragile. The "which workflow needs which model" framing is the right question — that's what I'd want answered per-call. Will check out EvoLink.

EarningsScores

·
2 months ago
·
Reply
1. 1
  
  That earnings-report use case is a great example.
  
  Parallel calls across tickers are exactly where provider dashboards stop being useful, because the total cost is visible but the reason behind the cost is not.
  
  A custom logging middleware is usually the first reasonable step. The problem is that it becomes fragile once you add retries, fallback models, version changes, or more workflows.
  
  The question I’d want answered per run is: which ticker/workflow used which model, what did it cost, and did the output quality justify that model choice.
  
  That’s the direction we’re trying to make easier with EvoLink.
  
  evan66
  
  ·
  2 months ago
  ·
  Reply
  1. 1
    
    The output quality question is the one I can't answer yet. Right now I track cost per ticker run but I have no systematic way to know if a Claude Sonnet score was better or worse than a Haiku score for the same report. I just... assume Sonnet is better. That's not a great assumption.
    
    The logging middleware I built works fine for the happy path but you're right that it started breaking down once I added retries. A failed call gets logged differently than a successful one and my cost estimates started drifting.
    
    Checking out EvoLink — that per-run attribution is exactly what I'd need before I could justify switching models mid-flight based on actual quality data.
    
    EarningsScores
    
    ·
    a month ago
    ·
    Reply
1

the spreadsheet to understand what happened is exactly where every multi-provider setup ends up eventually. not because people want a spreadsheet but because nothing else gives you a cross-provider view. the problem is well-named here

adin_builds

·
2 months ago
·
Reply
1

The provider dashboard issue is very real, and it becomes especially important once workflows get more sophisticated. As soon as retries, fallbacks, cached tokens, and a handful of test runs are all in the mix, a simple statement like “OpenAI was expensive this month” stops being very useful.

I’d make the distinction between estimated runtime cost and reconciled billing cost extremely clear. That separation builds trust. People will naturally rely on the workflow-level numbers, but that trust can disappear quickly if the monthly invoice tells a different story even once. The more transparent the dashboard is about what is estimated versus what is finalized, the more confident users will feel using it.

glyphharborhq

·
2 months ago
·
Reply
1

One field I would add is more security/ops than finance: spend_authorization_source.

For each call, I would want to know whether it came from an authenticated user action, a cron job, a retry, a fallback, a test run, or an admin/internal action. When AI spend spikes, that distinction matters as much as provider/model.

It also gives you a clean place to add guardrails later: per-user caps, per-workflow caps, retry ceilings, and alerts when a public endpoint starts creating paid calls without a clear owner.
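
As a rough illustration of that field and one of the guardrails, here is a sketch of a retry ceiling check. The enum values mirror the sources listed above; the `SpendSource` name, the dict keys, and the function are hypothetical.

```python
from collections import Counter
from enum import Enum


class SpendSource(Enum):
    """Possible values for spend_authorization_source (assumed spelling)."""

    USER_ACTION = "user_action"
    CRON = "cron"
    RETRY = "retry"
    FALLBACK = "fallback"
    TEST_RUN = "test_run"
    ADMIN = "admin"


def runs_over_retry_ceiling(calls, max_retries_per_run):
    """Return the run_ids whose retry count exceeds the ceiling."""
    retries = Counter(
        c["run_id"]
        for c in calls
        if c["spend_authorization_source"] is SpendSource.RETRY
    )
    return sorted(run for run, n in retries.items() if n > max_retries_per_run)


# Made-up calls: run-1 retries three times, run-2 once.
calls = [
    {"run_id": "run-1", "spend_authorization_source": SpendSource.USER_ACTION},
    {"run_id": "run-1", "spend_authorization_source": SpendSource.RETRY},
    {"run_id": "run-1", "spend_authorization_source": SpendSource.RETRY},
    {"run_id": "run-1", "spend_authorization_source": SpendSource.RETRY},
    {"run_id": "run-2", "spend_authorization_source": SpendSource.CRON},
    {"run_id": "run-2", "spend_authorization_source": SpendSource.RETRY},
]
flagged = runs_over_retry_ceiling(calls, max_retries_per_run=2)
```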

RunProbe

·
2 months ago
·
Reply
1

The reconciliation pain is real, but the number that actually changes decisions is cost per outcome, not cost per provider. We route AI spend at my SaaS and the metric I watch is what one completed job costs and what one customer costs me per month, because that is what tells me whether a plan is underpriced. Tracking fallback_of and retry_of is the smart part of your schema. Retry and fallback noise is exactly where margin quietly leaks. One honest business question: once you sit between the app and the providers, you become a dependency on someone's critical path and a new line item on their bill. What is the wedge that stops a team from copying your schema into their own dashboard once they have seen it works?

GregoryScottHenson

·
2 months ago
·
Reply
1

This framing makes sense. In practice the retry/fallback fields are the ones I would want most, because otherwise a “model is expensive” discussion can hide the real issue: one noisy workflow or fallback loop eating the budget.

Have you found workflow-level cost more useful than project-level cost so far?

s6

·
2 months ago
·
Reply
1

Yes, this is a problem that many people likely have, and at first glance, no one seems to have a solution yet, at least not to my knowledge... I hope the rest of your journey is a great success; it will be a sign that the tool solves the identified problem. Good luck!

JoaoPaulo

·
2 months ago
·
Reply
1

Hit this exact wall on the cloud side last year, running across AWS, GCP, and Azure: each one tags things differently, bills in a different timezone, and refunds show up at random. We ended up building an internal mapping layer just so finance would trust the numbers. Curious how you're handling metadata across the 4 providers. Did you normalize everything to one taxonomy or keep them separate with a join layer on top?

muskan_00

·
2 months ago
·
Reply
1. 1
  
  That cloud analogy is very close to how we think about this.
  
  The direction is a shared taxonomy on top, while still preserving provider-specific fields underneath. So teams can compare usage across providers without losing the raw details needed for debugging or billing edge cases.
  
  The tricky part is deciding which fields should be universal and which should stay provider-specific. Timezone, retries, cached tokens, and failed calls are exactly where it gets messy.
  
  evan66
  
  ·
  2 months ago
  ·
  Reply
1

the messy spreadsheet reality is so incredibly real. the exact second you split your automation pipelines across openai, claude, and gemini, financial observability completely breaks down. you end up guessing which micro-feature or script testing session caused a sudden cost spike overnight.

unifying this into a single abstraction layer with structured metadata like project_id and outcome_status is the only way to scale without getting a heart attack from the provider bills. moving the conversation to 'what does this specific workflow cost across providers' is a brilliant framework shift. you nailed the problem statement perfectly. major props on getting this out there.

Eva_NomadOS

·
2 months ago
·
Reply
1. 1
  
  Exactly. The painful part is not just the bill itself, but losing the connection between cost and the actual workflow that created it.
  
  project_id, feature, environment, provider, model, and outcome_status are the kinds of fields we think matter most.
  
  Once that metadata exists, routing and cost control become much less guessy.
  
  evan66
  
  ·
  2 months ago
  ·
  Reply
  1. 1
    
    over-engineering the early architecture is a silent killer for solo projects. if a simple manual webhook can validate the core transaction flow today, do that instead of spinning up a heavy microservice stack.
    
    keep the footprint tiny and focus entirely on getting your first few paying users. the tech debt only matters once you actually have a consistent data stream to support it.
    
    Eva_NomadOS
    
    ·
    2 months ago
    ·
    Reply
1

The capture side looks solid. The part I'd push on is whether your computed cost ties back to each provider's actual invoice at the end of the month, because token counts times list price almost never matches what gets billed. Discounts on cached tokens, minimum charges, failed calls that still cost you, free credits, price changes partway through the month, conversion fees: they all open a gap between the sum of your tracked runs and what the provider really charged. People trust the workflow numbers right up until the real bill disagrees with the dashboard once, then they stop trusting any of it. I've spent a lot of time reconciling Stripe data and that gap is always where it gets hard. Do you reconcile against the provider invoices, or estimate from tokens and list prices? If you actually reconcile, I'd put that front and center, since it's the part most tools skip.
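
To make the gap concrete, a per-provider comparison might look like the sketch below. It assumes you already have a summed estimate per provider and, once it arrives, the invoiced total. The function name, the `"estimated"` / `"reconciled"` status labels, and the sample numbers are all made up for illustration.

```python
def reconcile(estimated, invoiced):
    """Compare tracked estimates with invoiced totals, per provider."""
    report = {}
    for provider in sorted(set(estimated) | set(invoiced)):
        est = estimated.get(provider, 0.0)
        inv = invoiced.get(provider)  # None until the invoice has arrived
        if inv is None:
            report[provider] = {
                "estimated": est, "invoiced": None, "gap": None, "status": "estimated",
            }
        else:
            report[provider] = {
                "estimated": est, "invoiced": inv, "gap": inv - est, "status": "reconciled",
            }
    return report


# Made-up monthly totals in USD; provider-b has not been invoiced yet.
estimated = {"provider-a": 100.0, "provider-b": 40.0}
invoiced = {"provider-a": 92.5}
report = reconcile(estimated, invoiced)
```

A negative gap means the invoice came in below the tracked estimate (for example after cached-token discounts or credits); a positive gap means untracked spend. Either way, the status label keeps estimated numbers from being read as final.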

jakehoffman

·
2 months ago
·
Reply
1. 1
  
  This is a very fair push. Token count × list price is useful for real-time visibility, but it is definitely not the same thing as invoice reconciliation.
  
  This is actually something we’ve been struggling with for a while too. We haven’t found a really clean solution yet, especially once cached tokens, failed calls, provider-side discounts, free credits, and mid-month pricing changes get involved.
  
  Right now, the layer we’re focused on first is workflow-level attribution: project, feature, provider, model, retries, fallback, and outcome. But I agree that the next step has to separate “estimated runtime cost” from “reconciled billing cost” much more clearly.
  
  evan66
  
  ·
  2 months ago
  ·
  Reply
1

This is a real infrastructure pain because once a team uses more than one model provider, “AI cost” stops being a provider-dashboard problem and becomes a workflow-accounting problem.

The sharpest angle here is not just cheaper routing. It is visibility across AI usage: which workflow caused the spike, whether retries or fallbacks distorted the bill, and whether the stronger model actually improved the outcome enough to justify the cost. That feels much more valuable than another generic AI gateway.

One thing I’d pressure-test early is the product name. EvoLink is decent, but “link” makes it feel more like a connector layer. Your stronger category may be AI cost intelligence, model usage observability, and cross-provider routing control.

Exirra .com would fit that direction better because it feels more like an AI infrastructure and signal-intelligence brand, while still leaving room for routing, usage tracking, cost attribution, fallback analysis, and workflow-level AI spend visibility.

aryan_sinh

·
2 months ago
·
Reply