I'm building an AI product. The main model behind it costs $5 per million input tokens and $15 per million output tokens. That's Claude Sonnet. Opus is even worse: $15/$75.
There's no way to offer a flat fee at those prices. Every interaction has a meter running. And the platform takes a cut on top. So your users burn credits. You burn margins. Everyone feels it.
I didn't want to ship a credits system. I still don't. But with Anthropic's pricing, flat fee is a bet you lose.
So I started looking elsewhere.
GLM-5.1 costs $0.95/$3.15 per million tokens. That's roughly 5x cheaper than Sonnet. And it benchmarks in the top 2-4% on most things that matter. Agentic capabilities are solid. I've been testing it for a week. It works.
I tried Minimax M2.7 before that. Great on paper. But it would randomly output Chinese characters mid-sentence in English and French. Not reliable. Not something you ship to users.
Here's what I didn't expect: the real savings aren't the main model. It's the sub-agents.
My product runs multiple agents: a main one that reasons, and smaller ones that execute specific tasks. The small ones don't need Sonnet. They don't even need GLM. Mistral has tiny models that are fast, cheap, and surprisingly reliable. Calling them feels almost like calling a regular API. You send a prompt, you get an answer, you move on. No 15-second "thinking" phase. No $0.03 per call for something a $0.001 model handles fine.
So the stack is becoming: GLM for the heavy lifting, Mistral small models for the rest. Total cost per interaction drops by something like 5-8x. Flat fee starts to look possible again.
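To make the routing idea concrete, here's a minimal sketch of what that dispatch could look like. The model names, prices, and task types are illustrative placeholders (the small-model pricing especially is a made-up assumption), not a real client library:

```python
# Hypothetical routing sketch: one heavy model for reasoning,
# a cheap small model for scoped sub-agent tasks.

# Price per million tokens: (input, output), in dollars.
# "mistral-small" pricing here is an assumed placeholder.
PRICING = {
    "glm-heavy": (0.95, 3.15),
    "mistral-small": (0.10, 0.30),
}

# Map each task type to the cheapest model that can handle it.
ROUTES = {
    "reasoning": "glm-heavy",
    "extraction": "mistral-small",
    "formatting": "mistral-small",
    "classification": "mistral-small",
}

def route(task_type: str) -> str:
    """Pick a model for a task; default to the heavy model when unsure."""
    return ROUTES.get(task_type, "glm-heavy")

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the table prices above."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# One interaction: a reasoning step plus a few scoped sub-agent calls.
interaction = [
    ("reasoning", 4000, 1000),
    ("extraction", 1500, 200),
    ("formatting", 800, 300),
]
total = sum(call_cost(route(t), i, o) for t, i, o in interaction)
```

The point of the table is that the default is the expensive model, so a new or unrecognized task type fails safe on quality rather than on cost.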
I might be wrong about GLM long term. The model is young. Anthropic's updates are smoother, their ecosystem is more mature. But right now, for a solo founder trying to ship without credits, it's the best trade I've found.
Has anyone else moved away from closed model APIs for production? What did you lose that you didn't expect?
This is the right direction — most people focus too much on the main model and ignore system design.
The real leverage is in routing tasks to the cheapest model that can handle them.
Curious — has mixing models caused any consistency issues, or is it stable enough for production?
Thanks Aryan! It's actually very stable. The secret is to give the smaller models strictly scoped, API-like tasks with rigid prompts. As long as the main model handles the complex reasoning, consistency stays solid in production.
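For anyone curious what "strictly scoped, API-like tasks with rigid prompts" means in practice, here's a rough sketch. The template, field names, and validation are made up for illustration; the idea is just that the small model gets one task, a fixed output schema, and anything off-schema gets rejected:

```python
import json

def scoped_prompt(task: str, payload: str, output_keys: list[str]) -> str:
    """Build a rigid prompt: one task, fixed JSON output, no free-form text."""
    schema = ", ".join(f'"{k}": <string>' for k in output_keys)
    return (
        f"Task: {task}\n"
        f"Input: {payload}\n"
        "Respond with ONLY a JSON object, no prose, matching exactly: "
        f"{{{schema}}}"
    )

def parse_reply(reply: str, output_keys: list[str]) -> dict:
    """Validate a sub-agent reply; reject anything off-schema."""
    data = json.loads(reply)
    if set(data) != set(output_keys):
        raise ValueError("sub-agent reply does not match the schema")
    return data
```

With a wrapper like this the small model never gets open-ended work, which is most of why consistency holds up.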
Yeah, that works, but the real risk isn't consistency, it's perception.
Most setups like this technically work, but to the user they still feel slightly inconsistent and unpredictable.
And once that happens, they don't care that you saved 5x on cost.
So the leverage isn't just routing: it's making the whole system feel like one consistent product.
That's usually where products split into one of two things:
→ a "cheap infra trick"
→ something people actually trust and stick with
Right now it still sounds more like the first.
Have you thought about how this reads to a user, not just how it performs?