I thought limiting users to “N requests per day” was enough for CoTel. Turns out — it’s a path to bankruptcy.

Recently I wrote about how the first CoTel users showed me workflows I had never even considered because I originally built the service mostly for myself. I’m very grateful to everyone who commented — it genuinely changed the way I now think about the product.

Today I want to talk about the next problem I got stuck on. And again, I’m writing this honestly, without pretending I already have the perfect answer. I’d especially appreciate advice from people who have already built products around LLM APIs and dealt with the real economics behind them.

When I started, my pricing model seemed perfectly logical:
users have a subscription plan, each plan has request limits per day/month, plus limits on Telegram history depth.

Free → up to 7 days history
Basic → up to 30 days
Pro → up to 60 days

Simple and predictable. Easy to explain.

Then I connected Claude and ran a test query analyzing 30 days of a Telegram chat. One request cost me around $0.50.

Now imagine a Pro user paying $24/month. Let’s say their plan includes ~1500 requests monthly. If even 5–10% of those requests (75–150 of them) are large enough to cost that much, that is $37.50–75 in API costs, and I’m already losing money. And if someone makes those requests all day long?

Suddenly I’m paying $250–500/month in API costs for a $24 subscription.
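A rough sketch of that arithmetic, using the numbers from this post ($24 plan, ~1500 requests, ~$0.50 per heavy request). The function name and the simplification that light requests cost nothing are assumptions made for illustration only.

```python
def monthly_api_cost(requests: int, heavy_share: float,
                     heavy_cost_usd: float, light_cost_usd: float = 0.0) -> float:
    """Estimated monthly API spend for one user, in USD."""
    heavy = requests * heavy_share
    light = requests - heavy
    return heavy * heavy_cost_usd + light * light_cost_usd


PLAN_PRICE_USD = 24.0  # Pro plan price from the post

# Share of requests that are "heavy" (~$0.50 each). Light requests are
# treated as free here, which understates the real bill.
for share in (0.05, 0.10, 0.50):
    cost = monthly_api_cost(1500, share, 0.50)
    print(f"{share:.0%} heavy: ${cost:.2f} API cost, "
          f"${PLAN_PRICE_USD - cost:.2f} left from the plan price")
```

Under these assumptions even the 5% case costs $37.50, which is already more than the plan price.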

One heavy user can easily destroy the margin from ten normal users.

And the worst part is: I can’t reliably predict in advance who that user will be.

In my previous post I mentioned feedback from a journalist who uses Telegram as a research environment and information source. One of the most valuable features for him would be asking questions across entire groups of chats at once.

He has dozens of chats organized into topic folders and wants the AI to search through all of them together.

It’s a great idea. I’m already thinking about how to implement it.

And that’s when it fully hit me: one grouped query is not “one request.”

It may actually be 10, 20, or even 30 requests happening simultaneously under the hood.

If each costs me $0.10–0.50, then one button click suddenly becomes anywhere from $1 to $15 in API costs on my side, while the user still perceives it as “one request.”
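A minimal sketch of that fan-out arithmetic, assuming one underlying request per chat and the $0.10–0.50 per-request range quoted above. The function name is hypothetical.

```python
def grouped_query_cost(num_chats: int, cost_per_chat_usd: float) -> float:
    """Cost of one grouped query that fans out into one request per chat."""
    return num_chats * cost_per_chat_usd


CHEAP_USD, EXPENSIVE_USD = 0.10, 0.50  # per-request range from the post

for chats in (10, 20, 30):
    low = grouped_query_cost(chats, CHEAP_USD)
    high = grouped_query_cost(chats, EXPENSIVE_USD)
    print(f"{chats} chats: ${low:.2f} to ${high:.2f} for one click")
```

The $15 upper end corresponds to 30 chats at $0.50 each.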

That’s when I realized my entire “N requests per day” model fundamentally breaks down. Not all requests are equal.

A quick question over one day of history in a small chat costs almost nothing. A 60-day deep analysis of a large chat on a premium model costs dollars. A grouped analysis across 20 chats may cost tens of dollars in one click. And yet my current system treats all of them identically.

Right now I’m leaning toward replacing “requests” with AI credits.

Basically an internal currency tied to the real cost of computation.

The user sees:
“You have 2000 AI credits monthly.”
A quick query costs ~10 credits.
A deep analysis costs ~100–300.

Meanwhile the backend calculates actual token usage, applies a safety margin, and deducts credits accordingly.
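Here is a rough sketch of what that deduction could look like. Everything specific in it is an assumption: the per-token prices are placeholders (not real provider rates), and the credit value, the 1.3 safety margin, and the function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_usd_per_mtok: float   # USD per 1M input tokens
    output_usd_per_mtok: float  # USD per 1M output tokens


# Placeholder prices for illustration only, NOT real provider rates.
PRICES = {
    "cheap-model": ModelPrice(0.50, 2.00),
    "premium-model": ModelPrice(3.00, 15.00),
}

USD_PER_CREDIT = 0.001  # assumed internal value of one credit
SAFETY_MARGIN = 1.3     # assumed buffer on top of the raw API cost


def credits_for_call(model: str, input_tokens: int, output_tokens: int) -> int:
    """Convert measured token usage into credits, rounded up, minimum 1."""
    price = PRICES[model]
    cost_usd = (input_tokens * price.input_usd_per_mtok
                + output_tokens * price.output_usd_per_mtok) / 1_000_000
    return max(1, math.ceil(cost_usd * SAFETY_MARGIN / USD_PER_CREDIT))


def deduct(balance: int, credits: int) -> int:
    """Return the new balance, refusing to go below zero."""
    if credits > balance:
        raise ValueError("not enough credits")
    return balance - credits


balance = 2000
balance = deduct(balance, credits_for_call("cheap-model", 5_000, 500))
print(balance)
```

Rounding up and charging at least one credit keeps tiny queries from being free. Whether to deduct before or after the call (estimate versus actual usage) is a separate design choice this sketch does not settle.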

Many AI products already work this way, and honestly it feels like the most reasonable compromise:
users don’t need to understand tokens,
but they immediately understand “cheap vs expensive.”

Tokens stay as internal accounting.

At the same time, I’m seriously considering removing model selection from the main interface entirely.

Right now users can explicitly choose GPT or Claude.

But honestly? Most people don’t understand model differences. And they shouldn’t have to.

It probably makes more sense to expose analysis modes instead:

Fast
Balanced
Deep

And let the backend decide what runs underneath:
Gemini Flash for speed,
GPT-4.1 mini for balanced usage,
Claude Sonnet for deeper reasoning.
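A minimal sketch of that routing, assuming the mode-to-model mapping listed above. The model strings are readable labels, not exact API model identifiers, and the names `Mode` and `pick_model` are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    FAST = "fast"
    BALANCED = "balanced"
    DEEP = "deep"


# Labels only. Real API model ids differ by provider and version.
MODEL_FOR_MODE = {
    Mode.FAST: "gemini-flash",
    Mode.BALANCED: "gpt-4.1-mini",
    Mode.DEEP: "claude-sonnet",
}


def pick_model(mode: str) -> str:
    """Map a user-facing analysis mode to a backend model label.

    Raises ValueError for an unknown mode.
    """
    return MODEL_FOR_MODE[Mode(mode.lower())]


print(pick_model("Deep"))
```

Keeping the mapping in one table means the model behind a mode can be swapped without changing anything the user sees.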

Users care about outcomes, not provider names.

At the moment I’m testing:
— OpenAI GPT-4.1 mini as the cheap default
— Claude Sonnet as premium deep-analysis mode
— Gemini 2.5 Flash as a third option

Gemini is dramatically cheaper than Claude and surprisingly strong on long context windows. My current suspicion is that for many workloads — especially long Telegram histories — Gemini may perform close enough while costing 5–10x less.

If that turns out to be true, it changes the entire economics of the product.

The strangest part of all this is that I probably would’ve discovered these problems much later without real users and their workflows.

When you test your own product, your behavior is predictable.

When real people arrive with completely different workflows, you suddenly see where the real limits of your system actually are.

Right now these are the questions I still don’t have confident answers to:

  1. How many credits should each plan include?

Right now I’m working backwards from target margins.
If I want ~50–60% gross margin:
Basic at $9 might allow ~$3 of monthly LLM spend per user.
Pro at $24, maybe ~$10.

But that’s only a hard cap.
I expect most real users to use much less.
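The backwards calculation can be sketched like this. It assumes gross margin is simply (price minus variable costs) divided by price, and that LLM spend is the only variable cost unless `other_costs_usd` is set. The function name is hypothetical.

```python
def max_llm_spend(price_usd: float, target_margin: float,
                  other_costs_usd: float = 0.0) -> float:
    """Ceiling on monthly LLM spend per user that still meets the target margin."""
    if not 0 <= target_margin < 1:
        raise ValueError("target_margin must be in [0, 1)")
    return round(price_usd * (1 - target_margin) - other_costs_usd, 2)


for plan, price in (("Basic", 9.0), ("Pro", 24.0)):
    at_60 = max_llm_spend(price, 0.60)
    at_50 = max_llm_spend(price, 0.50)
    print(f"{plan}: ${at_60:.2f} at 60% margin, ${at_50:.2f} at 50% margin")
```

With no other variable costs, a 50–60% margin gives a ceiling of roughly $3.60–4.50 for Basic and $9.60–12.00 for Pro, so the ~$3 and ~$10 figures sit at or below that range.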

So how do you balance:
“don’t get destroyed by heavy users”
vs
“give enough value for users to actually feel the product”?

  2. Should users even see model names?

Has anyone here moved from:
“Choose GPT / Claude”
to
“Choose analysis mode”?

Did users find it clearer? Or did advanced users complain?

  3. How should subscriptions and grouped workflows be priced?

A subscription that checks for new Telegram messages every 30 minutes can create dozens of background AI operations per day (up to 48) that the user doesn’t even consciously think about.

Should those consume credits at full price?
Reduced price?
Should they run through provider batch APIs for cheaper costs?
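To make the trade-off concrete, here is a small sketch of what one such subscription would consume per month. It assumes every check triggers one AI operation priced like a quick query (~10 credits, the figure used earlier in this post). The `price_multiplier` discount and the function name are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def background_credits_per_month(interval_minutes: int, credits_per_check: int,
                                 price_multiplier: float = 1.0,
                                 days: int = 30) -> int:
    """Credits one recurring subscription consumes in a month."""
    checks_per_day = MINUTES_PER_DAY // interval_minutes
    return round(checks_per_day * days * credits_per_check * price_multiplier)


full = background_credits_per_month(30, 10)
reduced = background_credits_per_month(30, 10, price_multiplier=0.5)
print(f"full price: {full} credits/month, half price: {reduced} credits/month")
```

Under these assumptions a single 30-minute subscription uses 14,400 credits a month at full price and 7,200 at half price, both well above a 2000-credit plan. That suggests background checks probably need their own pricing or their own limit, however the discount question is answered.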

  4. Long-context model experience

If you’ve worked with 50K+ token contexts:
which models handled it well?
Which models collapsed halfway through?
I’m especially interested in experience with Russian-language or mixed-language content.

  5. Additional credit purchases / top-ups

Anthropic already does this:
users hit limits → buy extra credits → continue without upgrading plans.

Has anyone implemented this themselves?
Any pitfalls around fraud, refunds, accounting?

Right now I’m finishing the Gemini integration and optimizing how Telegram history is preprocessed before it is sent to the LLMs.

And another interesting discovery: simply cleaning low-value noise from chat history can reduce token usage by 30–60% almost for free. By noise I mean things like:
“ok”
“yeah”
emoji-only messages
system events like “X joined the group”
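A minimal sketch of that kind of noise filter. The message shape (a dict with `kind` and `text` keys) is an assumption for illustration, not the actual Telegram or CoTel data model, and the filler list only contains the two examples from this post.

```python
FILLER = {"ok", "yeah"}  # extend with whatever counts as filler in your chats


def is_noise(msg: dict) -> bool:
    """True for system events, emoji/punctuation-only messages, and bare filler words."""
    if msg.get("kind") == "system":
        return True
    text = msg.get("text", "").strip()
    # No letters or digits at all: empty, emoji-only, or punctuation-only.
    if not any(ch.isalnum() for ch in text):
        return True
    return text.lower().strip(".!?, ") in FILLER


def clean_history(messages: list[dict]) -> list[dict]:
    """Drop low-value messages before the history is sent to a model."""
    return [m for m in messages if not is_noise(m)]


history = [
    {"kind": "system", "text": "X joined the group"},
    {"kind": "user", "text": "ok"},
    {"kind": "user", "text": "👍🔥"},
    {"kind": "user", "text": "Meeting moved to 15:00"},
]
print([m["text"] for m in clean_history(history)])
```

A filter like this can occasionally drop a meaningful short reply (an "ok" that answers a direct question), so it is worth checking on real chats before trusting the savings.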

Honestly, it’s probably the cheapest optimization I’ve seen in AI products so far.

After this I’ll start redesigning the limits system entirely:
moving from “request counts”
toward
“credits + analysis modes + depth limits + subscription limits.”

But everything I wrote above is still just my current set of hypotheses.

This is my first product, and in this particular area — LLM economics — I don’t have a mentor.

So right now I’m heavily relying on people with experience.

If you’ve gone through something similar, made mistakes, redesigned your pricing, or discovered things that unexpectedly worked (or failed badly) — I’d genuinely love to hear about it.

Thanks for reading.
Seriously — these discussions help me avoid building inside my own bubble.

on May 21, 2026