I thought limiting users to “N requests per day” was enough for CoTel. It turns out that’s a path to bankruptcy.

Recently I wrote about how the first CoTel users showed me workflows I had never even considered because I originally built the service mostly for myself. I’m very grateful to everyone who commented — it genuinely changed the way I now think about the product.

Today I want to talk about the next problem I got stuck on. Again, I’m writing this honestly, without pretending I already have the perfect answer. I’d especially appreciate advice from people who have already built products around LLM APIs and dealt with the real economics behind them.

When I started, my pricing model seemed perfectly logical:
users have a subscription plan, each plan has request limits per day/month, plus limits on Telegram history depth.

Free → up to 7 days history
Basic → up to 30 days
Pro → up to 60 days

Simple and predictable. Easy to explain.

Then I connected Claude and ran a test query analyzing 30 days of a Telegram chat. That one request cost me around $0.50.

Now imagine a Pro user paying $24/month. Let’s say their plan includes ~1500 requests monthly. If even 5–10% of those requests are large enough to cost that much, that’s 75–150 requests and roughly $37–75 in API costs, so I’m already losing money. And if someone makes those requests all day long?

Suddenly I’m paying $250–500/month in API costs for a $24 subscription.

One heavy user can easily destroy the margin from ten normal users.

And the worst part is: I can’t reliably predict in advance who that user will be.
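
To make the Pro-plan arithmetic above concrete, here is a rough sketch in Python. The plan price, request quota, and $0.50 heavy-request cost come from the numbers in this post; treating every other request as free is a simplifying assumption that understates the real cost.

```python
# Figures from the post: Pro plan at $24/month, ~1500 requests, ~$0.50 per heavy request.
# Assumption: light requests are treated as free, which understates the real cost.
PLAN_PRICE_USD = 24.0
REQUESTS_PER_MONTH = 1500
HEAVY_REQUEST_COST_USD = 0.50


def monthly_api_cost(heavy_share: float) -> float:
    """API spend for one user when `heavy_share` of their requests are heavy."""
    heavy_requests = REQUESTS_PER_MONTH * heavy_share
    return heavy_requests * HEAVY_REQUEST_COST_USD


def monthly_margin(heavy_share: float) -> float:
    """Subscription revenue minus API spend (negative means a loss)."""
    return PLAN_PRICE_USD - monthly_api_cost(heavy_share)


if __name__ == "__main__":
    for share in (0.01, 0.05, 0.10, 0.50):
        print(f"{share:.0%} heavy: cost ${monthly_api_cost(share):.2f}, "
              f"margin ${monthly_margin(share):.2f}")
```

Under these assumptions the plan breaks even at 48 heavy requests a month, which is only about 3% of the quota.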

In my previous post I mentioned feedback from a journalist who uses Telegram as a research environment and information source. One of the most valuable features for him would be asking questions across entire groups of chats at once.

He has dozens of chats organized into topic folders and wants the AI to search through all of them together.

It’s a great idea. I’m already thinking about how to implement it.

And that’s when it fully hit me: one grouped query is not “one request.”

It may actually be 10, 20, or even 30 requests happening simultaneously under the hood.

If each costs me $0.10–0.50, then one button click can become anywhere from $1 to $15 in API costs on my side, while the user still perceives it as “one request.”

That’s when I realized my entire “N requests per day” model fundamentally breaks down. Not all requests are equal.

A quick question over one day of history in a small chat costs almost nothing. A 60-day deep analysis of a large chat on a premium model can cost $0.50 or more. A grouped analysis across 20 chats can reach $10 in one click. And yet my current system treats all of them identically.
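
A small sketch of why these three request types differ so much, assuming cost is driven by input and output token counts. The per-million-token prices and the token counts are placeholders chosen to land near the figures in this post, not any provider's actual price list or measured CoTel data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Price per million tokens in USD (placeholder values, not a real price list)."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Cost of one LLM call, given its token counts."""
    total = input_tokens * price.input_per_mtok + output_tokens * price.output_per_mtok
    return total / 1_000_000


def grouped_cost(price: ModelPrice, chats: list[tuple[int, int]]) -> float:
    """One 'grouped query' is really one call per chat, so the costs add up."""
    return sum(request_cost(price, inp, out) for inp, out in chats)


PREMIUM = ModelPrice(input_per_mtok=3.0, output_per_mtok=15.0)  # hypothetical premium model

quick = request_cost(PREMIUM, input_tokens=2_000, output_tokens=300)
deep = request_cost(PREMIUM, input_tokens=150_000, output_tokens=2_000)
grouped = grouped_cost(PREMIUM, [(150_000, 2_000)] * 20)

if __name__ == "__main__":
    print(f"quick ${quick:.4f}, deep ${deep:.2f}, grouped ${grouped:.2f}")
```

With these placeholder numbers the quick question costs about a cent, the deep analysis about $0.48, and the 20-chat grouped query about $9.60, all of which a per-request counter would record as a single request.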

Right now I’m leaning toward replacing “requests” with AI credits.

Basically an internal currency tied to the real cost of computation.

The user sees:
“You have 2000 AI credits monthly.”
A quick query costs ~10 credits.
A deep analysis costs ~100–300.

Meanwhile the backend calculates actual token usage, applies a safety margin, and deducts credits accordingly.

Many AI products already work this way, and honestly it feels like the most reasonable compromise:
users don’t need to understand tokens,
but they immediately understand “cheap vs expensive.”

Tokens stay internal accounting.
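
Roughly, the deduction step could look like the sketch below. The credit value ($0.005 of provider cost per credit), the 1.3x safety margin, and the `CreditAccount` class are all hypothetical; `Decimal` is used so that rounding up to whole credits is not thrown off by floating-point error.

```python
from decimal import Decimal, ROUND_CEILING

# Assumptions for illustration only: neither value comes from a real pricing table.
CREDIT_VALUE_USD = Decimal("0.005")  # provider cost that one credit represents
SAFETY_MARGIN = Decimal("1.3")       # buffer for estimation error and price changes


def credits_for(cost_usd: Decimal) -> int:
    """Convert the actual provider cost of a request into whole credits, rounding up."""
    raw = cost_usd * SAFETY_MARGIN / CREDIT_VALUE_USD
    return max(1, int(raw.to_integral_value(rounding=ROUND_CEILING)))


class CreditAccount:
    """Hypothetical per-user balance of monthly AI credits."""

    def __init__(self, balance: int) -> None:
        self.balance = balance

    def charge(self, cost_usd: Decimal) -> int:
        """Deduct credits for a finished request; refuse if the balance is too low."""
        needed = credits_for(cost_usd)
        if needed > self.balance:
            raise ValueError("insufficient credits")
        self.balance -= needed
        return needed


if __name__ == "__main__":
    account = CreditAccount(2000)
    print(account.charge(Decimal("0.50")), account.balance)
```

With these values a $0.50 deep analysis costs 130 credits and a $0.01 quick query costs 3. Charging after the call, as here, uses real token counts; a production version would probably also need a pre-flight estimate so that a request cannot overdraw the balance.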

At the same time, I’m seriously considering removing model selection from the main interface entirely.

Right now users can explicitly choose GPT or Claude.

But honestly? Most people don’t understand model differences. And they shouldn’t have to.

It probably makes more sense to expose analysis modes instead:

Fast
Balanced
Deep

And let the backend decide what runs underneath:
Gemini Flash for speed,
GPT-4.1 mini for balanced usage,
Claude Sonnet for deeper reasoning.

Users care about outcomes, not provider names.
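
A minimal sketch of that indirection, assuming a fixed mode-to-model table. The model strings are plain labels for illustration, not real API model identifiers, and falling back to Balanced on unknown input is a design assumption.

```python
from enum import Enum


class AnalysisMode(Enum):
    FAST = "fast"
    BALANCED = "balanced"
    DEEP = "deep"


# Labels only, not real API model identifiers. The mapping can change
# on the backend without the user-facing modes changing at all.
MODE_TO_MODEL = {
    AnalysisMode.FAST: "gemini-flash",
    AnalysisMode.BALANCED: "gpt-4.1-mini",
    AnalysisMode.DEEP: "claude-sonnet",
}


def resolve_model(mode_name: str) -> str:
    """Map a user-facing mode name to a backend model label.

    Unknown mode names fall back to BALANCED (an assumption, not a requirement).
    """
    try:
        mode = AnalysisMode(mode_name.strip().lower())
    except ValueError:
        mode = AnalysisMode.BALANCED
    return MODE_TO_MODEL[mode]


if __name__ == "__main__":
    print(resolve_model("Deep"))
```

The point of the table is that swapping the model behind a mode becomes a one-line backend change that users never see.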

At the moment I’m testing:
— OpenAI GPT-4.1 mini as the cheap default
— Claude Sonnet as premium deep-analysis mode
— Gemini 2.5 Flash as a third option

Gemini is dramatically cheaper than Claude and surprisingly strong on long context windows. My current suspicion is that for many workloads — especially long Telegram histories — Gemini may perform close enough while costing 5–10x less.

If that turns out to be true, it changes the entire economics of the product.

The strangest part of all this is that I probably would’ve discovered these problems much later without real users and their workflows.

When you test your own product, your behavior is predictable.

When real people arrive with completely different workflows, you suddenly see where the real limits of your system actually are.

Right now these are the questions I still don’t have confident answers to:

How many credits should each plan include?

Right now I’m thinking backwards from target margins.
If I want ~50–60% gross margin:
Basic at $9 might allow ~$3 monthly LLM spend per user.
Pro at $24 maybe ~$10.

But that’s only a hard cap.
Real users use much less.
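
One way to arrive at numbers in that range is the sketch below. The plan prices and the 55% target margin come from this post; the $1 per user of non-LLM costs (hosting, payment fees) is a hypothetical figure added so the result lands near the $3 and $10 above.

```python
def llm_budget(price_usd: float, target_gross_margin: float,
               other_costs_usd: float = 0.0) -> float:
    """Maximum monthly LLM spend per user that still meets the target gross margin."""
    return price_usd * (1 - target_gross_margin) - other_costs_usd


# Assumption: $1/user/month of non-LLM costs. This figure is not from the post.
OTHER_COSTS_USD = 1.0

if __name__ == "__main__":
    for plan, price in (("Basic", 9.0), ("Pro", 24.0)):
        budget = llm_budget(price, 0.55, OTHER_COSTS_USD)
        print(f"{plan}: up to ${budget:.2f}/month on LLM calls")
```

With these inputs Basic allows about $3.05 and Pro about $9.80 of LLM spend per user per month.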

So how do you balance:
“don’t get destroyed by heavy users”
vs
“give enough value for users to actually feel the product”?

Should users even see model names?

Has anyone here moved from:
“Choose GPT / Claude”
to
“Choose analysis mode”?

Did users find it clearer? Or did advanced users complain?

How should subscriptions and grouped workflows be priced?

A monitoring subscription that checks for new Telegram messages every 30 minutes creates up to 48 background AI operations a day that the user doesn’t even consciously think about.

Should those consume credits at full price?
Reduced price?
Should they run through provider batch APIs for cheaper costs?
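
To get a feel for the scale of that question, here is a quick sketch. It assumes every scheduled check triggers one AI call, and the 2 credits per check and the 50% discount are made-up values for comparison only.

```python
def checks_per_day(interval_minutes: int) -> int:
    """Number of scheduled checks in 24 hours for a given polling interval."""
    return (24 * 60) // interval_minutes


def monthly_credits(interval_minutes: int, credits_per_check: int,
                    discount: float = 0.0, days: int = 30) -> int:
    """Credits one background subscription consumes per month.

    Assumes every check results in an AI call; `discount` is the fraction
    taken off the normal credit price (0.5 means half price).
    """
    total = checks_per_day(interval_minutes) * days * credits_per_check
    return round(total * (1 - discount))


if __name__ == "__main__":
    for interval in (15, 30, 60):
        full = monthly_credits(interval, credits_per_check=2)
        half = monthly_credits(interval, credits_per_check=2, discount=0.5)
        print(f"every {interval} min: {full} credits at full price, {half} at half price")
```

At a 30-minute interval that is 48 checks a day and 2880 credits a month at full price under these assumptions, which is more than a 2000-credit plan before the user has asked a single question.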

Long-context model experience

If you’ve worked with 50K+ token contexts:
which models handled it well?
Which models collapsed halfway through?
I’m especially interested in experience with Russian-language or mixed-language content.

Additional credit purchases / top-ups

Anthropic already does this:
users hit limits → buy extra credits → continue without upgrading plans.

Has anyone implemented this themselves?
Any pitfalls around fraud, refunds, accounting?

Right now I’m finishing the Gemini integration and optimizing how Telegram history is preprocessed before it is sent to the LLMs.

And another interesting discovery:
simply cleaning low-value noise from chat history —
“ok”
“yeah”
emoji-only messages
system events like “X joined the group”

— can reduce token usage by 30–60% almost for free.

Honestly, it’s probably the cheapest optimization I’ve seen in AI products so far.
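
A minimal sketch of that kind of filter, assuming messages arrive as plain strings. The word list and the system-event pattern are illustrative guesses, not the actual CoTel rules; real Telegram service messages are usually flagged by type in the API, which is more reliable than matching on text.

```python
import re

# Assumption: a tiny illustrative list of low-value replies, not a complete one.
LOW_VALUE_REPLIES = {"ok", "okay", "yeah", "yep", "thanks"}

# Assumption: system events are detected by text here purely for illustration.
SYSTEM_EVENT = re.compile(r"^.{1,64} (joined|left) the group$", re.IGNORECASE)


def is_noise(text: str) -> bool:
    """Heuristically decide whether a message is too low-value to send to an LLM."""
    stripped = text.strip()
    if not stripped:
        return True
    if not any(ch.isalnum() for ch in stripped):
        return True  # emoji-only or punctuation-only message
    if stripped.lower().strip(".!?, ") in LOW_VALUE_REPLIES:
        return True
    return SYSTEM_EVENT.match(stripped) is not None


def clean_history(messages: list[str]) -> list[str]:
    """Drop noise messages while keeping the original order."""
    return [m for m in messages if not is_noise(m)]


if __name__ == "__main__":
    history = ["ok", "Yeah!", "👍🔥", "Alice joined the group",
               "Deploy is scheduled for Friday 18:00"]
    print(clean_history(history))
```

Short replies sometimes carry meaning (an “ok” that confirms a decision, for example), so a filter like this is worth checking against real answers before trusting the token savings.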

After this I’ll start redesigning the limits system entirely:
moving from “request counts”
toward
“credits + analysis modes + depth limits + subscription limits.”

But everything I wrote above is still just my current set of hypotheses.

This is my first product, and in this particular area — LLM economics — I don’t have a mentor.

So right now I’m heavily relying on people with experience.

If you’ve gone through something similar, made mistakes, redesigned your pricing, or discovered things that unexpectedly worked (or failed badly) — I’d genuinely love to hear about it.

Thanks for reading.
Seriously — these discussions help me avoid building inside my own bubble.

Anastasiia Bashinskaia

on May 21, 2026

"not all requests are equal" is the structural insight here — once you frame it that way, the credit-per-cost model isn't a workaround for pricing, it's a more accurate representation of what the user is actually consuming. the old model wasn't wrong, it was solving for explainability at the cost of correctness, and that trade only works while the cost curve is flat. yours isn't.

a few angles that might be useful, since you asked specifically about people who've been through this kind of redesign:

on credits-per-plan (your q1) — backwards-from-margin is the right axis, but the trap is using it to set hard caps. what tends to work better is soft caps with burst tolerance: e.g. pro = ~$10 effective monthly spend, but a single power week can go to $20 before throttling kicks in. heavy-user kills usually come from sustained over-cost, not occasional spikes — and the spike weeks are often the highest-retention ones because the user is genuinely getting value. the math: lose 1-2% margin on burst weeks, save 10% churn from users who'd otherwise feel artificially capped.
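
A rough sketch of the soft-cap idea described above, read as "allow an occasional burst, throttle sustained overuse". The $10 and $20 thresholds are the ones from this comment; tracking spend per month and allowing only one over-cap month in a row are assumptions made for illustration.

```python
def usage_decision(current_spend_usd: float, previous_months_usd: list[float],
                   soft_cap_usd: float = 10.0, burst_cap_usd: float = 20.0,
                   max_consecutive_bursts: int = 1) -> str:
    """Return "allow", "allow_burst", or "throttle" for the user's next request.

    Assumption: spend is tracked per calendar month, oldest month first.
    """
    if current_spend_usd < soft_cap_usd:
        return "allow"
    if current_spend_usd >= burst_cap_usd:
        return "throttle"  # hard ceiling, even for a one-off spike
    # Between the caps: count how many of the most recent months also went over.
    streak = 0
    for spend in reversed(previous_months_usd):
        if spend < soft_cap_usd:
            break
        streak += 1
    return "allow_burst" if streak < max_consecutive_bursts else "throttle"


if __name__ == "__main__":
    print(usage_decision(14.0, [4.0, 5.0]))   # occasional spike
    print(usage_decision(14.0, [5.0, 12.0]))  # second over-cap month in a row
```

The first call is treated as a tolerated burst and the second as sustained over-cost, which is the distinction the paragraph above is drawing.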

on model names (your q2) — fast/balanced/deep wins almost universally. the small minority of power users who actually understand model differences usually have an api key already; they're not your subscription growth engine. that said, one thing worth surfacing in the ui is a "why is this deep mode?" tooltip — power users self-educate, normal users skip it. zero cost, recovers the transparency loss.

on grouped/background workflows (your q3) — the asymmetric piece is that background scheduled checks have low perceived value per execution (user didn't push the button), so they should consume credits at a discounted rate AND ideally batch through provider batch apis (claude/openai batch endpoints are usually 50% off for non-realtime). otherwise users will turn off background features the moment they realize they're eating into the credits they wanted for grouped queries.

on long-context (your q4) — gemini 2.5 flash on 50k+ token windows has held up reasonably well in patterns i've seen discussed, especially for retrieval-style queries. where it tends to soften is mid-context reasoning chains (50k token doc + multi-hop synthesis); it'll often miss details that claude catches. for telegram-style "find/summarize across chats" it should perform close enough to justify the cost gap. mixed-language is the bigger question — most coverage of gemini's behavior is english/chinese; russian-specific data is thinner, so probably worth a small a/b test on real cotel content before committing.

on the "no mentor" piece — fwiw, the questions you're asking publicly here are exactly the right ones. most founders skip directly from "i picked a pricing model" to "why is my margin negative" without ever publicly mapping the structural assumption. the post itself is doing more validation work than a mentor conversation would.

still keen on a chat whenever maternity timing works — async over voice notes also genuinely fine, even just 3 specific questions when the baby naps. no pressure.

Gubanchuk · a month ago
  
  Thanks for such a valuable comment. This is genuinely useful information.
  
  I'm currently testing around 10 models from different providers on Russian-language content across different task categories: retrieval, long-history analysis, grouped queries, reasoning, contradiction detection, and several other scenarios.
  
  The results have been quite surprising. For many tasks, Gemini has actually outperformed every other model I tested, including Claude. Claude only showed a clear advantage on a few tasks involving contradiction detection and more complex reasoning.
  
  That's exactly why I'm now building a routing layer that classifies the user's request and automatically selects the most appropriate model. So far, Gemini and GPT-4.1 mini successfully cover a large portion of the workloads I've tested.
  
  For subscriptions, I'm also leaning toward cheaper models. Subscription jobs run continuously in the background and usually process relatively small amounts of new information, so expensive models often don't provide enough additional value to justify their cost.
  
  I also really like your suggestion about adding explanations for Fast / Balanced / Deep modes. I think I'll use that.
  
  As for soft limits, I haven't implemented them yet, but it's definitely an idea worth keeping in mind.
  
  Regarding Batch APIs, I may be misunderstanding something, but right now they don't seem like a perfect fit for my use case. Many CoTel subscriptions are intended to operate almost in real time — checking for new messages every 15, 30, or 60 minutes and notifying users as soon as relevant events appear. Because of that, introducing significant delays for batch processing could reduce the value of the feature. If I'm thinking about Batch APIs incorrectly, I'd be happy to hear your perspective.
  
  Right now, after the first wave of user feedback and a demo request from a newsroom, I'm making some fairly significant changes to the product. Once that's finished, I'm really looking forward to showing the updated MVP to the Indie Hackers community.
  
  And if it's easier to discuss these topics directly, feel free to message me on Telegram: @panda_ayayai
  
  StasyBashin · a month ago