Before You Pick an LLM for Your SaaS, Test It on Real Workflows

In my previous post, I wrote about how I started rethinking the pricing model for my AI SaaS that analyzes Telegram chats, and how that led me toward an LLM router.

Now I want to share some concrete findings from the tests.

I ran real workflows on the same data: finding recommendations, creating digests, analyzing sentiment shifts, finding opposing opinions, analyzing links, cross-chat analysis, and checklist-style tasks.

I tested different models, including:

— Gemini Flash Lite
— Gemini 2.5 Flash
— Gemini 2.5 Pro
— GPT 4.1 mini
— GPT 4.1
— GPT 5.4 mini
— Claude Haiku 4.5
— Claude Sonnet 4.6
— o4-mini

The main takeaway: “one request” is a bad unit of measurement for an AI SaaS.

From the user’s perspective, everything looks the same: they just type a question. But internally, one request can be a short factual lookup, while another can be an analysis of several Telegram chats over a long period, with grouping, conclusions, and quotes.

That’s why I’m moving away from “N requests per month” toward token/credit-based limits. Each plan will include a certain amount of tokens/credits, and each request will be priced based on the model, context size, analysis depth, and estimated LLM cost.
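A minimal sketch of how a request could be converted into credits, assuming per-token prices per model tier. The tier names, prices, output budgets, markup, and credit value below are placeholders for illustration, not real provider rates or the product's actual numbers.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_million: float   # USD per 1M input tokens (placeholder value)
    output_per_million: float  # USD per 1M output tokens (placeholder value)


# Hypothetical price table; replace with the provider's current rates.
PRICES = {
    "light-model": ModelPrice(0.10, 0.40),
    "balanced-model": ModelPrice(0.30, 2.50),
    "deep-model": ModelPrice(3.00, 15.00),
}

# Hypothetical output budget per analysis depth, in tokens.
EXPECTED_OUTPUT_TOKENS = {"light": 500, "balanced": 2_000, "deep": 6_000}

USD_PER_CREDIT = 0.01  # assumption: 1 credit covers 1 cent of billed cost
MARGIN = 1.5           # assumption: markup over the raw LLM cost


def estimate_cost_usd(model: str, depth: str, context_tokens: int) -> float:
    """Estimate the raw LLM cost from context size and the expected output size."""
    price = PRICES[model]
    output_tokens = EXPECTED_OUTPUT_TOKENS[depth]
    return (
        context_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


def credits_for_request(model: str, depth: str, context_tokens: int) -> int:
    """Convert the estimated cost into whole credits, charging at least one."""
    cost = estimate_cost_usd(model, depth, context_tokens)
    return max(1, math.ceil(cost * MARGIN / USD_PER_CREDIT))


print(credits_for_request("light-model", "light", 1_000))    # 1
print(credits_for_request("deep-model", "deep", 100_000))    # 59
```

With these placeholder numbers, a short lookup and a deep multi-chat analysis differ by a factor of almost 60 in credits, even though both are "one request" to the user.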

A few findings from the tests:

1. The expensive model is not always better

In one simple QA test, Claude Sonnet 4.6 cost about $0.41 per request, while Gemini Flash Lite cost about $0.03.

That’s roughly a 14x difference.

But Sonnet did not produce a result that justified the extra cost. In some workflows, the expensive model was not even slightly better: it was comparable to, or less useful than, the light and balanced alternatives.

Claude Haiku 4.5 also turned out to be less cost-effective than I expected: on long Russian-language chats, its tokenizer split the same text into more tokens, which made it noticeably more expensive.

2. But expensive models are still useful — in specific cases

Sonnet performed well on “filter + rank + justify” tasks.

For example: finding apartment listings by price, size, and neighborhood, grouping them, and selecting the top 5. In that case, it handled constraints better and explained more clearly why each option did or didn’t fit.

The conclusion: this type of model should not be the default. It should be a premium reserve for specific task classes.

3. A balanced model can cover most real workflows

Gemini 2.5 Flash was surprisingly competitive with deep models across many tasks: digests, discussion dynamics, conflicting opinions, and cross-chat analysis.

In one cross-chat test across 4 Telegram chats, it produced the most detailed report — around 16,000 characters — for about $0.06.

Sonnet on the same task cost about $0.57, almost 10x more, but did not provide a proportionally better result.
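The ratios quoted in these tests are simple arithmetic on the per-request costs. The sketch below reproduces them and projects the gap over a month. The monthly request volume is a hypothetical number chosen only for illustration.

```python
def cost_ratio(expensive_usd: float, cheap_usd: float) -> float:
    """How many times more the expensive request costs."""
    return expensive_usd / cheap_usd


def monthly_cost(cost_per_request_usd: float, requests_per_month: int) -> float:
    """Total spend if every request costs the same amount."""
    return cost_per_request_usd * requests_per_month


# Per-request costs taken from the tests described in this post.
qa_ratio = cost_ratio(0.41, 0.03)          # simple QA: Sonnet vs Flash Lite
cross_chat_ratio = cost_ratio(0.57, 0.06)  # cross-chat: Sonnet vs 2.5 Flash

REQUESTS_PER_MONTH = 1_000  # hypothetical volume, not a measured figure

print(f"{qa_ratio:.1f}x")          # 13.7x
print(f"{cross_chat_ratio:.1f}x")  # 9.5x
print(f"${monthly_cost(0.57, REQUESTS_PER_MONTH):.0f} vs "
      f"${monthly_cost(0.06, REQUESTS_PER_MONTH):.0f}")  # $570 vs $60
```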

4. GPT 4.1 mini had a clear niche

GPT 4.1 mini worked well for practical requests: “give me a plan,” “make a checklist,” “what should I do step by step.”

In one test, it cost about $0.009 and followed the requested structure better than the other models: key points, advice, checklist, and missing information.
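"Followed the requested structure" can be checked mechanically rather than by eye. The sketch below assumes the prompt asked for four sections with these exact titles and that each title opens a line in the response. Both are assumptions for illustration; the function name is hypothetical.

```python
import re

# Assumed section titles, taken from the structure named in the test above.
REQUIRED_SECTIONS = ["Key points", "Advice", "Checklist", "Missing information"]


def missing_sections(response: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return the required titles that never appear at the start of a line.

    Leading markup such as '## ', '- ', or '1. ' before the title is ignored.
    """
    missing = []
    for title in required:
        pattern = rf"^[\W\d]*{re.escape(title)}\b"
        if not re.search(pattern, response, flags=re.IGNORECASE | re.MULTILINE):
            missing.append(title)
    return missing


sample = """## Key points
- Rent is due on the 1st.

2. Advice
Ask for the contract in writing.

Checklist:
[ ] Passport copy
"""

print(missing_sections(sample))  # ['Missing information']
```

A check like this only verifies that the sections exist. Whether their content is useful still needs a human or a separate evaluation step.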

The full GPT 4.1, on the other hand, struggled with long contexts: it hit limits on larger chat histories and did not look cost-effective enough for production scenarios.

5. Reasoning models were not a universal solution

o4-mini and GPT 5.4 mini performed well on structured tasks like “Position A vs Position B.”

But in messy human discussions, with emotion, chaos, and subtle sentiment shifts, their advantage was inconsistent.

The conclusion: reasoning is useful, but only for the right class of tasks.

What I’m changing now:

— moving from request-count limits to token/credit-based limits;
— introducing 3 analysis levels: light, balanced, deep;
— letting the user choose analysis depth, not a specific model;
— letting the router choose the LLM based on that level and the request type;
— adding fallback chains for heavier modes when the primary model is unavailable.
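The routing and fallback points above could be sketched roughly like this. The routing table is an assumption loosely based on the findings in this post, the model names are plain labels rather than provider API identifiers, and `call_model` and `ModelUnavailable` are hypothetical stand-ins for whatever client code actually talks to the providers.

```python
from typing import Callable


class ModelUnavailable(Exception):
    """Hypothetical error raised by the client when a provider is down or rate-limited."""


# Hypothetical routing table: (depth, request type) -> ordered fallback chain.
ROUTES: dict[tuple[str, str], list[str]] = {
    ("light", "default"): ["gemini-flash-lite"],
    ("balanced", "default"): ["gemini-2.5-flash", "gpt-4.1-mini"],
    ("balanced", "checklist"): ["gpt-4.1-mini", "gemini-2.5-flash"],
    ("deep", "default"): ["gemini-2.5-pro", "gemini-2.5-flash"],
    ("deep", "filter_rank"): ["claude-sonnet-4.6", "gemini-2.5-pro", "gemini-2.5-flash"],
}


def pick_chain(depth: str, request_type: str) -> list[str]:
    """Use the chain for this request type, or the depth's default chain."""
    return ROUTES.get((depth, request_type)) or ROUTES[(depth, "default")]


def run_with_fallback(
    depth: str,
    request_type: str,
    prompt: str,
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model in the chain in order; return (model used, answer)."""
    last_error = None
    for model in pick_chain(depth, request_type):
        try:
            return model, call_model(model, prompt)
        except ModelUnavailable as err:
            last_error = err  # remember the failure and try the next model
    raise RuntimeError(f"all models failed for {depth}/{request_type}") from last_error


# Usage with a fake client in which the premium model is down.
def fake_call(model: str, prompt: str) -> str:
    if model == "claude-sonnet-4.6":
        raise ModelUnavailable(model)
    return f"[{model}] answer to: {prompt}"


used, answer = run_with_fallback("deep", "filter_rank", "top 5 apartments", fake_call)
print(used)  # gemini-2.5-pro
```

The user only ever picks the depth. The request type would come from a classifier or from the product feature being used, and the model actually used is what should feed the cost calculation.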

My main takeaway after these tests:

It’s hard to price an AI SaaS properly if all requests are treated as equal.

And it’s hard to optimize quality if the product depends on a single model.

I think more AI products will end up being built around this chain:

analysis depth → suitable model → cost calculation → fallback.

Curious how other AI SaaS founders are handling this: do you let users choose a specific model, or do you hide that inside the product?