Most people building with LLMs right now are doing some version of the same thing: write a big prompt, maybe chain a few together, cross your fingers, and hope the output doesn't hallucinate your app into oblivion.
I was doing this too. For months. And it worked, until it didn't.
I'm building a web tool builder (Valmera), and the moment I tried to use a single LLM call to handle anything beyond a trivial generation task — say, producing a full page layout with consistent styling, responsive behavior, and actual content structure — the wheels fell off. Not sometimes. Almost every time at scale.
So I went down the rabbit hole of multi-agent architectures. Here's what I've learned after 6+ months of R&D, and why I think most AI-powered dev tools are approaching this wrong.
## The fundamental problem with single-agent generation
When you ask one model to "build me a landing page with a hero section, feature grid, testimonial carousel, and a footer" — you're asking it to simultaneously be:

- an information architect, deciding what exists and how it relates
- a copywriter, producing the actual content
- a designer, making layout and styling decisions
- an engineer, implementing all of it
- a reviewer, catching its own mistakes
No human works this way. We don't design, write, code, and review in a single stream of consciousness. We iterate. We switch roles. We critique our own work and revise.
So why are we asking AI to do it in one shot?
## What a multi-agent pipeline actually looks like
The architecture I've been researching and building into Valmera breaks generation into specialized agents with distinct roles. Not in a theoretical "wouldn't it be cool" way — in a "this is running in production" way.
Here's the rough shape:
**Agent 1: Planner**
Takes the user's intent and produces a structured blueprint. Not code. Not markup. A semantic description of what needs to exist, what each section should accomplish, and how they relate. Think of it as the information architect.
**Agent 2: Generator**
Takes the blueprint and produces the actual implementation. This agent is deliberately constrained — it doesn't make design decisions, it executes them. Smaller context window, tighter instructions, more predictable output.
**Agent 3: Critic**
Reviews the output against the original intent AND the blueprint. Flags inconsistencies, missing elements, accessibility issues, and broken layouts. This is the agent most people skip, and it's arguably the most important one.
**Agent 4: Refiner**
Takes the critic's feedback and patches the output. Not a full regeneration — targeted edits. This is where you get the compounding quality gains that single-shot generation can never achieve.
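The four roles above can be sketched as a simple sequential pipeline. This is a minimal illustration, not Valmera's actual implementation: the prompts, the `Critique` shape, and the idea of passing the LLM client in as a callable are all assumptions made for the sake of a self-contained example.

```python
from dataclasses import dataclass, field
from typing import Callable

# (system_prompt, user_message) -> model text. Stand-in for any LLM client.
LLM = Callable[[str, str], str]

@dataclass
class Critique:
    passed: bool
    issues: list[str] = field(default_factory=list)

def run_pipeline(user_intent: str, llm: LLM,
                 critic: Callable[[str, str, str], Critique]) -> str:
    # 1. Planner: intent -> semantic blueprint, deliberately not code.
    blueprint = llm("planner: produce a structured blueprint, not code", user_intent)
    # 2. Generator: executes the blueprint; makes no design decisions.
    output = llm("generator: implement the blueprint exactly", blueprint)
    # 3. Critic: adversarial review against both intent and blueprint.
    critique = critic(user_intent, blueprint, output)
    # 4. Refiner: targeted patches only, not a full regeneration.
    if not critique.passed:
        fix_request = output + "\n\nIssues:\n" + "\n".join(critique.issues)
        output = llm("refiner: apply these targeted fixes only", fix_request)
    return output
```

Injecting the model as a callable also makes each stage testable with fakes, which matters once you start debugging hops between agents.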
## The key insight: agents need to disagree
The thing that surprised me most wasn't the architecture itself — it's that the system works because agents have different perspectives. The critic agent is intentionally adversarial. It's not trying to be helpful. It's trying to find problems.
This mirrors how real teams work. Your designer and your engineer will disagree about implementation. Your copywriter will push back on the layout. That tension is where quality comes from.
When I removed the critic agent and just ran planner → generator → refiner, output quality dropped by roughly 40% on our internal benchmarks. The disagreement IS the feature.
## Practical tradeoffs nobody talks about
**Latency.** A 4-agent pipeline is slower than a single call. Obviously. We're talking 8-15 seconds vs 2-3 seconds for a full page generation. The question is whether your users will tolerate that for significantly better output. In our case, yes — because the alternative is generating something bad fast and then spending 10 minutes manually fixing it.
**Cost.** More agents = more tokens = more spend. But here's the counterintuitive bit: because the generator agent has a tighter scope, its context window is smaller and its outputs are more focused. Our per-generation cost is roughly 1.6x that of a single-shot approach, not 4x.
**Orchestration complexity.** This is the real cost. Managing state between agents, handling failures gracefully (what if the critic flags something the refiner can't fix?), and ensuring the pipeline doesn't loop indefinitely — this is hard engineering work. There's no framework that solves this cleanly yet.
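The infinite-loop risk in particular is worth making concrete. One common mitigation, sketched here with illustrative names (this is a pattern, not Valmera's code), is to cap the critique/refine cycle and return a flagged best effort when the budget runs out:

```python
from collections import namedtuple

# Minimal critique shape for the sketch; any object with these fields works.
Critique = namedtuple("Critique", ["passed", "issues"])

MAX_ROUNDS = 3  # illustrative budget; tune per product

def refine_until_clean(output, critique_fn, refine_fn, max_rounds=MAX_ROUNDS):
    """Returns (output, clean). clean=False means the round budget ran out,
    so the result should be surfaced for human review rather than looped."""
    for _ in range(max_rounds):
        critique = critique_fn(output)
        if critique.passed:
            return output, True
        output = refine_fn(output, critique.issues)
    return output, False
```

The key design choice is that exhausting the budget is a normal, handled outcome, not an exception: the pipeline always terminates with something to show.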
## Where this is going
The thing I'm most excited about — and this is what we're actively researching at Valmera — is the iteration loop. Right now, the pipeline runs once: plan → generate → critique → refine → done.
But what if the user looks at the output and says "make the hero bigger and change the CTA"? That feedback should flow back into the pipeline, not trigger a full regeneration. The agents should understand what changed and why, and make surgical updates.
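One way to sketch that feedback routing, under heavy assumptions (the "local vs structural" classifier step and all function names here are hypothetical, not something Valmera has shipped):

```python
from typing import Callable

def apply_feedback(output: str, feedback: str,
                   llm: Callable[[str, str], str],
                   replan: Callable[[str], str]) -> str:
    # Cheap classification step: is this a surgical edit or a layout change?
    scope = llm("classifier: answer 'local' or 'structural'", feedback).strip()
    if scope == "local":
        # Local edits ("make the hero bigger") go straight to the refiner.
        return llm("refiner: apply only this change",
                   output + "\n\nChange: " + feedback)
    # Structural changes re-enter the pipeline at the planner.
    return replan(feedback)
```

The point of the sketch is the branch: most user feedback is a small delta, and sending it through a full plan-generate-critique cycle throws away everything the pipeline already decided.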
This is where multi-agent systems start feeling less like a pipeline and more like a team. Each agent maintains context about previous decisions, understands the user's evolving intent, and adapts.
We're not there yet. Nobody is. But the foundation — specialized agents with distinct roles that can disagree and iterate — is solid, and the results are already dramatically better than single-shot generation.
## If you're experimenting with this
A few things I wish someone had told me earlier:
**Start with two agents, not four.** Generator + Critic is enough to see massive quality gains. Add the planner and refiner once you've nailed the feedback loop between those two.
**The critic prompt matters more than the generator prompt.** Spend 80% of your prompt engineering time on teaching the critic what "good" looks like. A mediocre generator with a great critic will outperform a great generator with no critic.
**Log everything between agents.** You'll need to debug why the pipeline produced a weird output, and without inter-agent communication logs, you're flying blind.
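A trace of every hop doesn't need much machinery. A minimal sketch (field names and the one-JSON-file-per-generation choice are illustrative):

```python
import json
import time

def log_hop(trace: list, agent: str, prompt: str, response: str) -> None:
    # Append one record per agent hop so a bad generation can be replayed.
    trace.append({
        "ts": time.time(),
        "agent": agent,
        "prompt": prompt,
        "response": response,
    })

def dump_trace(trace: list, path: str) -> None:
    # One JSON file per generation keeps postmortems searchable.
    with open(path, "w") as f:
        json.dump(trace, f, indent=2)
```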
**Don't use the same model for every agent.** Your planner can be a cheaper, faster model. Your critic might need a more capable one. Mix and match based on the cognitive load of each role.
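In practice this can be as simple as a role-to-model table with a sane default. The model identifiers below are placeholders, not real model names:

```python
MODEL_BY_ROLE = {
    "planner": "cheap-fast-model",   # structure, not prose: cheap is fine
    "generator": "mid-tier-model",
    "critic": "strongest-model",     # judging quality needs the most capability
    "refiner": "mid-tier-model",
}

def model_for(role: str) -> str:
    # Unknown roles fall back to the mid-tier default.
    return MODEL_BY_ROLE.get(role, "mid-tier-model")
```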
Happy to go deeper on any of these if there's interest. I've been living in this problem space for a while now and it's one of those areas where the practical knowledge is way ahead of the published research.
I'm building Valmera (https://valmera.io) — a web tool builder that uses multi-agent pipelines under the hood. Still early, but the research is turning into real product. If this kind of architecture interests you, I'd genuinely love to hear what you're working on too.