The Missing Engineering Stack for Production AI Agents

by Karl Mehta

The "build an agent in 5 minutes" tutorials get you to a demo. They don't get you to production. Here's the field guide for the four primitives that decide whether your agent survives contact with real users, real data, and real adversaries — context-window discipline, skill composition, capability-based security, and drift telemetry. Concrete patterns, named tradeoffs, and the enterprise integrations that let you ship past prototype.

This is part 1 of a 3-post series. Part 2 — Why current IDEs need to be redesigned for the agent era — covers the developer-tooling argument. Part 3 introduces what I'm shipping next.

Tokens — context-window discipline
A token is the unit of inference cost, the unit of latency, and the unit of model attention. Treat it like memory in a 1990s embedded system: budget every byte, evict aggressively, and never assume the next call gets the same allocation.

Prompt caching is a 90% cost cut you'd be insane to ignore
Anthropic's cache_control: { type: 'ephemeral' } marker (5-minute TTL by default, 1-hour via the extended-TTL beta) deduplicates the static prefix of your prompts at the inference layer. Cached tokens are billed at 10% of input cost; cache writes cost 25% more on the first call. The math: any system prompt + tool catalog + few-shot exemplar bank that's reused more than ~3 times per 5 minutes is a net cost win. Order matters — the cache is a prefix, not a content-addressable store, so the cached span has to be byte-identical and at the start.

messages: [
{ role: "user", content: [
{ type: "text", text: STATIC_TOOL_CATALOG, cache_control: { type: "ephemeral" } },
{ type: "text", text: STATIC_SKILLS_BUNDLE, cache_control: { type: "ephemeral" } },
{ type: "text", text: dynamicUserTurn },
]}
]
Two cache breakpoints because cache reads accumulate up to the most recent cache_control marker — splitting tool catalog from skill bundle lets either evolve without busting the other. OpenAI's automatic prefix caching (no opt-in, but no extended TTL) and Gemini's explicit CachedContent resources are the equivalents on the other major providers.

Model routing — pay Haiku rates for Opus-class outcomes
A single agent run rarely needs the same model for every step. The cost spread is enormous: Claude Haiku 4.5 is $1/$5 per million in/out, Sonnet 4.6 is $3/$15, Opus 4.7 is $15/$75. The pattern that's worked for me is a three-tier router:

Retrieval / classification / extraction → Haiku. Use structured outputs (forced JSON via tool_use with strict mode) so the model can't waste tokens on freeform.
Synthesis / reasoning over retrieved context → Sonnet. The default mid-tier; this is where 80% of business logic lives.
Tool selection / planning / disambiguation → Opus only when the planner has to coordinate >5 tool calls or weigh ambiguous user intent.

Switching costs ~50ms of router latency. The cost amortization is typically 4–8× on production workloads. The trap: don't route based on input length alone — route based on the step type. A 50-token "is this a refund request?" classifier on Haiku is 60× cheaper than the same call on Opus.

Streaming, KV reuse, and the structured-output dodge

Streaming via SSE (Anthropic, OpenAI) or gRPC bidirectional (Vertex) is non-negotiable for latency. The first token typically lands at 200–600 ms; the full response at 2–8 seconds. If your UX waits for the full response, you've added 4 seconds of perceived latency for zero product reason.

KV cache reuse across calls is the under-discussed companion to prompt caching. Modern Anthropic and OpenAI back-ends keep the attention key-value cache warm across the cache TTL. Order tool calls so the most-frequently-called tools come first in your tool list, because tool definitions are part of the prefix that gets cached.

The structured-output dodge: when you need a list, a classification, or a structured fact, don't ask the model in freeform — define a tool, force it via tool_choice, and receive a typed JSON object. You skip 50–80% of the freeform tokens the model would otherwise generate, and the output is parser-safe by construction. Pair with strict mode (OpenAI) or JSON Schema with $defs (Anthropic) to refuse off-schema outputs at the decoder.

Skills — composition, not prompts
A "skill" is the unit of behavior an agent can perform. Most production agents conflate three different things into a megaprompt: identity (who are you), capabilities (what can you do), and policies (what you must / must not do). That conflation makes prompts impossible to evolve safely. Separate them into composable fragments, then assemble at runtime.

The model I've shipped against — and what I think every production agent eventually converges on — is the trigger / action / restriction triple per skill:

{
"id": "refund-policy-2024",
"trigger": "the user asks for a refund",
"action": "verify the order is within the 30-day window, then issue a refund via tools.stripe.refund and post-confirm via tools.email.send",
"restriction": "never issue refunds > $500 without a human-approval gate; never refund subscription items in their first cycle"
}

Domain experts (PMs, ops, legal) author triples in plain English. The runtime composes them into a system-prompt slot. Versioning per skill — not per agent. Eval suites attach to the skill, so swapping out a refund policy in 2026 doesn't require reblessing the entire agent.

Tool use, MCP, and the transport question
Tools are the IO of an agent. The schema is the contract. Two opinions worth holding:

Strict JSON schemas with additionalProperties: false. Closed-world schemas catch hallucinated arguments at the validator instead of in production. Strict mode (OpenAI) and the Anthropic tool_choice + JSON-Schema combo both enforce this.

Tools should be small and idempotent. orders.refund(orderId, amountCents), not orders.handle(intent, payload). The agent's planner is dramatically more reliable when each tool does one thing with a typed input.

Once you have more than ~5 tools, the catalog itself becomes worth standardizing. Model Context Protocol (MCP) — Anthropic's open-source agent ↔ tool spec — is the answer that's consolidating the ecosystem. Three transports, three different tradeoffs:

stdio — local-process tools. Lowest latency, zero network surface. Use this for code execution, filesystem ops, anything sensitive.
SSE (deprecated in favor of StreamableHTTP) — long-poll over HTTP. Browser-friendly, easy to host. Latency ~50ms.
StreamableHTTP — single-endpoint HTTP with optional SSE for streaming responses. The current recommendation for hosted MCP servers. Compatible with most cloud LB stacks.

The plan-execute-review loop

For agents with >3 sequential tool calls, prompt the model to plan first (one message, no tool calls), execute against that plan (n messages, tool calls only), then review the result against the plan's stated success criteria (one message, no tool calls). Anthropic's Agent SDK ships this pattern via the plan_mode primitive; it's also straightforward to implement in raw fetch with three system-prompt slots.

The bonus: when the agent fails, the failure is grounded in a textual plan you can replay, eval, and red-team — instead of an opaque chain of tool calls.

Security — capability-based, not vibe-based
The threat surface of an agent is wider than people pretend. A short list:

Prompt injection — adversarial input in retrieved context, tool outputs, or user data flips the agent's instructions.
Data exfiltration — the agent calls a tool that emits sensitive data to an attacker-controlled destination (an email, a webhook, a markdown image with a query string).
Tool abuse / RCE — the agent uses a legitimate tool in a way the designer didn't intend (a shell tool, a code-exec tool).
Supply chain — a tool dependency or model weight is compromised.
Secret leakage — API keys end up in logs, prompts, or tool error messages.

Capability-based authority, not ambient authority
The security primitive that's stood up best in 50 years of OS research is the object capability: hand a process the smallest unforgeable token that lets it do exactly the thing it needs, and nothing else. Apply this to agents.

Concretely: don't give the agent a long-lived OPENAI_API_KEY with billing access. Give it a per-session token, scoped to specific endpoints, with a TTL. Every tool gets a separate principal. Authorize via OAuth 2.1 with PKCE — the agent walks the user through delegated authorization, the user sees the exact scopes, and tokens are stored in the OS keychain (libsecret on Linux, Keychain on macOS, DPAPI on Windows; Electron's safeStorage wraps the platform primitive for cross-OS).

Sandbox the tools, not just the agent

If a tool runs untrusted code or writes to a filesystem, isolate it. Three real options ranked by overhead:

WASM (Wasmtime, Wasmer) — sub-millisecond startup, deny-by-default I/O, easy to configure capability lists. The right choice for code-exec and policy-evaluation tools.
gVisor — userspace kernel; near-full Linux compatibility with a 10–100ms startup cost. Right for tool subprocesses that need the full POSIX surface.
Firecracker — microVM; ~125ms startup, hardware-backed isolation. Right for multi-tenant agent execution in shared infra.

ko/distroless container images, SLSA Level 3 build attestation, and sigstore-signed artifacts close the supply-chain surface. If your agent runs in a long-lived process, write the SBOM to the artifact registry and gate deploys on cosign verification.

Prompt injection defense
The most under-addressed threat. The mitigations that actually work:

Channel separation. Treat tool outputs and retrieved documents as data, not as instructions. Anthropic's recent research on instruction-data separation in the system prompt is the current best practice — wrap untrusted content in clearly labeled XML-ish tags and tell the model to ignore any instructions inside them.
Allowlist tool surfaces. The agent can call send_email only to addresses on a per-conversation allowlist that the user explicitly authorized. The same pattern applies to outbound HTTP, database writes, file outputs.
Output content classifiers. Run a small model over the agent's tool calls before they execute, looking for known exfil patterns (suspicious destinations, base64-encoded blobs, sensitive-field references).
HITL gates on consequential actions. Anything that costs money, sends external communication, modifies a database, or touches PII goes through a human approval before execution. The threshold is per-skill.

Trust — telemetry, not vibes
"It worked when I tested it" is not a trust story. The four signals you actually need on every agent in production:

Eval pass rate against a golden set
A regression suite of input/output pairs the agent must continue to pass. Run on every prompt change, every model upgrade, every tool catalog edit. Tag failures by skill so you can localize regressions. Pairwise LMSYS-style judging works for tone-sensitive outputs; exact-match works for structured outputs. Don't conflate them.

Drift detection
Even with a stable model, your agent's behavior drifts when the input distribution shifts — new product launches, seasonal traffic, adversarial probing. Track distribution shift on input embeddings (cosine distance from a reference centroid) and behavioral metrics (tool-call mix, refund rate, escalation rate). Alarm at 2σ; investigate at 1σ.

Behavioral canaries
Plant N synthetic inputs per day designed to exercise the prompt-injection, exfil, and jailbreak surfaces. Pass rate on canaries is your live red-team signal. When a new attack class appears in the wild, add it to the canary set; you'll know the next time someone tries it.

Audit trail with integrity
Every run captured as JSONL — input, system prompt, tool calls, model responses, costs, latencies. Hash chain over the events; periodically anchor the head into an immutable store (S3 Object Lock, GCS Bucket Lock). When auditors ask "what did the agent do on March 12 at 14:22 UTC", you have a Merkle-verifiable answer.

A composite TrustScore rolls these up: weighted blend of eval pass rate, drift score, canary survival, HITL approval rate. Per agent, per skill, per day. The score is operationally meaningful only if it's grounded in those underlying signals — a score with no traceable inputs is theater.

The compliance + enterprise integrations
For anything regulated — health, finance, government, EU operations — the trust telemetry has to map onto external frameworks. The integrations I've found genuinely useful:

TrustModel.ai for the GRC overlay — NIST AI RMF, ISO 42001, EU AI Act Article-by-Article mapping, SOC 2, FedRAMP. The TrustScore feeds directly into the control library and produces auditor-ready reports without re-instrumenting the agent.
Cisco DefenseClaw — Apache 2.0, free, OSS. Jeetu Patel announced it from the RSAC 2026 keynote stage on March 23, 2026; it's the most consequential agent-security release of the year. Four components ship in the box: Skills Scanner (capability scan before execution), MCP Scanner (allow/block on MCP server inspection), CodeGuard (static analysis for secrets, unsafe deserialization, weak crypto, and injection patterns), and a Guardrail Proxy (runtime inspection of prompts, completions, and tool calls via regex rules + optional LLM judgment). Stack is a Go gateway sidecar + Python CLI + a TypeScript plugin for the OpenClaw framework that DefenseClaw was built to protect. The framework is observable by default, with first-class Splunk connectivity for the audit-trail story above. It bridges the trust gap that has 85% of enterprises experimenting with agents but only 5% running them in production. Personal note: Jeetu Patel is one of my role models, and I started coding the integration into the IDE I'm shipping the moment he walked off the RSAC stage. The most quoted line from the announcement — "I run OpenClaw at home — that's exactly why we built DefenseClaw" — is the right framing. There's no good reason not to wrap DefenseClaw around every production agent.
OpenTelemetry GenAI — the emerging standard for agent telemetry semconv. Emit the standard span attributes (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens) and your traces work in any OTel-compatible backend.

The bar

A production agent is not a model and a prompt. It's a token economy, a skill catalog with versioning, a capability-scoped security model, and a trust telemetry stack. Each of those is a non-trivial engineering surface in its own right; together, they're more work than the "build an agent in 5 minutes" tutorials acknowledge.

The argument I'll make in part 2 is that the IDEs we have weren't built to help engineers hit this bar. They were built for the 2010 unit of work — one developer, one project, one file at a time — and the unit of work in 2026 is an agent that gets trained, guard-railed, and overseen by a domain expert who isn't the engineer. The tooling has to follow.

Karl Mehta

on May 17, 2026

Say something nice to karl_mehta…

Post Comment

1

The four-surfaces framing is the right mental model, and the one connection I'd draw tighter is between the token economy and the security section, because in practice they're the same control. A spend cap that kills a runaway and a HITL gate before a risky action are both just "don't let the agent do something expensive or irreversible on its own." Also feels like a miss that cost sits outside the TrustScore, since a run that silently 30x's its token budget is as much a trust failure as drift or a bad eval. You're already logging cost and latency in that audit chain, so that's the thing to run anomaly detection on instead of waiting for the invoice. (We're building ZopNight around roughly this, cost cap, action gate and audit as one layer, so I'm biased.) What do teams bolt on last for you, the telemetry or the cost side?

muskan_00

·
16 days ago
·
Reply
1

The 3-tier router matches what I landed on after $400/mo of Sonnet-for-everything. One thing worth adding: route by step type, not input length. A 50-token "is this a refund?" classifier running on Opus is one of the most expensive things in your stack, and you won't catch it on a per-token cost graph because the line items hide inside multi-step traces. Tagging every span with a step_kind (refund_intent, eval_judge, etc.) makes the routing decisions obvious.

The other underrated save is forcing JSON via tool_use strict mode on the Haiku tier. My retry rate on extraction dropped from 12% to under 1%. The model basically cannot emit malformed structured output, so the calls get cheaper and you stop chasing downstream parsing errors at the same time.

theuniverseson

·
2 months ago
·
Reply
1
The audit trail section is where teams will under-implement until something breaks. Two things to add:
1. The audit trail IS the replay substrate. If you're already hash-chaining every event for compliance, you've also got everything you need to replay any run from any captured state, compare two runs for divergence, and branch alternative futures from any event. That's the daily-use payoff that makes engineers actually use the audit layer instead of just letting it accumulate.
2. The S3 Object Lock / GCS Bucket Lock framing is the right enterprise default, but local-first deserves a mention. Capture chains stored on the developer's own machine (signed with Ed25519, public-key verifiable) preserve data sovereignty. Useful when the captured chain contains customer PII, internal code, or secrets that shouldn't move into a third party. Same hash chain works either way. Storage substrate is the policy decision.
The thing I keep coming back to: "what did the agent actually do" is the question logs can't answer for agentic systems. Audit fixes that for compliance. Replay extends it to debug. Same captured data, two use cases.

Solo founder shipped a runtime audit/replay tool in this space today, so this is on my mind. Happy to compare notes with anyone wrestling with the same tradeoffs.
SteelSpine

·
2 months ago
·
Reply
1

The cache_control example is right but the trap is that prefix invalidation is silent. We had a system prompt that included a 3-line 'current date is X' header right before the static tool catalog, and didn't realize until a postmortem that we'd been paying full price for every call because the prefix moved by 1 token every day. The cache won't tell you it missed; you only see it on the invoice. Once we moved the dynamic header to the end and added a daily synthetic call to seed the cache, hit rate jumped from 11% to 78%. The 'byte-identical at the start' rule needs a budget alert on real cache-hit ratio, not a code review.

theuniverseson

·
2 months ago
·
Reply
1

The cache_control example is right but the trap is that prefix invalidation is silent. We had a system prompt that included a 3-line 'current date is X' header right before the static tool catalog, and didn't realize until a postmortem that we'd been paying full price for every call because the prefix moved by 1 token every day. The cache won't tell you it missed; you only see it on the invoice. Once we moved the dynamic header to the end and added a daily synthetic call to seed the cache, hit rate jumped from 11% to 78%. The 'byte-identical at the start' rule needs a budget alert on real cache-hit ratio, not a code review.

theuniverseson

·
2 months ago
·
Reply
1

The "missing stack" framing resonates — agents demo well but most teams hit the same wall once they need observability, eval, versioned prompts, and retries that survive flaky tool calls. Curious which part of the stack you see teams underinvesting in most: eval infra, runtime durability, or guardrails?

BlackLotus

·
2 months ago
·
Reply
1

The production bar you describe is the right one. The part I would make explicit for smaller teams is the "stop and prove it" layer.

Before adding more tools, I'd want every agent run to answer four questions: what capability was it allowed to use, what did it actually do, what evidence can be replayed, and where does a human approval gate interrupt the run?

That turns the agent from a clever workflow into an operational system. Without that layer, the demo works but the business quietly inherits a new dependency it cannot audit.

fredbuilds

·
2 months ago
·
Reply
1

The demo-to-production gap is real, and the four primitives Karl identifies are the engineering ceiling. Getting there requires solving the floor first.

The floor is the setup layer. Before you are thinking about context budgets and cache breakpoints, you are debugging Docker environments, managing Browserbase credentials, and wiring infra decisions that have to be correct before you write a single prompt. For most teams, that is where the agent dies. Not in the production engineering. In the setup friction that precedes it.

Karl's patterns are exactly right for teams that have cleared setup. goffer.ai (pre-built OpenClaw agents, live in 60 seconds) is what I have been testing for the setup floor. It handles the infra layer so the engineering attention can go to the primitives Karl is describing: skill composition, drift telemetry, capability-based security.

The distinction matters: if your agent does not survive contact with setup, it never gets to survive contact with real users.

3vo

·
2 months ago
·
Reply
1

Current stack: Go backend (stdlib + chi router), Next.js frontend, Postgres, Redis for queues and cache, Claude Code + ~35 specialized agents for delivery. Self-hosted on one VPS, Docker Compose.

Regret: self-hosted Postgres. Switched from managed to save $40/month about 6 months in. Wrong call. Backup automation, WAL archiving, minor version patching — none of it is hard, but it all takes time that doesn't ship anything. At 12 active clients, $40/month managed was genuinely cheap compared to what self-hosting costs in attention.

Tried PgBouncer for connection pooling. Spent a day on it. The managed provider had handled that transparently.

The agents were the decision that compounded. I tried GitHub Copilot-style autocomplete first — useful but marginal. Purpose-built agents per task type (schema review, security audit, test generation) compounded in a way autocomplete never did.

baodev_studio

·
2 months ago
·
Reply
1

debugging agents in production is definately a completely different beast than normal software. with traditional apps you have trace logs and stack traces when things crash. with agents they dont crash, they just subtly lose context or hallucinate three steps into a complex loop. standard ides just arent built to inspect agent state or trace how a prompt got constructed across ten API calls. we need better visualizers for runtime execution flows or it is just a black box.

Dabadoro

·
2 months ago
·
Reply
1

This is one of the sharper breakdowns I’ve seen because you’re not treating “production agents” as a prompt-engineering problem. The useful frame is that the agent stack has four separate engineering surfaces: token economy, skill composition, capability-scoped security, and trust telemetry.

That distinction matters. Most teams can demo an agent, but the real enterprise blocker is whether they can prove what the agent saw, what it was allowed to do, why it chose a tool, whether behavior drifted, and whether the run is auditable later. The TrustScore idea is strongest only because you tie it back to evals, drift, canaries, approvals, and immutable logs instead of making it a vague confidence metric.

One thing I’d think about early is the naming layer for whatever you’re shipping next. If this becomes an IDE or engineering platform for production AI agents, it needs to sound like infrastructure, not just another agent framework. Exirra .com would fit that direction well: technical, enterprise-grade, and broad enough for agent tooling, runtime control, and trust telemetry.

aryan_sinh

·
2 months ago
·
Reply