The "build an agent in 5 minutes" tutorials get you to a demo. They don't get you to production. Here's the field guide for the four primitives that decide whether your agent survives contact with real users, real data, and real adversaries — context-window discipline, skill composition, capability-based security, and drift telemetry. Concrete patterns, named tradeoffs, and the enterprise integrations that let you ship past prototype.
This is part 1 of a 3-post series. Part 2 — Why current IDEs need to be redesigned for the agent era — covers the developer-tooling argument. Part 3 introduces what I'm shipping next.
Prompt caching is a 90% cost cut you'd be insane to ignore
Anthropic's cache_control: { type: 'ephemeral' } marker (5-minute TTL by default, 1-hour via the extended-TTL beta) deduplicates the static prefix of your prompts at the inference layer. Cached tokens are billed at 10% of input cost; cache writes cost 25% more on the first call. The math: any system prompt + tool catalog + few-shot exemplar bank that's reused more than ~3 times per 5 minutes is a net cost win. Order matters — the cache is a prefix, not a content-addressable store, so the cached span has to be byte-identical and at the start.
messages: [
{ role: "user", content: [
{ type: "text", text: STATIC_TOOL_CATALOG, cache_control: { type: "ephemeral" } },
{ type: "text", text: STATIC_SKILLS_BUNDLE, cache_control: { type: "ephemeral" } },
{ type: "text", text: dynamicUserTurn },
]}
]
Two cache breakpoints because cache reads accumulate up to the most recent cache_control marker — splitting tool catalog from skill bundle lets either evolve without busting the other. OpenAI's automatic prefix caching (no opt-in, but no extended TTL) and Gemini's explicit CachedContent resources are the equivalents on the other major providers.
Model routing — pay Haiku rates for Opus-class outcomes
A single agent run rarely needs the same model for every step. The cost spread is enormous: Claude Haiku 4.5 is $1/$5 per million in/out, Sonnet 4.6 is $3/$15, Opus 4.7 is $15/$75. The pattern that's worked for me is a three-tier router:
Switching costs ~50ms of router latency. The cost amortization is typically 4–8× on production workloads. The trap: don't route based on input length alone — route based on the step type. A 50-token "is this a refund request?" classifier on Haiku is 60× cheaper than the same call on Opus.
Streaming, KV reuse, and the structured-output dodge
Streaming via SSE (Anthropic, OpenAI) or gRPC bidirectional (Vertex) is non-negotiable for latency. The first token typically lands at 200–600 ms; the full response at 2–8 seconds. If your UX waits for the full response, you've added 4 seconds of perceived latency for zero product reason.
KV cache reuse across calls is the under-discussed companion to prompt caching. Modern Anthropic and OpenAI back-ends keep the attention key-value cache warm across the cache TTL. Order tool calls so the most-frequently-called tools come first in your tool list, because tool definitions are part of the prefix that gets cached.
The structured-output dodge: when you need a list, a classification, or a structured fact, don't ask the model in freeform — define a tool, force it via tool_choice, and receive a typed JSON object. You skip 50–80% of the freeform tokens the model would otherwise generate, and the output is parser-safe by construction. Pair with strict mode (OpenAI) or JSON Schema with $defs (Anthropic) to refuse off-schema outputs at the decoder.
The model I've shipped against — and what I think every production agent eventually converges on — is the trigger / action / restriction triple per skill:
{
"id": "refund-policy-2024",
"trigger": "the user asks for a refund",
"action": "verify the order is within the 30-day window, then issue a refund via tools.stripe.refund and post-confirm via tools.email.send",
"restriction": "never issue refunds > $500 without a human-approval gate; never refund subscription items in their first cycle"
}
Domain experts (PMs, ops, legal) author triples in plain English. The runtime composes them into a system-prompt slot. Versioning per skill — not per agent. Eval suites attach to the skill, so swapping out a refund policy in 2026 doesn't require reblessing the entire agent.
Tool use, MCP, and the transport question
Tools are the IO of an agent. The schema is the contract. Two opinions worth holding:
Strict JSON schemas with additionalProperties: false. Closed-world schemas catch hallucinated arguments at the validator instead of in production. Strict mode (OpenAI) and the Anthropic tool_choice + JSON-Schema combo both enforce this.
Tools should be small and idempotent. orders.refund(orderId, amountCents), not orders.handle(intent, payload). The agent's planner is dramatically more reliable when each tool does one thing with a typed input.
Once you have more than ~5 tools, the catalog itself becomes worth standardizing. Model Context Protocol (MCP) — Anthropic's open-source agent ↔ tool spec — is the answer that's consolidating the ecosystem. Three transports, three different tradeoffs:
The plan-execute-review loop
For agents with >3 sequential tool calls, prompt the model to plan first (one message, no tool calls), execute against that plan (n messages, tool calls only), then review the result against the plan's stated success criteria (one message, no tool calls). Anthropic's Agent SDK ships this pattern via the plan_mode primitive; it's also straightforward to implement in raw fetch with three system-prompt slots.
The bonus: when the agent fails, the failure is grounded in a textual plan you can replay, eval, and red-team — instead of an opaque chain of tool calls.
Capability-based authority, not ambient authority
The security primitive that's stood up best in 50 years of OS research is the object capability: hand a process the smallest unforgeable token that lets it do exactly the thing it needs, and nothing else. Apply this to agents.
Concretely: don't give the agent a long-lived OPENAI_API_KEY with billing access. Give it a per-session token, scoped to specific endpoints, with a TTL. Every tool gets a separate principal. Authorize via OAuth 2.1 with PKCE — the agent walks the user through delegated authorization, the user sees the exact scopes, and tokens are stored in the OS keychain (libsecret on Linux, Keychain on macOS, DPAPI on Windows; Electron's safeStorage wraps the platform primitive for cross-OS).
Sandbox the tools, not just the agent
If a tool runs untrusted code or writes to a filesystem, isolate it. Three real options ranked by overhead:
WASM (Wasmtime, Wasmer) — sub-millisecond startup, deny-by-default I/O, easy to configure capability lists. The right choice for code-exec and policy-evaluation tools.
gVisor — userspace kernel; near-full Linux compatibility with a 10–100ms startup cost. Right for tool subprocesses that need the full POSIX surface.
Firecracker — microVM; ~125ms startup, hardware-backed isolation. Right for multi-tenant agent execution in shared infra.
ko/distroless container images, SLSA Level 3 build attestation, and sigstore-signed artifacts close the supply-chain surface. If your agent runs in a long-lived process, write the SBOM to the artifact registry and gate deploys on cosign verification.
Prompt injection defense
The most under-addressed threat. The mitigations that actually work:
Channel separation. Treat tool outputs and retrieved documents as data, not as instructions. Anthropic's recent research on instruction-data separation in the system prompt is the current best practice — wrap untrusted content in clearly labeled XML-ish tags and tell the model to ignore any instructions inside them.
Allowlist tool surfaces. The agent can call send_email only to addresses on a per-conversation allowlist that the user explicitly authorized. The same pattern applies to outbound HTTP, database writes, file outputs.
Output content classifiers. Run a small model over the agent's tool calls before they execute, looking for known exfil patterns (suspicious destinations, base64-encoded blobs, sensitive-field references).
HITL gates on consequential actions. Anything that costs money, sends external communication, modifies a database, or touches PII goes through a human approval before execution. The threshold is per-skill.
Eval pass rate against a golden set
A regression suite of input/output pairs the agent must continue to pass. Run on every prompt change, every model upgrade, every tool catalog edit. Tag failures by skill so you can localize regressions. Pairwise LMSYS-style judging works for tone-sensitive outputs; exact-match works for structured outputs. Don't conflate them.
Drift detection
Even with a stable model, your agent's behavior drifts when the input distribution shifts — new product launches, seasonal traffic, adversarial probing. Track distribution shift on input embeddings (cosine distance from a reference centroid) and behavioral metrics (tool-call mix, refund rate, escalation rate). Alarm at 2σ; investigate at 1σ.
Behavioral canaries
Plant N synthetic inputs per day designed to exercise the prompt-injection, exfil, and jailbreak surfaces. Pass rate on canaries is your live red-team signal. When a new attack class appears in the wild, add it to the canary set; you'll know the next time someone tries it.
Audit trail with integrity
Every run captured as JSONL — input, system prompt, tool calls, model responses, costs, latencies. Hash chain over the events; periodically anchor the head into an immutable store (S3 Object Lock, GCS Bucket Lock). When auditors ask "what did the agent do on March 12 at 14:22 UTC", you have a Merkle-verifiable answer.
A composite TrustScore rolls these up: weighted blend of eval pass rate, drift score, canary survival, HITL approval rate. Per agent, per skill, per day. The score is operationally meaningful only if it's grounded in those underlying signals — a score with no traceable inputs is theater.
The compliance + enterprise integrations
For anything regulated — health, finance, government, EU operations — the trust telemetry has to map onto external frameworks. The integrations I've found genuinely useful:
The bar
A production agent is not a model and a prompt. It's a token economy, a skill catalog with versioning, a capability-scoped security model, and a trust telemetry stack. Each of those is a non-trivial engineering surface in its own right; together, they're more work than the "build an agent in 5 minutes" tutorials acknowledge.
The argument I'll make in part 2 is that the IDEs we have weren't built to help engineers hit this bar. They were built for the 2010 unit of work — one developer, one project, one file at a time — and the unit of work in 2026 is an agent that gets trained, guard-railed, and overseen by a domain expert who isn't the engineer. The tooling has to follow.
debugging agents in production is definately a completely different beast than normal software. with traditional apps you have trace logs and stack traces when things crash. with agents they dont crash, they just subtly lose context or hallucinate three steps into a complex loop. standard ides just arent built to inspect agent state or trace how a prompt got constructed across ten API calls. we need better visualizers for runtime execution flows or it is just a black box.
This is one of the sharper breakdowns I’ve seen because you’re not treating “production agents” as a prompt-engineering problem. The useful frame is that the agent stack has four separate engineering surfaces: token economy, skill composition, capability-scoped security, and trust telemetry.
That distinction matters. Most teams can demo an agent, but the real enterprise blocker is whether they can prove what the agent saw, what it was allowed to do, why it chose a tool, whether behavior drifted, and whether the run is auditable later. The TrustScore idea is strongest only because you tie it back to evals, drift, canaries, approvals, and immutable logs instead of making it a vague confidence metric.
One thing I’d think about early is the naming layer for whatever you’re shipping next. If this becomes an IDE or engineering platform for production AI agents, it needs to sound like infrastructure, not just another agent framework. Exirra .com would fit that direction well: technical, enterprise-grade, and broad enough for agent tooling, runtime control, and trust telemetry.