AI Context in the Real-World

Context in Context
It’s time to make context concrete. I recently wrote about AI context management techniques including optimisation, compaction, trimming and rolling. Today I wanted to show what these techniques look like in action.

I’ve been working on some large enterprise AI agents recently and I’m going to share exactly how we use our in-house orchestration to build long-running agents that never exhaust their context.

A Real-World Example
We’ve recently added fully automated AI test agents to Brunelly. You give them a URL, a brief set of instructions and off they go to explore, generate test cases and use the app as a real user.

The first time I set an agent running an exploratory test was great. Screenshots, bug reports, proposed new test cases. Then a failure once the context window was exhausted. It took less than ten minutes for the agent to fail.

To enable a test agent we provide tools that translate webpage DOMs into a much shorter JSON schema that is friendly to AI models. But there’s still a lot of data there – maybe 5KB for a complex page load. Several pages in and the context was full of past test steps.

Our in-house AI orchestrator, Maitento, provides built in context compaction and trimming functionality. Each AI interaction can set a couple of lines of configuration that define whether to enable trimming, how to apply it to tool calls, whether to enable rolling context, which messages to protect and if we want to enable compaction.

Our test runner is configured to auto-trim tool calls once they are 5-cycles deep and simply replace the response JSON with ‘Tool call result was trimmed’ so that the model is aware it made a call and that its result has now been removed. We also enable rolling context but keep the system messages protected so that it is aware of the overall task. We have rolling context setup to be bursty to get some benefit of caching so that once we hit 90% we roll back to 75% and then grow to 90% again. We do not enable compaction at all.

In this scenario we balance the cost of the interaction (trimming invalidates the cache quite frequently) with a smaller context size and never losing the overall strategy that’s being followed. This is very important for an autonomous agent that needs to know exactly what it’s exploring and where it’s up to.

Our orchestration engine is aware of the context window of each model and so presents different context to each agent based on the configuration. We could have several agents working together all seeing a different representation of the full context. We keep the full transcript in storage – trimming only affects what the model sees and the ground truth is never lost.

The test agent gets given access to an internal API within Brunelly to perform its tests. Firstly, we craft each endpoint to be tailored to how a model will work best. We combine actions to make sense to an AI using composite functionality rather than object-level views of things. This could fill up the rest of this context pretty quickly.

That’s where the last hidden gem of Maitento comes in. The orchestrator itself provides a pre-populate and post-transform phase in that can be defined in every tool call. The ID of the test that’s being run, specific credentials, tenant details, etc. are all completely removed from the model’s view of the world. Our runtime takes the OpenAPI or MCP schema, removes pre-bound elements and then presents a much smaller version to the model for requests and it does the same for responses just return relevant paths that the model needs. This means that even with an already optimised API we reduce the context bloat of tool calls by around ~30% on average in our workloads.

Your orchestrator matters.

Don’t Be Afraid to Combine Techniques
Each agent is different. This setup increases token costs as our tool trimming invalidates cache quickly – but balances it against intent-drift and longevity.

We combine rolling and trimming with window ranges to create our bursty rolling window with this agent. Others don’t need it.

Our tool calls are optimised to reduce the number of calls an agent needs to make to one per cycle and pipeline transforms sort the rest out.

We have other agents that just take in a huge amount of data, call dozens of APIs and then translate it into a large JSON blob. They don’t need any of this as they’re lifecycle and tool content is too small to matter. Optimise a lot where it matters and less so elsewhere.

Ultimately… it’s Just Short-Term Memory
In some ways AI models aren’t that different to humans. We have short-term and long-term memory. Context is the short-term memory.

If you’re writing code how often can you remember exactly what each line of code said in the file you were in 30 seconds ago? Would it even help if you did remember?

We generally have extremely vivid short-term memory in the realm of seconds that gradually tails off to less detail over minutes, hours and lifetimes. I’m sure I don’t need to remember in excruciating detail what some nginx log told me 15 years ago.

The same is true for models. Give them what they need to work their best – a context tailored to the task at hand that is present in a way your model prefers.

There is no one size fits all approach to context management, but if you aren’t actively architecting the design of your context there’s no way you’re ever going to create any long-running agentic systems that provide real-world use.

Say something nice to guy_powell…

1

this is a great write-up. we’ve run into similar issues with long-running AI workflows. the short-term context filling up fast and retries causing duplicated side effects were big headaches for us. we ended up breaking workflows into small steps, persisting state between them, and adding safe pause/resume points for human approvals. curious how you decide which messages to keep protected vs trim in your setup?

Interesting_Ride2443

·
4 months ago
·
1. 1
  That duplication pain usually shows up the moment you mix long-running workflows with tool side-effects. The model forgets what it already did, retries, and suddenly you’ve sent two emails or created two tickets. Breaking workflows into discrete steps with persisted state is exactly the right instinct, you’re externalising memory instead of trusting context alone.
  
  On protection vs trimming, we treat it as a question of intent vs history.
  
  Protected messages are anything that defines the invariant of the task:
  -The system prompt and role definition
  
  The high-level objective
  
  Guardrails and constraints
  
  Current step or phase in the workflow
  
  Any irreversible decisions already made
  
  Those form the “working contract” of the agent. If that gets trimmed, drift begins.
  
  Trim candidates are usually:
  
  Large tool responses once consumed
  
  Verbose logs
  
  Intermediate reasoning that no longer affects forward motion
  
  Redundant confirmations
  
  If a message doesn’t change the strategy or future branching decisions, it’s a trimming candidate.
  
  The other critical rule is: never let context be the source of truth for state. Context is short-term memory. Durable state, side-effects, approvals, checkpoints, should live outside it. When we resume, we reconstruct only the minimal state the model needs to reason correctly, not the entire historical transcript.
  
  In short: protect intent, externalise state, trim noise. That’s what keeps long-running agents coherent without bloating their short-term memory.
  guy_powell
  
  ·
  4 months ago
  ·