Context Windows Are Not Memory: How AI Agents Actually Remember Things – An Interview with Sreekanth Ramakrishnan

The agentic AI market sits at $7.06 billion in 2025 and is on track to reach $93 billion by 2032, growing at 44.6% annually. The numbers look clean on paper. The production reality is messier. MIT's State of AI in Business 2025 found that while 60% of organizations evaluated agentic tools, only 20% reached pilot stage, and just 5% made it to production. The core barrier, according to the same research, is not infrastructure or budget. It is learning: most agent systems do not retain context, adapt to feedback, or improve over time. They forget. When an autonomous system designed to execute multi-step workflows starts forgetting what it already did, downstream failures compound quickly.

Sreekanth Ramakrishnan is a Senior Software Engineer with over a decade of experience building AI-driven personalization and real-time distributed systems at scale. Co-author of the IEEE-published paper "Towards Automatic Linkage of Knowledge Worker's Claims with Associated Evidence from Screenshots," he has spent his career designing systems that have to work reliably under high traffic, low latency requirements, and continuous model updates across millions of users. His recent writing on agent memory architecture has drawn significant attention from practitioners navigating the gap between agent demos and agent deployments. In his HackerNoon article, “Why Your AI Agent Keeps Forgetting (Even With 1M Tokens),” he explores this failure mode in depth, arguing that scaling context windows does not solve memory reliability without deliberate system design.

We sat down with Sreekanth to talk about what memory actually means for AI agents, why the context window is the wrong mental model, and what it takes to build memory systems that hold up in production.

Most engineers think about agent memory as a context window problem. Why is that framing wrong?
I spent weeks debugging an agent that kept forgetting what it had already done, and the frustrating part was that the context window was huge. There was more than enough capacity. That is when it became clear to me that context and memory are not the same thing, and confusing them is one of the most common design mistakes in agent systems.

The context window is a working area. It is finite, expensive, and performance sensitive. When an agent runs a long task, it is accumulating tool outputs, observations, and intermediate reasoning at every step. A single web page fetch can be enormous. A PDF tool response can blow through what you thought was a generous budget in one call. Some agent implementations report input-to-output token ratios approaching 100:1. That means you are paying for a massive input to get a small response, often just the next tool call. Context behaves like a CPU cache rather than a database.
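To make the input-heavy cost profile concrete, here is a rough arithmetic sketch. It assumes every model call re-reads the whole accumulated context and that each step appends one tool observation; the function name and the token counts are illustrative, not measurements from any real agent.

```python
def cumulative_tokens(steps: int, base_prompt: int, observation_tokens: int, output_tokens: int) -> tuple[int, int]:
    """Return (total input tokens, total output tokens) over a run.

    Assumption: the full context is sent again on every call, and each
    step adds the model's output plus one tool observation to it.
    """
    context = base_prompt
    total_in = 0
    total_out = 0
    for _ in range(steps):
        total_in += context      # the whole context is read on this call
        total_out += output_tokens
        context += output_tokens + observation_tokens
    return total_in, total_out


if __name__ == "__main__":
    total_in, total_out = cumulative_tokens(
        steps=20, base_prompt=1000, observation_tokens=2000, output_tokens=50
    )
    print(f"input={total_in} output={total_out} ratio={total_in / total_out:.0f}:1")
```

Because the context grows at every step, total input grows roughly with the square of the step count while output grows linearly, which is why the ratio widens on long runs.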

The deeper issue is that treating context as memory leads to a specific failure mode: context rot. As the context fills up with tool traces and observations, the model starts ignoring earlier instructions. It prioritizes recent tokens, even when those tokens are less relevant. The agent looks fine in the logs, because the context is technically there. But the effective attention is degraded. You do not see this in a quick demo. You see it after the agent has been running for a while on a genuinely complex task.

Can you walk us through how you think about agent memory as a layered system?
Once I stopped treating memory as one thing, the architecture started to make more sense. I think of it as five distinct layers, and most failures happen at the boundaries between them rather than within any single layer.

The first two are working memory and session memory. Working memory is just the tokens the model sees at each step: instructions, recent turns, and anything injected into the current context. Session memory maintains continuity across turns within a single run, tracking conversation state and tool outputs so they do not need to be manually reassembled. Both can grow unbounded if you do not manage them, and that is where you start needing the third layer: condensed memory. When you hit a context limit, you need to compact. Condensed memory is a lossy representation of prior history, injected as a summary to keep the agent's working context manageable. The tricky part is that compaction is irreversible. Whatever you summarize away, you cannot get back. You are trading correctness for efficiency, and getting that tradeoff wrong is costly.

The last two layers are durable memory and retrieval memory. Durable memory is what persists across sessions: preferences, long-lived facts, user-specific instructions, decisions made in previous runs. Retrieval memory is the indexing and tooling layer that makes the durable store actually usable at the moment of need: semantic search, keyword matching, metadata-guided navigation. A durable store without good retrieval is just a warehouse nobody can find anything in. The retrieval layer is where you go from "the information exists" to "the agent actually uses it."
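The five layers can be sketched as one small data structure. This is a minimal illustration, not a design from the interview: the `AgentMemory` class, its field names, and the naive keyword-overlap retrieval are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Hypothetical container with one field per storage layer."""
    working: list[str] = field(default_factory=list)       # what the model sees this step
    session: list[str] = field(default_factory=list)       # turns and tool outputs in this run
    condensed: str = ""                                     # lossy summary of older history
    durable: dict[str, str] = field(default_factory=dict)  # persists across sessions

    def retrieve(self, query: str, limit: int = 3) -> list[str]:
        """Retrieval layer: rank durable facts by naive keyword overlap."""
        terms = set(query.lower().split())
        scored = []
        for key, fact in self.durable.items():
            overlap = len(terms & set(fact.lower().split()))
            if overlap:
                scored.append((overlap, key, fact))
        # Highest overlap first; the key breaks ties deterministically.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [fact for _, _, fact in scored[:limit]]

    def build_context(self, instructions: str, query: str, recent_turns: int = 4) -> list[str]:
        """Assemble working memory from the other layers for one model call."""
        context = [instructions]
        if self.condensed:
            context.append(f"Summary of earlier steps: {self.condensed}")
        context.extend(f"Remembered: {fact}" for fact in self.retrieve(query))
        context.extend(self.session[-recent_turns:])
        context.append(query)
        self.working = context
        return context
```

Note that a fact phrased differently from the query would score zero here and never reach the context, which is the "information exists but is not used" failure in miniature.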

What does failure actually look like across those layers?
Working memory overload is the most common one. The model starts ignoring earlier instructions because the recent, high-volume tool outputs are drowning them out. It is not that the instructions are gone. They are just getting less attention relative to everything that came after them.

Session memory bloat and poisoning is closely related. Large tool outputs, especially from web browsing or file reads, dominate the context and push out the reasoning steps that were actually guiding the agent. The agent can start prioritizing what it recently observed over what it was originally asked to do. Then you get compaction distortion, where the summary produced during context compaction drops key constraints, and the agent makes decisions downstream based on an incomplete version of its own history. I have seen this cause agents to redo work they already completed, not because they are broken, but because the summary lost the fact that the work was done.

Durable memory drift is subtler. Older preferences override newer instructions when conflict resolution is not explicit. Or conflicting facts coexist in the store without any mechanism to resolve them. And retrieval failures are underappreciated: the correct memory exists, but the retrieval system does not surface it at the right time. You get stale recalls, irrelevant context injected into a sensitive decision, or the right information simply missed because the search did not catch the right phrase. These are not model problems. They are memory architecture problems.

How do you approach the compaction problem specifically? That feels like the highest-stakes decision in the whole system.
It is, and I do not think there is a clean answer. The bias has to be toward keeping evidence and decisions, not summarizing them away. A compaction event that loses a key constraint is worse than operating with a larger context window. The short-term efficiency gain is not worth the correctness loss.

The pattern I find most useful is what I think of as a pre-compaction flush: before context fills up and forces a summary, trigger a deliberate step where the agent writes durable state to disk. You can implement this as a silent turn with no visible output to the user, just a housekeeping step where the agent explicitly records what it knows, what it decided, and what remains unresolved. That way, the compaction event is summarizing context that has already been checkpointed, not context that contains the only copy of something important. OpenClaw implements something close to this, and it is one of the more inspectable reference implementations I have come across.
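The pre-compaction flush can be sketched roughly as follows. This is an illustration under stated assumptions, not OpenClaw's or any production implementation: the 80% threshold, the four-characters-per-token estimate, the `maybe_flush` name, and the checkpoint file layout are all hypothetical. In a real agent the `state` dictionary would be produced by the silent model turn; here it is passed in directly.

```python
import json
from pathlib import Path

FLUSH_THRESHOLD = 0.8  # assumed: flush once 80% of the token budget is used


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


def maybe_flush(history: list[str], state: dict, budget_tokens: int, checkpoint_path: Path) -> bool:
    """Write durable state to disk before compaction is forced.

    Returns True if a checkpoint was written.
    """
    used = sum(estimate_tokens(item) for item in history)
    if used < FLUSH_THRESHOLD * budget_tokens:
        return False
    checkpoint = {
        "known": state.get("known", []),            # what the agent knows
        "decided": state.get("decided", []),        # what it decided
        "unresolved": state.get("unresolved", []),  # what remains open
        "tokens_used_at_flush": used,
    }
    checkpoint_path.write_text(json.dumps(checkpoint, indent=2), encoding="utf-8")
    return True
```

Calling `maybe_flush` before each model call means the later summary is always taken over state that already has a copy on disk.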

The other principle is that condensation events should be restorable where possible. The compacted summary goes into the working context. The full event log, if you keep it, remains available for inspection and replay. You separate "what the model attends to now" from "what the system retains." That distinction matters a lot for debugging. When an agent behaves oddly three hours into a long run, you want to be able to audit the memory, not just the model.
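One way to keep "what the model attends to now" apart from "what the system retains" is an append-only event log with a separate compacted view. The sketch below is a minimal illustration; the `Run` and `Event` classes, the event kinds, and the `summarize` callable are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Event:
    step: int
    kind: str  # assumed kinds: "instruction", "decision", "tool_output"
    text: str


@dataclass
class Run:
    log: list[Event] = field(default_factory=list)  # what the system retains
    summary: str = ""                                # compacted view of older events
    compacted_through: int = 0                       # number of events the summary covers

    def record(self, kind: str, text: str) -> None:
        self.log.append(Event(len(self.log), kind, text))

    def compact(self, keep_recent: int, summarize: Callable[[list[Event]], str]) -> None:
        """Summarize older events; the log itself is never truncated."""
        cutoff = max(self.compacted_through, len(self.log) - keep_recent)
        self.summary = summarize(self.log[:cutoff])
        self.compacted_through = cutoff

    def working_context(self) -> list[str]:
        """What the model attends to now: the summary plus uncompacted events."""
        recent = [event.text for event in self.log[self.compacted_through:]]
        return ([f"Summary: {self.summary}"] if self.summary else []) + recent

    def replay(self, start: int = 0, end: Optional[int] = None) -> list[Event]:
        """Audit path: read back the original events, including compacted ones."""
        return self.log[start:end]


def keep_instructions_and_decisions(events: list[Event]) -> str:
    """Example summarizer biased toward keeping instructions and decisions."""
    return "; ".join(e.text for e in events if e.kind in {"instruction", "decision"})
```

Because `compact` only moves a cursor, anything the summary drops can still be inspected through `replay` when an agent misbehaves late in a run.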

Most teams seem to treat good durable memory as an afterthought. What does it look like in practice?
It is almost always an afterthought, and that is exactly when it becomes a problem. Teams get the prototype working and then realize they need to figure out persistence after the fact, by which point the architecture makes it hard to add cleanly.

Good durable memory has a canonical, auditable source of truth. If you cannot look at the agent's memory and understand what it knows, debugging becomes guesswork. That does not necessarily mean plain text or markdown, but it has to be inspectable. There also needs to be a clear separation between what I call journal memory, the running log of what happened during a session, and policy memory, the stable preferences and instructions that should persist. Mixing those two things is a common mistake. You end up with agents that let session artifacts bleed into global preferences, or that treat old event logs as authoritative instructions.

Explicit lifecycle governance is the other piece. Memory goes through a distill–consolidate–inject cycle repeatedly, and the rules for that process need to be written down and enforced. Who wins when there is a conflict between a global preference and a session note? What is the precedence order? "Latest user message wins; session overrides global; recency resolves conflicts within a tier" is the kind of rule that sounds obvious but is almost never documented, which means the system behavior is undefined in edge cases. Undefined behavior in a memory system shows up as weird agent decisions that nobody can explain after the fact. As a program committee member and reviewer at the 45th IBIMA International Conference, I spend a lot of time evaluating research on AI systems design, and the papers that hold up are the ones where memory lifecycle governance is treated as a first-class concern, not a configuration detail.
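The precedence rule quoted above ("latest user message wins; session overrides global; recency resolves conflicts within a tier") is small enough to write down as code. The sketch below is one possible encoding; the tier names, the `MemoryEntry` fields, and the numeric timestamps are assumptions for illustration.

```python
from dataclasses import dataclass

# Higher rank wins. The ordering encodes: user message > session > global.
TIER_RANK = {"global": 0, "session": 1, "user_message": 2}


@dataclass(frozen=True)
class MemoryEntry:
    key: str          # what the entry is about, e.g. "tone"
    value: str
    tier: str         # one of the TIER_RANK keys
    timestamp: float  # larger means more recent


def resolve(entries: list[MemoryEntry]) -> dict[str, str]:
    """Pick one value per key: tier first, then recency within a tier."""
    winners: dict[str, MemoryEntry] = {}
    for entry in entries:
        current = winners.get(entry.key)
        if current is None or (
            (TIER_RANK[entry.tier], entry.timestamp)
            > (TIER_RANK[current.tier], current.timestamp)
        ):
            winners[entry.key] = entry
    return {key: entry.value for key, entry in winners.items()}
```

With this ordering, an old session note still beats a newer global preference, which is exactly the kind of edge case that stays undefined when the rule is not written down.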

You built large-scale ML systems at a major media platform and later at a global streaming platform. How did that experience shape how you think about agent memory?
It makes me impatient with systems that only work at demo scale. The failure modes at production volume are different from the failure modes in a prototype, and memory is where I see that gap most clearly in agent systems.

When I worked on reinforcement learning-based ranking at a major media platform, the system was serving around 12,000 queries per second with roughly 4 millisecond query latency and storing close to a billion records. The thing that kept that system honest was that you could not hide a design flaw. At that volume, anything you got wrong at the architecture level showed up fast and visibly. Agent memory systems have the same property: the problems that are invisible at low request volumes become obvious the moment you have real traffic, real task complexity, and real session lengths. The difference is that most teams building agents have not run anything at that scale yet, so they are discovering the failure modes by accident in production rather than by design.

What should engineering teams do first when they sit down to build memory for an agent system?
Map the five layers before you write any code. Working memory, session memory, condensed memory, durable memory, retrieval memory. For each one, answer two questions: what is the source of truth, and what is the eviction or expiration policy. If you cannot answer both for every layer, you are not ready to build yet.
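The two-questions-per-layer check can be turned into a small readiness test. This is only a sketch: the `LayerPlan` structure and the `unanswered` helper are hypothetical names, and the answers would normally come from a design document rather than code.

```python
from dataclasses import dataclass

LAYERS = ("working", "session", "condensed", "durable", "retrieval")


@dataclass
class LayerPlan:
    source_of_truth: str = ""
    eviction_policy: str = ""


def unanswered(plans: dict[str, LayerPlan]) -> list[str]:
    """List every layer question that still has no answer."""
    gaps = []
    for layer in LAYERS:
        plan = plans.get(layer, LayerPlan())
        if not plan.source_of_truth.strip():
            gaps.append(f"{layer}: source of truth")
        if not plan.eviction_policy.strip():
            gaps.append(f"{layer}: eviction or expiration policy")
    return gaps
```

An empty result from `unanswered` corresponds to being able to answer both questions for every layer.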

The compaction strategy is the most urgent thing to define early, because retrofitting it is painful. You need to know, before the first context limit is hit, what gets summarized, what gets written to durable storage, and what gets discarded. That decision tree needs to be explicit, documented, and testable. Not in the sense of unit tests, although those are fine too, but in the sense that you can inspect the memory state at any point and explain why the agent knows what it knows. If you cannot explain it, you cannot debug it, and if you cannot debug it, you cannot trust it to run unsupervised on anything that matters.
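An explicit, inspectable compaction decision tree could be as simple as a routing table that returns both an action and a reason. The item kinds and their routing below are assumptions for illustration; a real policy would be specific to the agent and its tasks.

```python
from enum import Enum


class Action(Enum):
    PERSIST = "write to durable storage"
    SUMMARIZE = "fold into the compacted summary"
    DISCARD = "drop from the working context"


# Assumed policy table; the categories and their routing are illustrative.
COMPACTION_POLICY = {
    "constraint": Action.PERSIST,
    "decision": Action.PERSIST,
    "reasoning": Action.SUMMARIZE,
    "observation": Action.SUMMARIZE,
    "raw_tool_output": Action.DISCARD,
}


def route(kind: str) -> tuple[Action, str]:
    """Return the action for an item kind plus a reason that can be logged."""
    if kind in COMPACTION_POLICY:
        action = COMPACTION_POLICY[kind]
        return action, f"policy maps '{kind}' to {action.name}"
    # Unknown kinds are kept rather than dropped, biasing toward correctness.
    return Action.PERSIST, f"no rule for '{kind}', defaulting to PERSIST"
```

Logging the returned reason alongside each compaction event gives a record of why a given item was kept, summarized, or dropped.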

The last thing I would say is to build in a human-readable audit layer from the start. Not for every deployment, but as a development tool. The systems I have seen work best treat memory files or logs as first-class artifacts that an engineer can open and read. When something goes wrong, and something will go wrong, the difference between a two-hour debug session and a two-day one is whether the memory state is interpretable. The goal is not perfect recall. The goal is retrieving the right information at the right time, while safely discarding what is not needed. That is a systems design problem, and it rewards the same discipline as any other distributed systems problem at scale.