The Current State of Context Management for Agentic Systems
The context window is the only world your agent lives in.
It can’t look outside it. It can’t remember what happened two calls ago unless you explicitly put that information back in. It has no persistent self. It is, in the most literal sense, a stateless function: tokens in, tokens out.
This is the constraint that defines everything in agentic AI engineering. And in 2026, the practitioners who are building reliable production systems have stopped treating it as a nuisance and started treating it as the primary engineering surface.
They call it context engineering. The name is disputed - half of r/aiagents will tell you it’s a buzzword. They’re wrong and right at the same time.
Why prompt engineering was always a partial answer
When GPT-3 landed in 2020, the skill that emerged was prompt engineering: the art of crafting text inputs that coax better outputs from a model. Chain-of-thought. Role assignment. Few-shot examples. Constraint setting. These are real techniques. They still work.
But they were always working around a deeper problem.
Andrej Karpathy, who has been thinking about this longer than most, put it plainly in March 2026:
“It’s not really about ‘prompting’, it’s about context and spec engineering, and then all of the other harness tools.”
The distinction matters. Prompt engineering asks: what is the best way to phrase this instruction? Context engineering asks: what does this model need to know, in what form, assembled how, to have any chance of doing this task correctly?
The Paris hotel example from IBM Technology’s widely-watched explainer video is instructive. An agent booked a hotel in Paris, Kentucky instead of Paris, France. You could frame that as a prompting failure - the instruction wasn’t specific enough. But it’s more accurately a context failure: the agent didn’t have access to the user’s calendar, didn’t have a tool to verify the conference location, and didn’t have the travel policy that would have bounded its choices. Better wording would not have fixed this. Better context architecture would have.
IBM Technology’s transcript cuts to the core:
“Prompt engineering gives you better questions. Context engineering gives you better systems.”
The anatomy of what the model actually sees
When a production agent makes an inference call, the context window is not a blank slate filled with your prompt. It is an assembled artifact, and at scale, most of it is dynamic.
IBM Technology’s benchmark holds: roughly 20% of the context is static (system instructions, tool definitions, identity and safety guardrails), and 80% is dynamically assembled at runtime. That runtime assembly includes retrieved documents from vector stores, compressed conversation history, agent scratchpad and task state, current tool outputs, and the injected task-specific data that makes this call different from the last one.
The critical insight is that the 80% is the engineering problem. Static instructions are a one-time design decision. The dynamic 80% is what changes with every call, every user, every task state. Every decision about what to retrieve, how to compress conversation history, when to prune tool outputs - those are architectural decisions with direct, measurable impact on output quality.
This is why the r/LocalLLaMA community thread titled “Context Engineering = Information Architecture for LLMs” has held up as one of the most shared framings. It’s not wrong. Designing what goes in the context window is structurally similar to designing a database schema: the decisions you make upfront determine what queries (inferences) are possible and how expensive they are.
The four components of runtime context
Production agents pull from four sources at inference time:
1. Memory (retrieved from storage)
Past interactions, user preferences, domain knowledge. Retrieved via vector search, not loaded wholesale. The retrieval decision - what to retrieve and how many chunks - is one of the highest-leverage decisions in context engineering.
2. State (agent’s working memory)
Where the agent is in a multi-step process. Did the database query succeed? What was the result? What was the user’s clarification in turn 3? This is the part that breaks most often in stateless implementations and the part that requires a real state store in production.
3. Tools (available capabilities)
Not just function definitions but the full interface design: what each tool does, when to use it, what inputs it expects, what constraints apply. Tool descriptions are context. Poorly written tool descriptions are context rot waiting to happen.
4. History (compressed conversation)
In short interactions, you can fit the full conversation. In long agentic workflows - a code review agent that has made 40 tool calls over 20 minutes - you cannot. You need a compression strategy: summarize older turns, keep recent ones verbatim, remember critical decisions explicitly.
The architectural question is not whether to have these four sources. It’s how to budget the context window across them at runtime, and how to keep each source clean.
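A minimal sketch of that budgeting decision - all helper names, ratios, and the token budget are illustrative assumptions, not any specific framework’s API:

```python
# Illustrative sketch: budgeting a context window across the four
# runtime sources (memory, state, tools, history). Numbers and names
# are hypothetical.

def assemble_context(memory_chunks, state_summary, tool_defs, history,
                     count_tokens, budget=8000):
    """Assemble context under a token budget, most critical last."""
    # Fixed allocations: tool definitions and task state always go in.
    spent = count_tokens(tool_defs) + count_tokens(state_summary)

    # History gets priority over retrieved memory: recent turns first.
    kept_history = []
    for turn in reversed(history):
        cost = count_tokens(turn)
        if spent + cost > budget * 0.6:   # cap history at ~60% of budget
            break
        kept_history.append(turn)
        spent += cost
    kept_history.reverse()

    # Fill the remainder with retrieved memory, best-scored first.
    kept_memory = []
    for chunk in sorted(memory_chunks, key=lambda c: c["score"], reverse=True):
        cost = count_tokens(chunk["text"])
        if spent + cost > budget:
            break
        kept_memory.append(chunk["text"])
        spent += cost

    # Order matters: task-critical content closest to the generation point.
    return "\n\n".join([tool_defs, *kept_memory, *kept_history, state_summary])
```

The point of the sketch is the shape, not the ratios: each source gets an explicit allocation policy instead of being appended until the window overflows.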
Context rot: the quiet failure mode
It is natural to assume that more context means better performance. You have a 1 million token window, so fill it with everything: every tool call output, every past interaction, every retrieved document, every intermediate reasoning step. The model has everything it needs.
The output gets worse.
This is context rot, and it’s now recognized as one of the fundamental challenges in production LLM systems. Weaviate’s engineering team describes it directly:
“Context rot degrades model performance as context windows grow with poorly curated information.”
The mechanism is attention dilution. Language models attend to tokens, and attention is not uniform - proximity matters, relevance signals matter, and noise tokens compete for attention budget with signal tokens. A 1M token window stuffed with low-relevance content is not a 1M token advantage; it’s a handicap.
This curve is illustrative. It is derived from patterns documented in Liu et al., “Lost in the Middle: How Language Models Use Long Contexts” (2023) - which demonstrated empirically that retrieval accuracy degrades when relevant content sits in the middle of a long context rather than at the edges - combined with Weaviate’s context engineering analysis and practitioner benchmarking from r/LocalLLaMA. The specific “rot onset” threshold varies by model and task, but the structural shape holds consistently: uncurated append-only context degrades faster and peaks earlier than curated context.
The curve has a peak. Before the peak, adding context improves output quality - you’re giving the model what it needs. After the peak, you’re adding noise that crowds out signal. The engineering work is to stay in the high-quality zone: aggressive curation, relevance scoring, and pruning.
This is why context engineering is closer to data engineering than to prompt writing. You are building a pipeline that ensures only high-signal tokens reach the model at each step. Don’t retrieve top-k chunks blindly - score for relevance to the current task state. Summarize old conversation turns rather than appending indefinitely. Extract only the fields the model needs from API responses; don’t dump the full JSON. And treat the same document retrieved twice as noise, not reinforcement.
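Two of those rules - relevance thresholds and deduplication - can be sketched as a single pipeline step (the score field and dedup key are assumptions for illustration):

```python
# Illustrative pipeline step: filter retrieved chunks by a relevance
# threshold and drop duplicates, so only high-signal tokens reach the
# model. Chunk format and threshold are assumptions.

import hashlib

def curate_chunks(chunks, min_score=0.55):
    """Keep relevance-scored chunks above threshold, deduplicated."""
    seen = set()
    curated = []
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if chunk["score"] < min_score:
            break  # sorted descending, so everything after is noise
        digest = hashlib.sha256(chunk["text"].encode()).hexdigest()
        if digest in seen:
            continue  # same document retrieved twice is noise, not reinforcement
        seen.add(digest)
        curated.append(chunk)
    return curated
```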
The r/LocalLLaMA community has been building this discipline empirically. One team reported processing 1M+ emails for a context engineering pipeline and found that naive retrieval consistently hurt results; section-level retrieval with relevance thresholds was the fix.
Memory architecture: three tiers
Enterprise agentic systems that survive production have converged on a three-tier memory architecture. The inspiration from cognitive science is not accidental - the constraints are structurally similar.
Tier 1: Working Memory (in-context)
What the model sees right now. Fastest, scarcest. The constraint is the context window size. Managing this tier is about real-time curation: every token that goes in should earn its place.
Tier 2: Episodic / Short-term Memory
The session state. Compressed conversation history, task checkpoints, intermediate outputs. Stored in Redis or a fast key-value store. This is where you implement rolling window compression: keep the last N turns verbatim, summarize everything before that into a compact representation, and explicitly save critical facts (user decisions, error states, confirmed data) as structured notes rather than relying on the summary.
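A sketch of that rolling-window policy, where `summarize` stands in for an LLM summarization call and `pinned_facts` holds the explicitly saved critical facts (all names are illustrative):

```python
# Rolling-window compression sketch. `summarize` is a placeholder for
# an LLM summarization call; `pinned_facts` are critical decisions
# saved as structured notes rather than left to the lossy summary.

def compress_history(turns, pinned_facts, summarize, keep_last=6):
    """Keep the last N turns verbatim; fold older turns into a summary."""
    older, recent = turns[:-keep_last], turns[-keep_last:]
    parts = []
    if older:
        parts.append("Summary of earlier conversation:\n" + summarize(older))
    if pinned_facts:
        # Critical facts survive compression explicitly.
        parts.append("Pinned facts:\n" + "\n".join(f"- {f}" for f in pinned_facts))
    parts.extend(recent)
    return parts
```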
Tier 3: Semantic / Long-term Memory
The persistent knowledge base. Vector embeddings of past interactions, user preferences, domain knowledge. Retrieved via hybrid semantic + keyword search. This is what makes an agent feel like it “remembers” you across sessions - but the implementation is retrieval, not actual memory.
The engineering challenge is the bridge: RAG retrieval moves Tier 3 to Tier 1, and every retrieval call is a context budget decision. How you chunk documents, what embedding model you use, how you rank retrieved candidates, and how many tokens you’re willing to spend per retrieval - these are the decisions that determine whether your agent is accurate or not.
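One way to make that budget decision explicit is to spend a fixed token allowance per retrieval rather than a fixed top-k - a sketch under assumed interfaces (`search` returning ranked pairs is hypothetical):

```python
# Sketch of the Tier 3 -> Tier 1 bridge as a budget decision: rank
# candidates, then stop at a per-call token allowance instead of a
# fixed top-k. All names are illustrative.

def retrieve_for_context(query, search, count_tokens, token_allowance=1500):
    """Spend a fixed token allowance per retrieval, best candidates first."""
    candidates = search(query)  # assumed to return ranked (score, text) pairs
    selected, spent = [], 0
    for score, text in candidates:
        cost = count_tokens(text)
        if spent + cost > token_allowance:
            continue  # skip oversized chunks; cheaper ones may still fit
        selected.append(text)
        spent += cost
    return selected
```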
The r/LocalLLaMA thread “Does Context Engineering (RAG) actually reduce hallucinations?” has a clear practitioner consensus: yes, but only if retrieval is good. Bad RAG - returning long, low-relevance chunks - can increase hallucinations by flooding the context with authoritative-sounding but irrelevant content.
The stateless default: a multi-step liability
This is the one that costs teams the most time.
Most tutorials and demos build stateless agents: each LLM call is independent. The system prompt is static. User message comes in, response goes out. Clean, simple, fast to build.
It works until the agent needs to do anything that takes more than one step.
A code review agent makes this concrete. It reads the diff, queries the internal style guide, runs a security check tool, retrieves relevant past PR decisions, and writes the review. Five steps. In a stateless design, by the time you reach step 4, the agent may have no access to what the security check in step 3 returned unless you explicitly passed it forward. By step 5, if the context is near the limit, early tool outputs may be truncated.
The Adi Polak podcast that circulated widely on X this week framed it clearly: stateless prompt tricks fall short for agentic work because agents need stateful context management - the context window must carry forward what matters and discard what doesn’t, across the entire task trajectory.
The fix requires actual infrastructure: a state store, a compression strategy, and a clear policy for what gets carried forward versus dropped. Agent state must be persisted externally (task progress, tool outputs, user decisions) so it survives across LLM calls and can be retrieved selectively. The sequence of agent steps should be treated as a first-class object: the context at step N is a function of what happened at steps 1 through N-1, curated. For long-running tasks, snapshot state periodically; this enables retries and debugging without full replay.
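A minimal sketch of that infrastructure - storage here is an in-memory list where production would use Redis or similar, and all names are illustrative:

```python
# Minimal external state store sketch: persist each agent step so the
# context at step N can be assembled from steps 1..N-1 selectively, and
# snapshot periodically to enable retries without full replay.

import copy

class AgentStateStore:
    def __init__(self, snapshot_every=5):
        self.steps = []          # ordered record of what happened
        self.snapshots = {}      # step count -> frozen copy of state
        self.snapshot_every = snapshot_every

    def record(self, step_name, result):
        self.steps.append({"step": step_name, "result": result})
        n = len(self.steps)
        if n % self.snapshot_every == 0:
            # Deep-copy so later mutation can't corrupt the snapshot.
            self.snapshots[n] = copy.deepcopy(self.steps)

    def context_for_next_step(self, relevant_steps):
        """Selective carry-forward: only named steps enter the context."""
        return [s for s in self.steps if s["step"] in relevant_steps]
```

The design choice worth noting is `context_for_next_step`: the harness names what it needs, rather than receiving everything that happened.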
@arfujeddy14 articulated the practitioner shift on X this week:
“Power users stopped obsessing over prompt templates. They feed rich context first — notes, goals, constraints — and let the model reason. The AI suddenly stops guessing and starts executing.”
The language is informal but the insight is correct. When context is rich, specific, and well-structured, the model doesn’t need clever prompting - the signal is high enough that it reasons correctly. When context is thin or poorly curated, no amount of prompt engineering compensates.
The harness: context engineering as systems engineering
At scale, context engineering becomes harness engineering - the design of the system that assembles, manages, and curates the context at runtime.
@LiorPollak’s observation from April 2026 is worth quoting directly:
“Claude Code doesn’t use a static system prompt, it dynamically assembles one with conditional blocks for identity, rules, tools, safety, Git context, and user flags. This is fine context engineering in practice.”
The leaked Claude Code source code confirms this claim is accurate. constants/prompts.ts exposes an async getSystemPrompt() function (line 444) that assembles an array of named sections at runtime. There is an explicit SYSTEM_PROMPT_DYNAMIC_BOUNDARY string constant (line 114) separating the static prefix - globally cacheable across API requests - from dynamic sections managed by a systemPromptSection() registry with per-section memoization. Sections re-evaluate on /clear and /compact; the sole exception is MCP server instructions, which use DANGEROUS_uncachedSystemPromptSection() because MCP servers connect and disconnect mid-session.
| Claimed block | Source function (constants/prompts.ts) | Nature |
|---|---|---|
| Identity | getSimpleIntroSection() | Static, cached globally |
| Rules | getSimpleDoingTasksSection() | Static, with USER_TYPE conditionals |
| Tools | getUsingYourToolsSection(enabledTools) | Conditional on active tool set |
| Safety | getActionsSection() | Static |
| Git context | computeSimpleEnvInfo() - calls getIsGit() + worktree check | Dynamic, memoized |
| User flags | getLanguageSection(), getOutputStyleConfig(), getSessionSpecificGuidanceSection() | Dynamic, memoized |
This is what a harness looks like. Not a static string. A programmatic assembler that reads runtime conditions - what git branch are we on, what tool is the user using, what flags are set, what is the current task - and constructs the appropriate context for this specific inference call.
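The pattern can be sketched in a few lines of Python - this is an analogue of the registry-of-sections idea, not the actual TypeScript implementation, and every section name here is invented for illustration:

```python
# A Python analogue of a section-registry prompt assembler: some
# sections are static, others re-evaluate from runtime conditions.
# All section names and the ctx schema are illustrative.

SECTION_REGISTRY = []

def prompt_section(fn):
    """Register a section; each returns text, or '' to be omitted."""
    SECTION_REGISTRY.append(fn)
    return fn

@prompt_section
def identity(ctx):
    # Static section: identical on every call, safe to cache globally.
    return "You are a coding agent."

@prompt_section
def tools(ctx):
    # Conditional on the active tool set.
    return "Tools: " + ", ".join(sorted(ctx.get("enabled_tools", [])))

@prompt_section
def git_context(ctx):
    # Dynamic: read from the runtime environment, empty outside a repo.
    branch = ctx.get("branch")
    return f"Git branch: {branch}" if branch else ""

def get_system_prompt(ctx):
    """Assemble the prompt from whichever sections apply right now."""
    return "\n\n".join(text for fn in SECTION_REGISTRY if (text := fn(ctx)))
```

Sections that return empty strings simply vanish from the assembled prompt, which is what makes the conditional blocks cheap to add.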
The harness has several responsibilities:
Context assembly: pull from all four sources (memory, state, tools, history), score for relevance, budget against the context limit, and assemble in the right order. Order matters - the model attends more to content near the end of the prompt. Put the most task-critical context closest to the generation point.
Tool interface design: tool descriptions are context. A well-written tool description tells the model what the tool does, when to use it, what inputs are valid, and what constraints apply. A poor tool description produces tool misuse. This is the most underappreciated part of context engineering in enterprise deployments - teams spend weeks on the model and minutes on tool interfaces.
State persistence and retrieval: the harness owns the state store. It decides what to persist after each agent step, what to retrieve at the start of each step, and how to compress history over time.
Context observability: you cannot debug a context you cannot inspect. The harness should log the full assembled context for every LLM call (or a sample thereof) so that when the agent produces wrong output, you can look at exactly what it saw. This is the distributed tracing problem for AI - without it, debugging is guesswork.
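The logging half of that is simple to sketch - the `llm` callable and log destination are placeholders, and hashing the context is one assumed strategy for detecting duplicates without storing every full context:

```python
# Context observability sketch: wrap the LLM call so every assembled
# context is logged with a trace id before inference.

import hashlib, json, logging, time, uuid

log = logging.getLogger("context_trace")

def traced_llm_call(llm, assembled_context, task_id):
    trace_id = str(uuid.uuid4())
    log.info(json.dumps({
        "trace_id": trace_id,
        "task_id": task_id,
        "timestamp": time.time(),
        "context_chars": len(assembled_context),
        # Hash detects identical contexts cheaply; sample full
        # contexts separately for deep debugging.
        "context_sha256": hashlib.sha256(assembled_context.encode()).hexdigest(),
    }))
    response = llm(assembled_context)
    log.info(json.dumps({"trace_id": trace_id,
                         "response_chars": len(response)}))
    return response
```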
MCP: the integration layer that changes the harness
The Model Context Protocol, now governed by the Agentic AI Foundation under the Linux Foundation, has become the universal standard for connecting agents to external tools and data sources. 97M+ monthly SDK downloads. Adopted by Anthropic, OpenAI, Google, Microsoft.
MCP matters for context engineering specifically because it standardizes how tools surface to the harness. Without MCP, every integration is bespoke: custom tool schemas, custom authentication, custom output formats. With MCP, tools expose a standardized interface that the harness can discover, invoke, and consume uniformly.
The harness doesn’t need to know whether a tool is a REST API, a database query, or a local function - it speaks MCP and the tool adapter handles the translation. This separation of concerns is what makes context engineering composable at enterprise scale.
What MCP doesn’t solve is context budgeting. That’s still on the harness. MCP tells you how to invoke tools; it doesn’t tell you how many tool calls to make, which results to keep in context, or how to compress tool output history. Those decisions remain the core engineering problem.
The 2026 emerging pattern: LlamaIndex for data structuring (retrieval, chunking, embeddings) + LangGraph for agent orchestration + MCP for tool integration. This is the stack that appears most consistently across production engineering posts.
What production looks like: evidence from the field
The HN “Get Shit Done” system from March 2026 (473 upvotes, 254 comments) is the clearest publicly documented evidence of context engineering delivering real velocity. The author reported 250K lines of code written in under a month, describing a meta-prompting + context engineering + spec-driven system. The discussion was split:
“With GSD, I was able to write 250K lines of code in less than a month, without prior knowledge of claude.”
And the skeptic:
“I’ve tried it, and I’m not convinced I got measurably better results than just prompting Claude Code directly.”
Both responses are instructive. The system works when you have a clear spec - the “If you know clearly what you want” qualifier in the original post is load-bearing. Context engineering amplifies clarity; it does not manufacture it. If your task spec is vague, a better harness won’t help.
The r/LocalLLaMA “388 Tickets in 6 Weeks” post tells a similar story: concrete task specs, structured context injection, and iterative feedback yielded measurable productivity gains. The pattern is consistent.
On the enterprise side, @Aiszone_ described the practical starting point:
“le premier truc que j’ai fait c’est un CLAUDE.md de 200 lignes avec les regles du projet et le workflow. opus 4.6 tourne h24”
Translation: “the first thing I did was a 200-line CLAUDE.md with the project rules and the workflow. opus 4.6 runs 24/7.” Not a clever prompt. A context artifact.
A Vizuara bootcamp series on context engineering (12K+ views within the last 30 days) reported that Gartner is projecting 40% of enterprise applications will use task-specific AI agents by late 2026. If that number is even half right, context management becomes infrastructure - the kind of thing you build standards around, not the kind of thing each team improvises independently.
The unsolved part
The r/aiagents thread from March 2026 cuts through the optimism:
“Context engineering is the new buzzword. But nobody’s solving the actual hard part.”
The thread’s 27 comments converge on what the hard part actually is: coherent stateful context management across long, multi-step, multi-agent workflows. Not RAG. Not CLAUDE.md files. Not memory tiers.
The problem is this: when an agent runs for 200 steps - spawning sub-agents, making tool calls, handling errors, retrying - the context at step 150 should be an intelligent compression of everything that happened in steps 1-149. Not a dump. Not a truncation. A curated, structured representation of the task state that gives the model at step 150 exactly what it needs and nothing else.
Nobody has a clean solution to this at scale. The Agentic Context Engineering (ACE) paper from arXiv (October 2025, surfaced on HN in March 2026) proposes treating context as an evolving playbook: a structured document that accumulates strategy, reflects on failures, and curates successful patterns - updated after each step by a separate reflection module. The approach prevents collapse (context becoming incoherent over many steps) and enables cross-session learning (the playbook improves across multiple runs of similar tasks).
It’s promising. It’s not production-hardened. And it requires compute overhead - a second LLM call per agent step to update the playbook - that doesn’t fit every budget.
The r/LocalLLaMA community’s answer is more pragmatic: force structure. Instead of letting context evolve organically, define explicit schemas for agent state and enforce them at every step. Structured scratchpad. Typed task state. Explicit transition rules. Messy, but it survives 200 steps in production in a way that free-form context evolution doesn’t.
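A sketch of what “force structure” looks like in practice - typed state plus an explicit transition table, with all phase and field names invented for illustration:

```python
# "Force structure" sketch: typed task state with explicit transition
# rules, instead of free-form context evolution.

from dataclasses import dataclass, field
from enum import Enum

class Phase(Enum):
    PLANNING = "planning"
    EXECUTING = "executing"
    DONE = "done"

# Explicit transition rules: anything else is a bug, not an evolution.
ALLOWED = {
    Phase.PLANNING: {Phase.EXECUTING},
    Phase.EXECUTING: {Phase.EXECUTING, Phase.DONE},
    Phase.DONE: set(),
}

@dataclass
class TaskState:
    goal: str
    phase: Phase = Phase.PLANNING
    scratchpad: list[str] = field(default_factory=list)  # structured notes
    errors: list[str] = field(default_factory=list)

    def transition(self, new_phase: Phase):
        if new_phase not in ALLOWED[self.phase]:
            raise ValueError(f"illegal transition {self.phase} -> {new_phase}")
        self.phase = new_phase
```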
What this means for enterprise architecture
Context pipelines need the same engineering discipline as data pipelines. Schema design. Quality gates. Observability. SLAs on retrieval latency. Data lineage (where did this chunk in the context come from?). Version control for context templates. The teams that are winning in production have built this infrastructure; the teams still struggling are treating context as an afterthought.
Test your context, not just your prompts. A prompt-centric test suite tells you whether the model responds correctly to a given input. A context-centric test suite tells you whether the assembled context for a given task state is correct - whether the right documents were retrieved, whether the history compression preserved the critical facts, whether the tool output is in a format the model can use. These are different tests, and you need both.
Build a context compiler, not a prompt template. The harness that assembles context at runtime is a first-class engineering artifact. It should be version-controlled, tested independently, monitored in production, and owned by a clear team. @Aiszone_’s 200-line CLAUDE.md is the minimum viable version. A production harness for a multi-agent system is a service.
Plan for context rot before you hit it. Set up retrieval quality metrics before you go to production. Monitor context fill rates. Alert when average context relevance scores drop below threshold. By the time you notice context rot in output quality, the problem has existed for a while.
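The relevance alert can be as simple as a rolling average with a floor - window size and threshold below are illustrative, not recommendations:

```python
# Monitoring sketch: rolling average of retrieval relevance scores with
# an alert floor, so context rot surfaces before output quality does.

from collections import deque

class RelevanceMonitor:
    def __init__(self, window=100, alert_below=0.6):
        self.scores = deque(maxlen=window)
        self.alert_below = alert_below

    def observe(self, retrieval_scores):
        """Record one retrieval's scores; return True if alerting."""
        self.scores.extend(retrieval_scores)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet
        avg = sum(self.scores) / len(self.scores)
        return avg < self.alert_below
```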
The maturity arc
Looking at where the field is in April 2026:
Solved (production-ready): Static context design (system prompts, tool definitions), RAG retrieval patterns, short-session episodic memory, MCP-based tool integration, single-agent context management.
Maturing (production-possible, operationally complex): Long-session stateful context, multi-agent context coordination, relevance-scored dynamic retrieval, context observability tooling.
Unsolved (research frontier): Long-horizon coherent context management across 100+ agent steps, efficient cross-agent context sharing without information leakage, automated context quality measurement, context evolution that preserves strategic coherence across many sessions.
The enterprises that treat the “maturing” category as infrastructure problems - not model problems - are the ones building reliably. The enterprises that wait for model capability improvements to paper over context management debt are accumulating it.
Conclusion
Context windows aren’t just a memory limit - they’re a design constraint that shapes how you structure information. (I wrote a shorter version of this point in a micropost in February. I was underselling it.)
The real statement is stronger: context management is the primary engineering discipline of agentic AI systems. More than model selection. More than prompt crafting. More than fine-tuning. If the context is wrong, the model - however capable - reasons on bad inputs and produces bad outputs. If the context is right, you need surprisingly little magic from the prompt.
Karpathy’s framing - context + spec + harness - is the most useful mental model available right now. Context is what the model knows. Spec is what it’s trying to do. Harness is the system that puts them together correctly at runtime. Of the three, harness engineering is the least written about and the most critical for production reliability.
The discipline is young. The tooling is immature. The community is still working out the vocabulary. But the fundamental insight is not going anywhere: the context window is finite, attention is not uniform, and building reliable agentic systems means engineering what goes inside it with the same rigor you’d apply to any critical data system.
Build the harness. Own the context. That’s the job.
References:
- Karpathy, A. (2026, March 21). Context and spec engineering. X (Twitter).
- IBM Technology. (2025, August). Context Engineering vs. Prompt Engineering: Smarter AI with RAG & Agents. YouTube.
- LangChain. (2025, July). Context Engineering for Agents. LangChain Blog.
- Weaviate. Context Engineering - LLM Memory and Retrieval for AI Agents. Weaviate Blog.
- Polak, A. (2026). Context Engineering with Adi Polak. Podcast, circulated via X.
- Chen et al. (2025, October). Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. arXiv.
- Neo4j. Why AI Teams Are Moving From Prompt Engineering to Context Engineering. Neo4j Blog.
- Gartner. Context Engineering: Why it’s Replacing Prompt Engineering for Enterprise AI Success. Gartner.
- Pollak, L. (2026, April 6). Claude Code context assembly. X (Twitter).
- r/aiagents. (2026, March 18). “Context engineering is the new buzzword. But nobody’s solving the actual hard part.”. Reddit.
- Schmid, P. The New Skill in AI is Not Prompting, It’s Context Engineering. philschmid.de.
- LlamaIndex. Context Engineering - What It Is and Techniques to Consider. LlamaIndex Blog.
- arfujeddy14. (2026, April 6). Power users stopped obsessing over prompt templates. X (Twitter).
- Aiszone_. (2026, April 6). CLAUDE.md 200 lignes. X (Twitter).
- Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172. Empirical study demonstrating LLM performance degrades when relevant information is placed in the middle of long contexts - the foundational backing for the context rot curve shape.