On Context

As Andrej Karpathy has said, if you peer inside a large language model (LLM) you will find a compressed version of the internet. Joining this comprehensive “opening book” of knowledge with deep inference and reasoning is incredibly powerful.

At the same time, there are many situations where LLMs seem to fall short. Sometimes, it’s simply because we’re all figuring out how to best interact with them. In other cases, the LLM does not have the knowledge or expertise that is pertinent to the particular issue we care about.

Regardless, managing an LLM’s working memory - context - is quite important. This is especially true in situations where sustained inquiry is important. In this post I want to share what I feel are the primary approaches for managing context: expand, reflect, curate, delegate, and obviate.

What is Context?

LLMs work by receiving input information (often text), processing it with their neural networks, and producing output information according to their training objectives. This input information is referred to as the model’s context. A simplified view of the situation is as follows:

A horizontal bar divided into three segments: context history (neutral), query (accent), and remaining capacity (dashed).

Three parts to context: prior conversation history, current query, remaining capacity.

Context builds up in a subtle way. In conversations, the LLM is re-invoked each time you send a query - in software terms, the model is “stateless”. The context therefore needs to contain the full history of your previous queries and responses; otherwise the LLM would not know what the heck you have been talking about.
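A minimal sketch of this statelessness, with `call_llm` standing in for a real model API:

```python
# Because the model is stateless, every call must resend the full
# conversation history. `call_llm` is a placeholder for a real model API.

def call_llm(messages):
    # A real implementation would send `messages` to a model endpoint.
    return f"(reply to {len(messages)} messages)"

history = []  # the accumulating context

def send(user_query):
    history.append({"role": "user", "content": user_query})
    reply = call_llm(history)  # the ENTIRE history goes in every time
    history.append({"role": "assistant", "content": reply})
    return reply

send("What is context?")
send("Why does it grow?")
# After two turns, the context already holds four messages.
print(len(history))
```

Each turn appends two messages, so the context grows even when individual queries stay short.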

Our brains work differently! We’ll come back to this point later.

Context contains more than just conversation history. It also holds standard instructions created by the model provider (e.g. OpenAI, Anthropic), special instructions you may have given it in CLAUDE.md / AGENTS.md files, “memory”, and so on.

All of this information is in the context, serving as an LLM’s working memory. Context is a scarce resource, therefore managing it well is critical for deep, sustained work.

Technique: Expand

The most obvious way to deal with context is to make it bigger.

Two context window bars stacked vertically. The content segments are identical; only the remaining capacity is larger in the bottom row.

A smaller thing, on top of a bigger thing.

This isn’t easy. As noted in the famous “Attention Is All You Need” paper, self-attention’s memory and compute grow quadratically with sequence length. Despite this challenge, context size has increased dramatically over the past few years:

Model family    Year    Context (tokens)
GPT-3           2020    ~4K
GPT-4           2023    ~32K
GPT-4o          2024    128K
GPT-5           2025    ~400K
GPT-5.4         2026    ~1M
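The quadratic cost is easy to see: the attention score matrix has one entry per pair of tokens, so doubling the sequence length quadruples the matrix. A rough illustration (assuming 2 bytes per entry, roughly float16, for a single head in a single layer):

```python
# The attention score matrix grows quadratically with sequence length:
# n tokens produce an n x n matrix of pairwise scores.

def score_matrix_bytes(n_tokens, bytes_per_entry=2):  # 2 bytes ~ float16
    return n_tokens * n_tokens * bytes_per_entry

for n in (4_000, 128_000, 1_000_000):
    gib = score_matrix_bytes(n) / 2**30
    print(f"{n:>9} tokens -> {gib:,.1f} GiB per head per layer")

# Doubling the sequence length quadruples the matrix:
assert score_matrix_bytes(8_000) == 4 * score_matrix_bytes(4_000)
```

Real systems use tricks (FlashAttention, sparse or sliding-window attention) to avoid materializing this whole matrix, but the underlying pairwise cost is why long context remains hard.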

Liu et al. note that not all context is created equal: “performance is often highest when relevant information occurs at the beginning or end of the input context”. This is why they named their paper “Lost in the Middle”:

Context is most effective when it's at the beginning or end.

I have noticed this as well.
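One practical response to this finding: when assembling multiple documents into context, place the most relevant ones at the edges rather than burying them in the middle. A hedged sketch, assuming documents arrive pre-scored for relevance:

```python
# Order documents so the most relevant land at the start and end of the
# context, where "Lost in the Middle" finds models attend best.
# Documents are assumed pre-scored upstream (higher = more relevant).

def edge_order(scored_docs):
    """Best doc first, second-best last, least relevant in the middle."""
    ranked = sorted(scored_docs, key=lambda d: d[1], reverse=True)
    front, back = [], []
    for i, doc in enumerate(ranked):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

docs = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
ordered = edge_order(docs)
print([name for name, score in ordered])  # best at the edges
```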

Technique: Reflect

Regardless of how big our context is, we should be thoughtful and reflective about how we use it.

When our queries are clear and succinct, the LLM can produce helpful responses with less back-and-forth. But context contains not just our queries - it also holds the tool calls the LLM makes and the information those tools return. We want this to be clear and succinct too. This has led to debates about which kinds of tools are most suitable, for example MCP servers versus command-line tools.

“Thoughtfulness” applies to the LLM side of the conversation too. More perceptive models can collaborate to arrive at satisfactory outcomes with fewer round trips, using less context. Moreover, modern LLMs frequently produce an inner monologue called “chain of thought” that helps guide next steps. Strong, crisp chain of thought reduces context usage.

Technique: Curate

Notice we’ve now moved beyond thinking about how much context we’re using - we’re considering what goes into context in the first place. I think of this process of deciding what enters the context, and how it is arranged, as curation.

We’ve noted that LLM use often involves tools. Retrieval Augmented Generation (RAG) searches knowledge bases for relevant information and brings it into context when deemed necessary. RAG spiked in popularity when first introduced, but there is now a broader debate about the best way to use tools to add relevant information to context. Memory systems perform a similar function - this site has a bunch of good references. In both cases, careful implementation is important; otherwise they’ll junk up context rather than enrich it.
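A minimal sketch of the RAG pattern under toy assumptions: keyword overlap stands in for real embedding search, and the `knowledge_base` contents are invented for illustration:

```python
# Toy RAG: search a knowledge base and bring only relevant snippets
# into context. Real systems use embedding similarity; simple keyword
# overlap stands in for it here.

knowledge_base = [
    "Context is the model's working memory.",
    "Compaction summarizes old history to free space.",
    "Subagents receive a slice of context and a task.",
]

def retrieve(query, k=2):
    words = set(query.lower().split())
    scored = [(len(words & set(doc.lower().split())), doc)
              for doc in knowledge_base]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def build_context(query):
    snippets = retrieve(query)
    return "\n".join(snippets) + "\n\nQuestion: " + query

ctx = build_context("what does compaction do to old history")
print(ctx)  # only the matching snippet enters the context
```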

If you’ve used Claude Code or Codex, or had a very long conversation with a chatbot, you’ve probably seen “compaction” happen. In principle, though, it could happen whenever we want, compressing, abstracting, and summarizing with the goal of making the context denser and more useful.

Before: a context bar nearly full with the text four score and seven years ago. After: the same bar with a tiny accent box reading 80 years and vast remaining capacity.

Curation creates capacity by compressing chatty context.
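A sketch of on-demand compaction: when the history exceeds a budget, summarize everything but the most recent messages. The `summarize` function here is a trivial stand-in for an LLM summarization call:

```python
# Compaction sketch: when context exceeds a budget, replace the oldest
# messages with a short summary. `summarize` stands in for an LLM call.

def summarize(messages):
    return f"[summary of {len(messages)} earlier messages]"

def compact(history, budget=6, keep_recent=2):
    """If history is over budget, summarize everything but the tail."""
    if len(history) <= budget:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent

history = [f"msg {i}" for i in range(10)]
compacted = compact(history)
print(compacted)  # one summary line plus the two most recent messages
```

Keeping the most recent messages verbatim matters: they are the part of the conversation the model is most likely to need exactly.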

Technique: Delegate

Another approach is to delegate: rather than consuming its own context, the LLM passes work along to somewhere else. The most familiar delegation technique is the subagent. An agent passes the relevant part of its context to another agent along with a subtask to perform, and the subagent passes back only the information relevant to the subtask’s result.

Main agent with a context bar and subtask box; below, a subagent receives a context slice and task, processes them through a Language Model, and returns a result to the main agent.

A subagent receives only a slice of context and a specific task. Only the result travels back — the main agent's full context is never exposed.

Subagent-based delegation is very useful, but it provides limited opportunity for the agent and subagent to coordinate. The agent ships its context to the subagent, the subagent works, it returns a response, and that’s that. If the subagent learns something “along the way” that’s useful for the agent, there’s no way to report that back.
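The fire-and-forget pattern can be sketched in a few lines; `run_subagent` is a placeholder for spinning up a fresh model with its own context:

```python
# Fire-and-forget delegation: the main agent hands a context slice and
# a task to a subagent and gets back only the result. Anything else the
# subagent learned is discarded along with its context.

def run_subagent(context_slice, task):
    # Placeholder for a fresh model invocation with its own context.
    result = f"result of {task!r} over {len(context_slice)} items"
    side_learnings = "something useful the main agent never sees"
    return result  # only the result crosses back; side_learnings is lost

main_context = ["spec", "codebase notes", "test failures", "user goal"]
relevant = main_context[2:]  # just the slice the subtask needs
answer = run_subagent(relevant, "diagnose test failures")
print(answer)
```

The one-way return value is exactly the limitation described above: there is no channel for the subagent’s incidental discoveries.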

So is there another form of delegation that is a bit more flexible? This is where recursive language models (RLMs) come in.

Two rows comparing a standard Language Model and an RLM. From the user and API perspective the interface is identical; the RLM wrapper is an internal implementation detail.

An RLM looks like a standard LLM but manipulates context very differently.

Counterintuitively, the core idea behind RLM is to delegate all of the context. As Alex Zhang says, “Don’t give the context to the model at all. Instead, give the model an environment to write code, and let the context be a variable to manipulate within that environment.” The RLM uses the code environment to probe, search, and delegate, manipulating the context as it goes. This is very different from the “fire and forget” pattern associated with subagents.
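In that spirit, here is a toy version of the idea: the context never enters a prompt; it sits as a variable in an environment, and the model issues small code probes against it. The probe strings below are invented stand-ins for code an RLM would actually write:

```python
# Toy RLM environment: the full context lives as a variable that
# model-written code manipulates, rather than being fed into a prompt.

context = "four score and seven years ago " * 1000  # huge; never prompted

def run_probe(code_env, probe):
    # Stand-in for executing model-written code against the environment.
    return eval(probe, {}, code_env)

env = {"context": context}

# The model peeks, searches, and measures instead of reading it all:
length = run_probe(env, "len(context)")
snippet = run_probe(env, "context[:31]")
hits = run_probe(env, "context.count('seven')")

print(length, repr(snippet), hits)
```

Each probe touches only a sliver of the data, so the model’s own context holds probe results, not the raw material - which is the whole point.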

The closest analogy to an RLM I can think of is Ken Burns (!). As I noted in my last post, Burns assembles huge volumes of source material, patiently distilling it with the help of a team of collaborators, in accordance with an overall vision. It’s impossible for Burns to look at every single photo, read every single “letter back home”, or do every single interview himself. Burns has the overall vision, guides the team, refines the narrative, and has the final say on the end result. As he does this, the editorial script for the final product continuously evolves. Burns is the RLM and the script is the context.

The analogy breaks down because RLMs do lots of things that humans don’t in similar situations. Context is presented to an RLM as a flat text variable in a REPL environment, so the RLM will look at snippets, search for things, and do lots of crazy low-level string stuff that we don’t need to do.

We humans don’t need to do this because we’re not relying on working memory to do this kind of retrieval and assessment. We’re not built the same way…which leads us to our last technique.

Technique: Obviate

Why not just remove the need for context in the first place - obviate it?

As I’ve already mentioned, the closest biological equivalent to context in humans is “working memory”. Even though our working memory is not that impressive, we humans are quite capable of working on the sorts of deep, sustained efforts for which context is vital. The difference is that we seem to be able to integrate new signals and learnings back into our baseline set of knowledge.

In Andrej Karpathy’s metaphor of an LLM OS, the file system plays the role of a longer-term memory system.

Hub-and-spoke diagram with LLM at center, connected to Model Weights (disk), Context Window (RAM, highlighted), Input Tokens, Output Tokens, and Tools and Memory (peripherals).

Karpathy's LLM-as-OS metaphor: context window as RAM, model weights as disk, tools and external memory as peripherals.

Recent efforts build upon RLMs with this paradigm in mind. For example, ThreadWeaver regards context as “RAM”, and a supervisor agent manages subagent “threads”. When a thread completes, it produces a compressed summary that is stored in long-term memory and paged into context when needed. This is analogous to “episodic memory” in humans, which helps us plan and persist. It’s much more than mere compaction.
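A loose sketch of the paging idea - the summaries and the relevance check are simple stand-ins, and ThreadWeaver’s actual mechanism will differ:

```python
# Episodic-memory paging sketch: finished threads leave a compressed
# summary in long-term storage; summaries are paged back into context
# only when relevant to the current task.

long_term_memory = []  # summaries of completed threads

def complete_thread(name, transcript):
    # Stand-in for LLM compression of the thread transcript.
    long_term_memory.append(f"{name}: {transcript[:40]}...")

def page_in(task, limit=2):
    """Pull summaries that share substantive words with the task."""
    words = {w for w in task.lower().split() if len(w) > 3}
    relevant = [s for s in long_term_memory
                if words & set(s.lower().split())]
    return relevant[:limit]

complete_thread("auth-bug", "traced the login failure to an expired token cache")
complete_thread("perf", "profiling showed the hot loop was string concatenation")

context_additions = page_in("fix the login failure")
print(context_additions)  # only the auth-bug episode is paged in
```

Unlike compaction, nothing is thrown away: the perf episode stays in long-term memory, ready to be paged in when a relevant task arrives.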

Another approach, continuous learning, directly addresses the fact that LLMs have a fixed training regime. Continuous learning seeks to integrate new knowledge directly back into the LLM itself. There are several proposed approaches for continuous learning, addressing different stages of the LLM training and tuning lifecycle, as surveyed by Wu et al. Continuous learning is a rich and active area that is beyond the scope of this survey, but I anticipate significant progress within 2026.

As continuous learning develops, context will become richer and more useful. The LLM’s “opening book” of knowledge and reasoning will become better and better, and what is left to context will be even more distinctive to the task at hand.

What would it look like if continuous learning really worked? LLMs would become more specialized and personalized. The one you work with and the one I work with would respond differently; feel different.

Their responses would sometimes seem contradictory. That’s nothing new; we’re like that too. The philosopher Henri Bergson, a friend of Marcel Proust, called this the “personal element”:

I smell a rose and immediately confused recollections of childhood come back to my memory. In truth, these recollections have not been called up by the perfume of the rose: I breathe them in with the very scent; it means all that to me. To others it will smell differently.—It is always the same scent, you will say, but associated with different ideas.—I am quite willing that you should express yourself in this way; but do not forget that you have first removed the personal element from the different impressions which the rose makes on each one of us; you have retained only the objective aspect, that part of the scent of the rose which is public property and thereby belongs to space.

Interestingly, when we meditate on memory deeply enough we arrive somewhere intensely personal.


Note: I (Nathan) wrote this post in its entirety, but I consulted extensively with Claude Opus 4.6 during the editing process. Thanks to Claude for many helpful suggestions.