TL;DR
  • An LLM has no memory between calls. Its working memory for a single request is one thing: the tokens in the context window.
  • Attention is shared across that working memory, so the more you load into it, the less focus anything in it receives.
  • Reliable behavior comes from managing working memory on purpose: keep what the task needs, clear what it does not.

A language model holds no state between calls. Everything it knows in a given moment is the text in front of it: the system prompt, the conversation so far, the files it has read, the output of its tools. That bundle of tokens, the context window, is the model's entire working memory for that request. There is no other store it can reach.

The obvious lever looks like size. Most models I reach for now offer a million tokens of working memory, and the pitch is to stop curating and let the agent keep everything. I ran agents that way for a while. On easy recall it held: ask for something phrased the way it appears in the log, and the agent finds it deep in the context. Reliability broke down when the answer had to be worked out instead of looked up. The teams who tested this carefully found the same pattern: once you remove the literal word overlap that lets a model shortcut "find the needle," accuracy that was near-perfect on short inputs falls toward a coin flip well inside the advertised window.

So the million-token number describes how much working memory the agent will accept, not how much it can hold in focus at once. Those are different quantities, and reliability depends on the second one (see Fig. 1).

FIG. 1Working memory accepted vs. working memory you can trust
≈ 28k reliable 0 128k 256k 512k 1M WORKING MEMORY THE MODEL ACCEPTS
~28k
depth you can trust · approximate
Share of the 1M window3%
Fig. 1 — Read this as a sketch of the idea. The full bar is the working memory the agent accepts; the model takes every token. The bright zone is how deep it stays reliable, and it narrows sharply when the answer does not share words with the question. Drag from a literal lookup to real reasoning: one window, very different usable depth. The exact line shifts with the model and the task.

The large window still earns its place. It absorbs long histories, keeps a multi-step task from hitting a hard cutoff, and leaves room for tools. It also removes the forcing function: nothing stops you from filling it, so the agent will carry three hundred thousand tokens of accumulated context and keep going right up until its answers quietly degrade and the cause is hard to name. Keeping an agent reliable means keeping its working memory clean, and that starts with knowing what the working memory is made of.

What the working memory is made of

Working memory is counted in tokens, the small pieces text is split into, each one roughly three-quarters of a word. A four-thousand-token request is a few pages; a hundred-and-fifty-thousand-token request is a small book.

From text to tokens · this line = 8 tokens
"Why isn't INotifyPropertyChanged firing?"
Why ·isn't ·INotifyPropertyChanged INotifyPropertyChanged → 4 ·firing ?
· marks a leading space; it rides along as part of the token.
Real tokens from a current tokenizer (o200k, behind GPT-4o and newer), shown as they split inside this sentence. Common words stay whole; one identifier, INotifyPropertyChanged, becomes four tokens. Different models split differently, GPT-4 and Claude each carve this up another way, so treat the pattern as the lesson and expect the exact pieces to vary. Your working memory is counted in tokens, not words.

What turns those tokens into behavior is attention. For each token it generates, the model weighs every token already in the working memory and decides how much each one should shape what comes next. Two facts about that process drive everything else.

The working memory is finite. Past the token limit, the tooling around the model drops or summarizes something to make room; nothing waits outside. The limit is the smaller issue. A fact sitting in the middle of a long working memory loses reliability well before you reach the edge of it.

Attention is shared. It is divided across everything present, so adding tokens lowers the focus on each one (see Fig. 2). A sentence in a four-thousand-token request holds far more of the model's attention than the same sentence inside a hundred and fifty thousand. The words are identical; the attention they command is not.

0.4%
share of a fixed signal
Fig. 2 — Picture it this way: if attention were split evenly, a fixed signal of about 800 tokens would lose its share as the working memory grows. Real models do not divide attention so cleanly, but the direction holds. Drag the marker: at a few thousand tokens the signal keeps about a third of the focus; at 200k the same tokens barely register.

You choose what goes in

If the context window is the agent's working memory, then writing a prompt is deciding what it gets to think with. A weak answer is often a working-memory problem before it is a reasoning problem: the fact that mattered was missing, buried, or outnumbered.

That changes a few defaults. Loading an entire codebase or a long document feels thorough, and it lowers the quality of the answer, because the part you care about now competes with thousands of tokens that have nothing to do with the task. Handing the agent the three files that matter beats handing it thirty.

It also explains the long-session drift from the first section. As an agent runs, tool output and earlier steps accumulate in its working memory. The instructions you gave at the start are still in there, outnumbered, and the agent begins acting on whatever is most recent. The repair is to keep the working memory current: summarize and clear as you go.

A binding bug, two working memories

Say a binding in an Uno Platform view stops updating, and you bring an AI assistant to it. Same model, same question. The only thing that differs is what fills its working memory (see Fig. 3).

FIG. 3Same prompt, two working memories
A · flood it

You paste the whole 2,000-line view, three viewmodels, and the full build log, then ask why the binding will not update. The line that matters, a setter that never raises a change notification, is a thin signal among thousands of unrelated tokens. The model searches for it and answers in generalities.

B · curate it

You paste the bound property, its setter, the one XAML element, and the eight log lines around the error. The working memory is almost all signal, the relevant tokens hold the model's attention, and the answer names the cause: the setter never raises the change notification, so the binding never hears the update.

Fig. 3 — Each row is a chunk of pasted context; the bright rows carry the answer. On the left the signal is one row among seven; on the right it fills most of the working memory.

Curating by hand works. You can also let tooling fill the working memory for you, which is the point of the App MCP in Uno Platform Studio.

Studio shortcut

Skip the copy-paste. The App MCP connects to your running app and hands the agent the live visual tree and the control's real properties and binding values, so it reads the true runtime state instead of guessing from pasted source, then verifies its own fix against the running app. The curation happens for you.

Working memory fills as it runs

Watch a real session and you can see the working memory fill whether you manage it or not (see Fig. 4). A fresh call is mostly headroom: a short system prompt, your instructions, the one file in question, your request. That file is a large share of the few thousand tokens present, so it holds strong attention.

Then the run continues. Every file read, every build log, every test result is appended. Forty turns later the same working memory holds well over a hundred thousand tokens: a long history, stacks of tool output, your original instructions far below where the model is writing, and the file you care about reduced to a sliver. Nothing was deleted. It was outnumbered.

This is the drift from the first section, seen from the inside. The window did not shrink; the useful fraction of it did. The repair follows from the picture: summarize the history, drop tool output you are done with, and keep the current task and the relevant file near the end of the window, where the model is actively writing and attention is strongest. The teams running long agents have names for these moves, compaction, external notes, and handing a slice to a fresh sub-agent, and they share one aim: keep the working memory mostly signal.

FIG. 4One working memory, scrubbed across a session
33%
of working memory is your file
Turnturn 1 / 40
Window size4,000 tok
Reliability on that filereliable
Fig. 4 — Drag from turn 1 to turn 40. Nothing is removed; history and tool output pile up and push your file from about a third of the working memory down to a sliver, and reliability on that file drops well before the token limit. The curve is drawn for illustration; the direction is the point. Clearing the pile keeps the file near where attention is strongest.

One thing to try this week

Here is the move that builds the instinct fastest. Open the longest-running AI chat or agent session you have going right now, and find the one constraint that has to hold for its work to be correct: the target framework, the API it must not touch, the rule you keep retyping. Move that line into the place that gets re-sent on every turn, the system prompt or a project rules file, so it stays in working memory instead of sinking out of it.

That is the whole exercise. One constraint, relocated from the fragile tail of a conversation into context that is always present. Do it once and you start to feel the difference between an agent that holds its instructions and one that loses them.

Same model, same question. Different working memory.