Why AI Forgets What You Told It (and How to Manage LLM Context)
You're using an AI-powered product. The first few outputs are sharp. Specific. They nail the tone, reference the right details, get the nuance right.
Then you keep going. Ten exchanges in, the quality starts to drift. By twenty, it feels like the AI forgot what you told it five minutes ago. The documents get generic. The suggestions get vague. It's like talking to someone who's slowly losing the thread.
You're not imagining it. There's a well-documented reason this happens, and understanding it changes how you think about building products that work with large language models.
What "context" actually means
Every time you interact with an AI, the model doesn't actually remember your conversation the way a person does. It doesn't have persistent memory between exchanges. Instead, the entire conversation history gets sent back to the model as input every time it generates a response. That input is called the context window.
Think of it like a desk. Every time the AI needs to respond, everything relevant gets spread out on the desk: the system instructions, the conversation so far, any reference material, the current request. The model reads it all, generates a response, then the desk gets cleared. Next turn, everything goes back on the desk, plus the new exchange.
The desk has a fixed size. That's the context window limit. Depending on the model, it might be 8,000 tokens, 128,000 tokens, or even a million. A token is roughly three-quarters of a word, so 128,000 tokens works out to around 96,000 words, on the order of a full-length book.
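The "desk gets cleared and re-set" pattern can be made concrete in a few lines. This is a minimal sketch, assuming a hypothetical `call_model()` function; real chat APIs follow the same shape, with the full history resent on every turn.

```python
# Sketch of the stateless request loop. call_model() is a stand-in
# for a real LLM API call; only the loop structure matters here.

def call_model(messages):
    # Placeholder: echoes how many messages it was sent.
    return f"(reply based on {len(messages)} messages)"

history = [{"role": "system", "content": "You are a drafting assistant."}]

def send(user_text):
    history.append({"role": "user", "content": user_text})
    # The ENTIRE history goes back on the desk, every single turn.
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply

send("Draft a welcome email.")
send("Make it more formal.")
# The model "remembers" the first request only because we resend it.
```

Nothing persists on the model's side between calls; continuity is entirely a product of what the caller chooses to resend.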
Sounds like a lot. Here's where it gets interesting.
The "lost in the middle" problem
Researchers at Stanford and UC Berkeley published a finding that changed how the industry thinks about context. They found that LLMs don't treat all parts of their context window equally. Information at the beginning and end of the context gets the most attention. Information in the middle gets partially ignored.
This isn't a bug in one particular model. It's a pattern across most large language models. The effect is sometimes called the "lost in the middle" phenomenon, and it has real consequences for any product built on top of these models.
Here's what it looks like in practice. Imagine your AI product has system instructions at the top of the context ("you are a document drafting assistant, match this tone, follow these guidelines"). Then there are 15 back-and-forth exchanges. Then the user's latest request at the bottom.
The model pays strong attention to the system instructions at the top. It pays strong attention to the latest request at the bottom. But those early exchanges in the middle, the ones where the user established preferences, corrected the tone, provided specific details about their situation? Those get less weight. The signal degrades.
That's why the quality of your output can feel like it deteriorates the longer you use the product in a single session. It's not that the AI is getting tired or lazy. It's that the information it needs is getting pushed into the zone where it pays the least attention.
Why longer threads produce worse results
When a conversation is short, everything fits comfortably in the context window and the model can attend to all of it. But as the thread grows, several things happen at once:
The middle gets crowded. Early context (your initial instructions, the first few exchanges where the user established what they wanted) drifts into the middle of the window. That's exactly where attention is weakest.
Signal gets diluted. Every new exchange adds to the context. Routine exchanges ("thank you," "got it") take up space and push higher-value context further into the middle.
The window eventually fills up. When the conversation exceeds the context window, something has to give. Older messages get truncated or dropped entirely. If those older messages contained critical instructions or preferences, they're gone.
The early exchanges where the system learned tone, context, and specifics are exactly the ones that lose influence as the thread grows.
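The truncation problem above can be sketched directly. This is an illustration of the naive "drop the oldest first" strategy, with a made-up token budget and a crude word-count tokenizer standing in for a real one.

```python
# Naive truncation: when the window fills, the oldest messages are
# dropped first -- exactly the ones that established preferences.

def count_tokens(text):
    # Crude stand-in: ~1 token per word. Real systems use a tokenizer.
    return len(text.split())

def truncate_to_fit(messages, budget):
    kept = list(messages)
    while sum(count_tokens(m["content"]) for m in kept) > budget:
        kept.pop(0)  # drop the oldest message first
    return kept

messages = [
    {"role": "user", "content": "Always use formal tone with clients"},
    {"role": "assistant", "content": "Understood, formal tone it is"},
    {"role": "user", "content": "Draft the quarterly status update for the client"},
]

fitted = truncate_to_fit(messages, budget=12)
# The tone instruction is the first thing to go.
```

With this strategy, the instruction that matters most is also the one that was stated earliest, so it is the first casualty.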
It starts with modeling, not context management
Here's the thing most teams get wrong. They treat context management as an optimization problem. "Our outputs are degrading, how do we manage the context better?" That's the wrong starting point. By the time you're debugging context, you've already missed the more fundamental question: what does the model of this problem actually look like?
We mean "model" in the design sense. Before you write a line of code, before you think about context windows at all, you need to have a clear picture of the entities in your system and their relationships. What does a user look like? What does a conversation look like? What are the distinct pieces of information the AI needs to do its job? What's the relationship between them?
A document-drafting tool, for example, needs to know about the user's role, their company's tone guidelines, the recipient's history, the specific situation being addressed. Those are discrete, modelable things. They have structure. They can be stored, retrieved, and passed deliberately.
The alternative, which is what most AI products default to, is dump and pray. The entire chat history gets sent to the model and you hope it figures out which parts matter. That's not a context strategy. It's the absence of one. And it works fine for the first few exchanges, because when the context is small, everything gets attention. It only falls apart later, at exactly the point where the user has invested enough to care about quality.
Good modeling up front means you know what the AI needs to know, when it needs to know it, and where that information lives. Context management becomes straightforward because you've already decided what the pieces are. You're not trying to extract signal from noise. You designed the signal from the start.
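What "a deliberate model of the problem" might look like for the drafting example can be sketched with plain data structures. The entity names and fields below are illustrative, not a prescribed schema.

```python
# A minimal sketch of deliberate modeling for a drafting tool.
# Each piece of information the AI needs has an explicit home.

from dataclasses import dataclass, field

@dataclass
class UserProfile:
    role: str                # e.g. "IT support lead"
    tone_guidelines: str     # the company voice the output must match

@dataclass
class Recipient:
    name: str
    history_summary: str     # prior interactions, compressed

@dataclass
class DraftRequest:
    user: UserProfile
    recipient: Recipient
    points_to_address: list[str] = field(default_factory=list)

# These can be stored, retrieved, and passed deliberately,
# instead of being fished out of raw chat history.
request = DraftRequest(
    user=UserProfile(role="IT support lead",
                     tone_guidelines="formal, no first names"),
    recipient=Recipient(name="Acme Corp",
                        history_summary="two open tickets, escalated once"),
    points_to_address=["ticket status", "next steps"],
)
```

Once the entities exist, "what does the AI need to know right now" becomes a question about which fields to pass, not which chat messages to keep.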
Separate the conversation from the work
Once you've modeled your problem, a powerful pattern emerges: the conversation and the work don't have to happen in the same context.
Consider an AI product that helps someone draft documents. The natural assumption is that the document gets written inside the chat thread. The user chats, the AI responds, and somewhere in that back-and-forth, a document gets generated. But that means the document is being written in a context polluted by every tangent, every correction that's already been addressed, every routine exchange that added no new information.
A better architecture separates these concerns. The chat thread is for conversation: understanding what the user wants, refining preferences, going back and forth. But when it's time to actually generate the document, that happens in a tool call with its own curated context. The tool gets passed exactly what it needs: the user's tone preferences, the recipient details, the specific points to address, the relevant history. Nothing more.
The chat can grow freely. Twenty exchanges, fifty exchanges, it doesn't matter. The quality of the document doesn't degrade because it was never generated from the chat context. It was generated from a clean, purpose-built context that contains only what the model needs to do that specific job.
This is the difference between an AI product that "uses" a model and one that's been designed around how models actually work. The model doesn't have to hunt through a sprawling conversation to find the relevant details. They've been explicitly passed to it, in the right structure, in the right position.
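The separation can be sketched as follows. `generate_document()` is a hypothetical stand-in for a real tool call; the point is that its context is assembled from modeled state, not from the transcript.

```python
# Sketch of separating conversation from work: the chat grows freely,
# but document generation runs against a curated context.

def generate_document(tone, recipient, points, relevant_history):
    # In a real system this would be a fresh LLM call whose prompt
    # contains ONLY these fields -- not the chat transcript.
    context = {
        "tone": tone,
        "recipient": recipient,
        "points": points,
        "history": relevant_history,
    }
    return context  # returned here so the sketch is inspectable

chat_thread = [f"exchange {i}" for i in range(50)]  # length is irrelevant

doc_context = generate_document(
    tone="formal",
    recipient="Acme Corp",
    points=["ticket status", "next steps"],
    relevant_history="client prefers written summaries",
)
# Four curated fields, no matter how long the chat thread has grown.
```

The chat thread and the document context never touch, so thread length stops being a quality risk for the output that matters.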
The tactical toolkit
With the right model and architecture in place, these techniques keep everything sharp at the implementation level.
Keep the context window curated, not bloated
Not every exchange belongs in the context. A routine acknowledgment, a "thank you," a repeated request with no new information: these add noise without adding signal. Smart context management means deciding what stays and what gets summarized or dropped.
Think of it like editing. The context window is your final draft, not your complete browsing history. Every token should earn its place.
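A minimal curation pass might look like the sketch below. The low-signal patterns are illustrative; a real system might use a cheap classifier rather than string matching.

```python
# Sketch of context curation: drop low-signal turns before resending.

LOW_SIGNAL = {"thanks", "thank you", "got it", "ok", "okay"}

def curate(messages):
    kept = []
    for m in messages:
        text = m["content"].strip().lower().rstrip("!.")
        if m["role"] == "user" and text in LOW_SIGNAL:
            continue  # adds no new information; don't resend it
        kept.append(m)
    return kept

history = [
    {"role": "user", "content": "Use formal tone with clients."},
    {"role": "assistant", "content": "Noted."},
    {"role": "user", "content": "Thanks!"},
    {"role": "user", "content": "Draft the status update."},
]

lean = curate(history)  # "Thanks!" no longer spends tokens
```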
Pin critical information where the model pays attention
If certain instructions or user preferences are essential to output quality, they shouldn't be left to drift into the middle of a growing conversation. They should be pinned to positions where the model attends most strongly: the very beginning (system prompt) or restated near the end (close to the current request).
This is sometimes called context anchoring. It's the single most effective technique for maintaining quality over long sessions.
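Anchoring comes down to prompt layout. This sketch pins critical rules at the very start and restates them just before the current request; the exact layout is an illustrative choice.

```python
# Sketch of context anchoring: keep critical instructions out of
# the weak middle zone by pinning them at both high-attention ends.

def build_prompt(system_rules, middle_history, current_request):
    parts = [system_rules]                     # beginning: strong attention
    parts.extend(middle_history)               # middle: weakest attention
    parts.append(f"Reminder: {system_rules}")  # restated near the end
    parts.append(current_request)              # end: strong attention
    return "\n".join(parts)

prompt = build_prompt(
    system_rules="Formal tone. Never use first names with clients.",
    middle_history=[f"exchange {i}" for i in range(20)],
    current_request="Draft the follow-up email.",
)
```

However long the middle grows, the rules sit in both positions where attention is strongest.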
Summarize, don't accumulate
Instead of feeding the model the entire raw conversation history, periodically compress earlier exchanges into a summary. "The user prefers formal tone, works in IT support, has corrected us twice on using first names with clients." A tight summary occupying 200 tokens can carry more useful signal than 2,000 tokens of raw back-and-forth.
The trade-off is that summaries lose nuance. A well-designed system does both: keeps the most recent exchanges in full detail while summarizing older ones.
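That "both" strategy can be sketched as a rolling compression. `summarize()` is a placeholder; in practice it would be an LLM call producing a summary like the tone example above.

```python
# Sketch of summarize-don't-accumulate: older turns collapse into
# one summary message; the most recent turns stay verbatim.

def summarize(messages):
    # Placeholder: a real system would ask a model to compress these.
    return {"role": "system",
            "content": f"Summary of {len(messages)} earlier messages"}

def compress_history(messages, keep_recent=4):
    if len(messages) <= keep_recent:
        return list(messages)
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older)] + recent

history = [{"role": "user", "content": f"msg {i}"} for i in range(12)]
compact = compress_history(history, keep_recent=4)
# 12 messages become 5: one summary plus the last 4 verbatim.
```

The `keep_recent` threshold is the tuning knob for the nuance-versus-tokens trade-off.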
Use retrieval to complement, not replace, embedded context
For products that need to reference large amounts of information (documentation, past interactions, company knowledge), cramming everything into the context window is the wrong approach. Retrieval-augmented generation (RAG) pulls in only the specific information relevant to the current request.
Instead of giving the model your entire knowledge base, you give it the three paragraphs that actually matter right now. This keeps the context window lean and the signal-to-noise ratio high.
RAG isn't an either/or choice with embedded context. The best systems use both. Core instructions and user preferences stay embedded in the context (anchored where the model pays attention). Reference material, historical data, and domain knowledge get pulled in via retrieval on demand. Think of embedded context as the model's working memory and RAG as the filing cabinet it can reach into when it needs something specific.
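The working-memory-plus-filing-cabinet split can be sketched as below. The "retriever" here is naive keyword overlap, standing in for embeddings and a vector store; the knowledge-base contents are made up for illustration.

```python
# Sketch of retrieval complementing embedded context: the embedded
# core is always present; reference material is pulled on demand.

KNOWLEDGE_BASE = [
    "Refund policy: refunds within 30 days of purchase.",
    "Escalation policy: page on-call after two failed responses.",
    "Tone guide archive from 2019 (superseded).",
]

def retrieve(query, corpus, top_k=1):
    # Toy retriever: score each passage by word overlap with the query.
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:top_k]

def assemble_context(embedded_core, query):
    return embedded_core + retrieve(query, KNOWLEDGE_BASE) + [query]

context = assemble_context(
    embedded_core=["You are a support assistant. Formal tone."],
    query="What is the refund policy for this customer?",
)
# Context: core instructions + the one relevant passage + the request.
```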
Design thread boundaries intentionally
Sometimes the best context management strategy is knowing when to start fresh. If a user's thread has grown to the point where context degradation is noticeable, the product can prompt a new session while carrying forward a summary of what was established.
This is a product design decision, not just a technical one. "Start a new thread" can feel abrupt. "Here's a fresh session with your preferences carried forward" feels thoughtful.
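The rollover mechanic itself is simple; the design work is in the threshold and the framing. A sketch, with an illustrative turn-count threshold and summary format:

```python
# Sketch of an intentional thread boundary: past a threshold, start
# a fresh session seeded with what was established.

MAX_TURNS = 30  # illustrative threshold for "degradation likely"

def maybe_rollover(thread, established_preferences):
    if len(thread) < MAX_TURNS:
        return thread  # still fine, keep going
    carryover = {
        "role": "system",
        "content": "Carried forward: " + "; ".join(established_preferences),
    }
    return [carryover]  # fresh session, preferences preserved

old_thread = [{"role": "user", "content": f"turn {i}"} for i in range(35)]
fresh = maybe_rollover(old_thread, ["formal tone", "no first names"])
# 35 turns collapse into one seed message for the new session.
```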
The metrics that matter
If you're building (or evaluating) an AI product, here are the signals that reveal context-related quality issues:
- Output quality over session length. Compare the quality of the first response in a thread to the tenth, the twentieth. If there's a consistent drop-off, context management needs work.
- Instruction adherence rate. Is the model still following the system instructions after 15 exchanges? After 30? Track how often the output drifts from the defined behavior.
- Token utilization. How much of the context window is being used? How much of it is useful signal versus noise? A context window at 90% capacity filled with raw chat history is a warning sign.
- User corrections per session. If users are re-explaining preferences or correcting tone more frequently later in a session, the context architecture isn't holding.
The landscape is improving, but architecture still matters
Context windows are getting larger. Models are getting better at attending to long contexts. But bigger windows aren't a substitute for good context management, the same way a bigger hard drive doesn't fix a disorganized file system.
A product that intelligently manages what goes into the context window will outperform one that blindly stuffs everything in, regardless of window size. The models reward focus. Give them a clean, well-organized context and they perform. Give them everything you have and hope for the best, and you get the drift you've probably already noticed.
The checklist
Here's what should be in place for any AI product that involves multi-turn conversations:
- A deliberate model of the problem. Before anyone touches context windows, the entities and their relationships are mapped out. What does the AI need to know? Where does each piece of information live? This is design work, not engineering work.
- Separation of conversation and work. High-stakes outputs (documents, reports, recommendations) are generated in their own curated context, not inside the chat thread. The chat is for understanding. The tool call is for doing.
- Critical instruction anchoring. System prompts and key user preferences are positioned where the model attends most strongly, and restated when the context grows.
- Conversation summarization. Older exchanges are compressed into summaries rather than sent verbatim as the thread grows.
- Retrieval architecture. Reference material is pulled in on demand via RAG, complementing (not replacing) the core context that stays embedded.
- Thread length monitoring. The product tracks output quality over session length and has a strategy for when threads get long (summarization, fresh sessions, or both).
- Testing across session lengths. QA includes testing at exchange 1, exchange 10, and exchange 25+. Most teams only test short interactions.
These aren't features your users see. They're design and architecture decisions that determine whether your AI product stays sharp on exchange 20 or starts forgetting what it was told on exchange three.
The answer isn't a better model. It's better modeling of the problem, and better architecture around the model you already have.