Why AI Forgets What You Told It (and How to Manage LLM Context)
You're using an AI-powered product. The first few outputs are sharp. Specific. They nail the tone, reference the right details, get the nuance right.
Then you keep going. Ten exchanges in, the quality starts to drift. By twenty, it feels like the AI forgot what you told it five minutes ago. The documents get generic. The suggestions get vague. It's like talking to someone who's slowly losing the thread.
You're not imagining it. There's a well-documented reason this happens, and understanding it changes how you think about building products that work with large language models.
What "context" actually means
Every time you interact with an AI, the model doesn't actually remember your conversation the way a person does. It doesn't have persistent memory between exchanges. Instead, the entire conversation history gets sent back to the model as input every time it generates a response. That input is called the context window.
Think of it like a desk. Every time the AI needs to respond, everything relevant gets spread out on the desk: the system instructions, the conversation so far, any reference material, the current request. The model reads it all, generates a response, then the desk gets cleared. Next turn, everything goes back on the desk, plus the new exchange.
The desk has a fixed size. That's the context window limit. Depending on the model, it might be 8,000 tokens, 128,000 tokens, or even a million. A token is roughly three-quarters of a word, so 128,000 tokens works out to around 96,000 words, on the order of a full-length book.
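The "desk gets cleared and re-set" pattern can be made concrete in a few lines. This is a minimal sketch, assuming a hypothetical `call_model()` function; real chat APIs follow the same shape, with the full history resent on every turn.

```python
# Sketch of the stateless request loop. call_model() is a stand-in
# for a real LLM API call; only the loop structure matters here.

def call_model(messages):
    # Placeholder: echoes how many messages it was sent.
    return f"(reply based on {len(messages)} messages)"

history = [{"role": "system", "content": "You are a drafting assistant."}]

def send(user_text):
    history.append({"role": "user", "content": user_text})
    # The ENTIRE history goes back on the desk, every single turn.
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply

send("Draft a welcome email.")
send("Make it more formal.")
# The model "remembers" the first request only because we resend it.
```

Nothing persists on the model's side between calls; continuity is entirely a product of what the caller chooses to resend.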
Sounds like a lot. Here's where it gets interesting.
The "lost in the middle" problem
Researchers at Stanford and UC Berkeley published a finding that changed how the industry thinks about context. They found that LLMs don't treat all parts of their context window equally. Information at the beginning and end of the context gets the most attention. Information in the middle gets partially ignored.
This isn't a bug in one particular model. It's a pattern across most large language models. The effect is sometimes called the "lost in the middle" phenomenon, and it has real consequences for any product built on top of these models.
Here's what it looks like in practice. Imagine your AI product has system instructions at the top of the context ("you are a document drafting assistant, match this tone, follow these guidelines"). Then there are 15 back-and-forth exchanges. Then the user's latest request at the bottom.
The model pays strong attention to the system instructions at the top. It pays strong attention to the latest request at the bottom. But those early exchanges in the middle, the ones where the user established preferences, corrected the tone, provided specific details about their situation? Those get less weight. The signal degrades.
That's why the quality of your output can feel like it deteriorates the longer you use the product in a single session. It's not that the AI is getting tired or lazy. It's that the information it needs is getting pushed into the zone where it pays the least attention.
Why longer threads produce worse results
When a conversation is short, everything fits comfortably in the context window and the model can attend to all of it. But as the thread grows, several things happen at once:
The middle gets crowded. Early context (your initial instructions, the first few exchanges where the user established what they wanted) drifts into the middle of the window. That's exactly where attention is weakest.
Signal gets diluted. Every new exchange adds to the context. Routine exchanges ("thank you," "got it") take up space and push higher-value context further into the middle.
The window eventually fills up. When the conversation exceeds the context window, something has to give. Older messages get truncated or dropped entirely. If those older messages contained critical instructions or preferences, they're gone.
The early exchanges where the system learned tone, context, and specifics are exactly the ones that lose influence as the thread grows.
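The truncation problem above can be sketched directly. This is an illustration of the naive "drop the oldest first" strategy, with a made-up token budget and a crude word-count tokenizer standing in for a real one.

```python
# Naive truncation: when the window fills, the oldest messages are
# dropped first -- exactly the ones that established preferences.

def count_tokens(text):
    # Crude stand-in: ~1 token per word. Real systems use a tokenizer.
    return len(text.split())

def truncate_to_fit(messages, budget):
    kept = list(messages)
    while sum(count_tokens(m["content"]) for m in kept) > budget:
        kept.pop(0)  # drop the oldest message first
    return kept

messages = [
    {"role": "user", "content": "Always use formal tone with clients"},
    {"role": "assistant", "content": "Understood, formal tone it is"},
    {"role": "user", "content": "Draft the quarterly status update for the client"},
]

fitted = truncate_to_fit(messages, budget=12)
# The tone instruction is the first thing to go.
```

With this strategy, the instruction that matters most is also the one that was stated earliest, so it is the first casualty.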
It starts with modeling, not context management
Here's the thing most teams get wrong. They treat context management as an optimization problem. "Our outputs are degrading, how do we manage the context better?" That's the wrong starting point. By the time you're debugging context, you've already missed the more fundamental question: what does the model of this problem actually look like?
We mean "model" in the design sense. Before you write a line of code, before you think about context windows at all, you need to have a clear picture of the entities in your system and their relationships. What does a user look like? What does a conversation look like? What are the distinct pieces of information the AI needs to do its job? What's the relationship between them?
A document-drafting tool, for example, needs to know about the user's role, their company's tone guidelines, the recipient's history, the specific situation being addressed. Those are discrete, modelable things. They have structure. They can be stored, retrieved, and passed deliberately.
The alternative, which is what most AI products default to, is dump and pray. The entire chat history gets sent to the model and you hope it figures out which parts matter. That's not a context strategy. It's the absence of one. And it works fine for the first few exchanges, because when the context is small, everything gets attention. It only falls apart later, at exactly the point where the user has invested enough to care about quality.
Good modeling up front means you know what the AI needs to know, when it needs to know it, and where that information lives. Context management becomes straightforward because you've already decided what the pieces are. You're not trying to extract signal from noise. You designed the signal from the start.
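What "a deliberate model of the problem" might look like for the drafting example can be sketched with plain data structures. The entity names and fields below are illustrative, not a prescribed schema.

```python
# A minimal sketch of deliberate modeling for a drafting tool.
# Each piece of information the AI needs has an explicit home.

from dataclasses import dataclass, field

@dataclass
class UserProfile:
    role: str                # e.g. "IT support lead"
    tone_guidelines: str     # the company voice the output must match

@dataclass
class Recipient:
    name: str
    history_summary: str     # prior interactions, compressed

@dataclass
class DraftRequest:
    user: UserProfile
    recipient: Recipient
    points_to_address: list[str] = field(default_factory=list)

# These can be stored, retrieved, and passed deliberately,
# instead of being fished out of raw chat history.
request = DraftRequest(
    user=UserProfile(role="IT support lead",
                     tone_guidelines="formal, no first names"),
    recipient=Recipient(name="Acme Corp",
                        history_summary="two open tickets, escalated once"),
    points_to_address=["ticket status", "next steps"],
)
```

Once the entities exist, "what does the AI need to know right now" becomes a question about which fields to pass, not which chat messages to keep.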
Separate the conversation from the work
Once you've modeled your problem, a powerful pattern emerges: the conversation and the work don't have to happen in the same context.
Consider an AI product that helps someone draft documents. The natural assumption is that the document gets written inside the chat thread. The user chats, the AI responds, and somewhere in that back-and-forth, a document gets generated. But that means the document is being written in a context polluted by every tangent, every correction that's already been addressed, every routine exchange that added no new information.
A better architecture separates these concerns. The chat thread is for conversation: understanding what the user wants, refining preferences, going back and forth. But when it's time to actually generate the document, that happens in a tool call with its own curated context. The tool gets passed exactly what it needs: the user's tone preferences, the recipient details, the specific points to address, the relevant history. Nothing more.
The chat can grow freely. Twenty exchanges, fifty exchanges, it doesn't matter. The quality of the document doesn't degrade because it was never generated from the chat context. It was generated from a clean, purpose-built context that contains only what the model needs to do that specific job.
This is the difference between an AI product that "uses" a model and one that's been designed around how models actually work. The model doesn't have to hunt through a sprawling conversation to find the relevant details. They've been explicitly passed to it, in the right structure, in the right position.
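The separation can be sketched as follows. `generate_document()` is a hypothetical stand-in for a real tool call; the point is that its context is assembled from modeled state, not from the transcript.

```python
# Sketch of separating conversation from work: the chat grows freely,
# but document generation runs against a curated context.

def generate_document(tone, recipient, points, relevant_history):
    # In a real system this would be a fresh LLM call whose prompt
    # contains ONLY these fields -- not the chat transcript.
    context = {
        "tone": tone,
        "recipient": recipient,
        "points": points,
        "history": relevant_history,
    }
    return context  # returned here so the sketch is inspectable

chat_thread = [f"exchange {i}" for i in range(50)]  # length is irrelevant

doc_context = generate_document(
    tone="formal",
    recipient="Acme Corp",
    points=["ticket status", "next steps"],
    relevant_history="client prefers written summaries",
)
# Four curated fields, no matter how long the chat thread has grown.
```

The chat thread and the document context never touch, so thread length stops being a quality risk for the output that matters.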
The tactical toolkit
With the right model and architecture in place, these techniques keep everything sharp at the implementation level.
Keep the context window curated, not bloated
Not every exchange belongs in the context. A routine acknowledgment, a "thank you," a repeated request with no new information: these add noise without adding signal. Smart context management means deciding what stays and what gets summarized or dropped.
Think of it like editing. The context window is your final draft, not your complete browsing history. Every token should earn its place.
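A minimal curation pass might look like the sketch below. The low-signal patterns are illustrative; a real system might use a cheap classifier rather than string matching.

```python
# Sketch of context curation: drop low-signal turns before resending.

LOW_SIGNAL = {"thanks", "thank you", "got it", "ok", "okay"}

def curate(messages):
    kept = []
    for m in messages:
        text = m["content"].strip().lower().rstrip("!.")
        if m["role"] == "user" and text in LOW_SIGNAL:
            continue  # adds no new information; don't resend it
        kept.append(m)
    return kept

history = [
    {"role": "user", "content": "Use formal tone with clients."},
    {"role": "assistant", "content": "Noted."},
    {"role": "user", "content": "Thanks!"},
    {"role": "user", "content": "Draft the status update."},
]

lean = curate(history)  # "Thanks!" no longer spends tokens
```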
Pin critical information where the model pays attention
If certain instructions or user preferences are essential to output quality, they shouldn't be left to drift into the middle of a growing conversation. They should be pinned to positions where the model attends most strongly: the very beginning (system prompt) or restated near the end (close to the current request).
This is sometimes called context anchoring. It's the single most effective technique for maintaining quality over long sessions.
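Anchoring comes down to prompt layout. This sketch pins critical rules at the very start and restates them just before the current request; the exact layout is an illustrative choice.

```python
# Sketch of context anchoring: keep critical instructions out of
# the weak middle zone by pinning them at both high-attention ends.

def build_prompt(system_rules, middle_history, current_request):
    parts = [system_rules]                     # beginning: strong attention
    parts.extend(middle_history)               # middle: weakest attention
    parts.append(f"Reminder: {system_rules}")  # restated near the end
    parts.append(current_request)              # end: strong attention
    return "\n".join(parts)

prompt = build_prompt(
    system_rules="Formal tone. Never use first names with clients.",
    middle_history=[f"exchange {i}" for i in range(20)],
    current_request="Draft the follow-up email.",
)
```

However long the middle grows, the rules sit in both positions where attention is strongest.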
Summarize, don't accumulate
Instead of feeding the model the entire raw conversation history, periodically compress earlier exchanges into a summary. "The user prefers formal tone, works in IT support, has corrected us twice on using first names with clients." A tight summary occupying 200 tokens can carry more useful signal than 2,000 tokens of raw back-and-forth.
The trade-off is that summaries lose nuance. A well-designed system does both: keeps the most recent exchanges in full detail while summarizing older ones.
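That "both" strategy can be sketched as a rolling compression. `summarize()` is a placeholder; in practice it would be an LLM call producing a summary like the tone example above.

```python
# Sketch of summarize-don't-accumulate: older turns collapse into
# one summary message; the most recent turns stay verbatim.

def summarize(messages):
    # Placeholder: a real system would ask a model to compress these.
    return {"role": "system",
            "content": f"Summary of {len(messages)} earlier messages"}

def compress_history(messages, keep_recent=4):
    if len(messages) <= keep_recent:
        return list(messages)
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older)] + recent

history = [{"role": "user", "content": f"msg {i}"} for i in range(12)]
compact = compress_history(history, keep_recent=4)
# 12 messages become 5: one summary plus the last 4 verbatim.
```

The `keep_recent` threshold is the tuning knob for the nuance-versus-tokens trade-off.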
Use retrieval to complement, not replace, embedded context
For products that need to reference large amounts of information (documentation, past interactions, company knowledge), cramming everything into the context window is the wrong approach. Retrieval-augmented generation (RAG) pulls in only the specific information relevant to the current request.
Instead of giving the model your entire knowledge base, you give it the three paragraphs that actually matter right now. This keeps the context window lean and the signal-to-noise ratio high.
RAG isn't an either/or choice with embedded context. The best systems use both. Core instructions and user preferences stay embedded in the context (anchored where the model pays attention). Reference material, historical data, and domain knowledge get pulled in via retrieval on demand. Think of embedded context as the model's working memory and RAG as the filing cabinet it can reach into when it needs something specific.
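The working-memory-plus-filing-cabinet split can be sketched as below. The "retriever" here is naive keyword overlap, standing in for embeddings and a vector store; the knowledge-base contents are made up for illustration.

```python
# Sketch of retrieval complementing embedded context: the embedded
# core is always present; reference material is pulled on demand.

KNOWLEDGE_BASE = [
    "Refund policy: refunds within 30 days of purchase.",
    "Escalation policy: page on-call after two failed responses.",
    "Tone guide archive from 2019 (superseded).",
]

def retrieve(query, corpus, top_k=1):
    # Toy retriever: score each passage by word overlap with the query.
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:top_k]

def assemble_context(embedded_core, query):
    return embedded_core + retrieve(query, KNOWLEDGE_BASE) + [query]

context = assemble_context(
    embedded_core=["You are a support assistant. Formal tone."],
    query="What is the refund policy for this customer?",
)
# Context: core instructions + the one relevant passage + the request.
```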
Design thread boundaries intentionally
Sometimes the best context management strategy is knowing when to start fresh. If a user's thread has grown to the point where context degradation is noticeable, the product can prompt a new session while carrying forward a summary of what was established.
This is a product design decision, not just a technical one. "Start a new thread" can feel abrupt. "Here's a fresh session with your preferences carried forward" feels thoughtful.
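The rollover mechanic itself is simple; the design work is in the threshold and the framing. A sketch, with an illustrative turn-count threshold and summary format:

```python
# Sketch of an intentional thread boundary: past a threshold, start
# a fresh session seeded with what was established.

MAX_TURNS = 30  # illustrative threshold for "degradation likely"

def maybe_rollover(thread, established_preferences):
    if len(thread) < MAX_TURNS:
        return thread  # still fine, keep going
    carryover = {
        "role": "system",
        "content": "Carried forward: " + "; ".join(established_preferences),
    }
    return [carryover]  # fresh session, preferences preserved

old_thread = [{"role": "user", "content": f"turn {i}"} for i in range(35)]
fresh = maybe_rollover(old_thread, ["formal tone", "no first names"])
# 35 turns collapse into one seed message for the new session.
```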
The metrics that matter
If you're building (or evaluating) an AI product, here are the signals that reveal context-related quality issues:
- Output quality over session length. Compare the quality of the first response in a thread to the tenth, the twentieth. If there's a consistent drop-off, context management needs work.
- Instruction adherence rate. Is the model still following the system instructions after 15 exchanges? After 30? Track how often the output drifts from the defined behavior.
- Token utilization. How much of the context window is being used? How much of it is useful signal versus noise? A context window at 90% capacity filled with raw chat history is a warning sign.
- User corrections per session. If users are re-explaining preferences or correcting tone more frequently later in a session, the context architecture isn't holding.
The landscape is improving, but architecture still matters
Context windows are getting larger. Models are getting better at attending to long contexts. But bigger windows aren't a substitute for good context management, the same way a bigger hard drive doesn't fix a disorganized file system.
A product that intelligently manages what goes into the context window will outperform one that blindly stuffs everything in, regardless of window size. The models reward focus. Give them a clean, well-organized context and they perform. Give them everything you have and hope for the best, and you get the drift you've probably already noticed.
The checklist
Here's what should be in place for any AI product that involves multi-turn conversations:
- A deliberate model of the problem. Before anyone touches context windows, the entities and their relationships are mapped out. What does the AI need to know? Where does each piece of information live? This is design work, not engineering work.
- Separation of conversation and work. High-stakes outputs (documents, reports, recommendations) are generated in their own curated context, not inside the chat thread. The chat is for understanding. The tool call is for doing.
- Critical instruction anchoring. System prompts and key user preferences are positioned where the model attends most strongly, and restated when the context grows.
- Conversation summarization. Older exchanges are compressed into summaries rather than sent verbatim as the thread grows.
- Retrieval architecture. Reference material is pulled in on demand via RAG, complementing (not replacing) the core context that stays embedded.
- Thread length monitoring. The product tracks output quality over session length and has a strategy for when threads get long (summarization, fresh sessions, or both).
- Testing across session lengths. QA includes testing at exchange 1, exchange 10, and exchange 25+. Most teams only test short interactions.
These aren't features your users see. They're design and architecture decisions that determine whether your AI product stays sharp on exchange 20 or starts forgetting what it was told on exchange three.
The answer isn't a better model. It's better modeling of the problem, and better architecture around the model you already have.