WhySoGeek.

Context Rot: Why LLMs Get Worse With More Tokens

A model's answers degrade as you add tokens, long before the window fills. Here is what context rot is, why it happens, and how to design around it.

Sam Carter 8 min read

A million-token context window sounds like infinite room. It is not. Fill even a fraction of it and the model's answers quietly get worse, and the failure mode is sneaky because nothing errors out. This is context rot, and it changes how you should design any long-context system.

Quick answer

Context rot is the measurable decline in an LLM's output quality as you add input tokens, and it happens well before the model hits its window limit. Models attend more to the start and end of the context and less to the middle, so facts buried mid-context are missed more often. Degradation is non-linear: some models are fine at 32K tokens and collapse at 64K. The fix is to keep context short and relevant, put critical information at the edges, and prefer retrieval over stuffing.

Key takeaways

  • More tokens can mean worse answers, even inside the stated window.
  • Rot is not overflow. It starts long before you reach the token limit.
  • The "lost in the middle" effect means mid-context facts get ignored.
  • Degradation is non-linear. Models hit cliffs: they hold steady, then collapse suddenly.
  • Design for it: trim context, place key facts at the start and end, and retrieve instead of dumping.

What context rot actually is

Context rot describes a specific, repeatable phenomenon: as the number of input tokens grows, output quality drops. The crucial point is that this is not the model running out of room. A model with a 200K window can start degrading at 40K or 80K tokens, well short of the ceiling.

That matters because vendors advertise the ceiling. A big context window is a capacity number, not a quality guarantee. Effective context, the range where the model still reasons well, is usually far smaller than the advertised maximum.

Why it happens

The lost-in-the-middle effect

Models do not attend to every token equally. They weight the beginning and end of the context more heavily than the middle. Place a needle fact at 30 to 70% depth and retrieval accuracy drops by several percentage points compared with placing it at the edges. The attention pattern is roughly U-shaped when the window is under half full, and it shifts toward favoring the most recent tokens once the window is more than half full.

The practical takeaway: the middle of a long prompt is the worst place to put something important.
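
A minimal sketch of that placement rule, assuming a plain-text prompt. The function name `build_prompt` and the section labels are hypothetical choices for illustration, not a standard API: the question and key facts go at the top, bulk background sits in the middle, and the critical material is repeated at the end.

```python
def build_prompt(question: str, key_facts: list[str], background: list[str]) -> str:
    """Put the question and key facts at both edges; background goes in the middle."""
    facts = "\n".join(f"- {fact}" for fact in key_facts)
    head = f"Question: {question}\nKey facts:\n{facts}"
    middle = "\n\n".join(background)
    # Repeat the critical material at the end so it also sits in the
    # region that is favored when recent tokens dominate.
    tail = f"Reminder of key facts:\n{facts}\nQuestion: {question}"
    return "\n\n".join(part for part in (head, middle, tail) if part)


# Hypothetical example inputs.
prompt = build_prompt(
    question="What is the refund window?",
    key_facts=["Refunds are accepted within 30 days."],
    background=["(long policy text)", "(long FAQ text)"],
)
```

Repeating the facts costs a few extra tokens, so this only makes sense for the handful of items that truly matter.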

[Figure: A U-shaped attention curve showing higher weight at the start and end of context]

Non-linear cliffs

Degradation is not a gentle slope. Research measuring quality across context lengths finds models that perform well at one length and collapse at the next step up. One model holds together at 32K and falls apart at 64K; another is stable until it suddenly is not. You cannot assume that "a bit more context" costs "a bit more quality."
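
One way to locate a cliff in your own measurements is to look for the first large step-to-step drop. The sketch below assumes you already have accuracy scores per context length; the `measured` numbers are hypothetical placeholders that mirror the 32K/64K example above, and the 0.10 threshold is an arbitrary assumption to tune.

```python
from typing import Optional


def find_cliff(scores: dict[int, float], max_drop: float = 0.10) -> Optional[int]:
    """Return the last tested length before the first drop larger than max_drop.

    Returns None if no drop that large appears in the tested range.
    """
    lengths = sorted(scores)
    for shorter, longer in zip(lengths, lengths[1:]):
        if scores[shorter] - scores[longer] > max_drop:
            return shorter
    return None


# Hypothetical measurements: context length in tokens -> accuracy.
measured = {8_000: 0.96, 16_000: 0.95, 32_000: 0.93, 64_000: 0.61, 128_000: 0.55}
safe_length = find_cliff(measured)
print(safe_length)  # 32000 for the placeholder data above
```

The result is only as fine-grained as the lengths you tested, so the real cliff lies somewhere between the returned length and the next one up.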

Multi-needle is harder than single-needle

A single needle-in-a-haystack test overstates real performance. Production tasks often need several facts retrieved and combined from across a long context. Multi-needle retrieval degrades faster, which is why effective context for real workloads sits below the single-needle numbers.
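
A rough piece of arithmetic shows part of why. Under the simplifying assumption that each needle is recovered independently with the same probability (real models need not behave this way), the chance of recovering all of them shrinks geometrically with the number of needles. The 0.95 figure is a hypothetical single-needle recall, not a measured one.

```python
def all_needles_recall(single_needle_recall: float, needles: int) -> float:
    """Chance of recovering every needle, assuming independent retrieval."""
    return single_needle_recall ** needles


# Hypothetical single-needle recall of 0.95.
for k in (1, 3, 5, 10):
    print(k, round(all_needles_recall(0.95, k), 3))
# 1 0.95
# 3 0.857
# 5 0.774
# 10 0.599
```

A score that looks excellent for one fact can mean the model gets a ten-fact task fully right only about 60% of the time, even before any extra multi-needle degradation.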

| Symptom | Cause | Design response |
| --- | --- | --- |
| Ignores a fact you provided | Lost in the middle | Move it to the start or end |
| Fine on short prompts, wrong on long | Non-linear cliff | Cap context length below the cliff |
| Misses one of several facts | Multi-needle degradation | Retrieve only what is needed |
| Confidently wrong on huge context | Rot plus distraction | Prune irrelevant tokens |

Designing around context rot

The engineering response is not to celebrate bigger windows but to use less of each one, deliberately.

  • Retrieve, do not dump. Pulling in only relevant chunks keeps the context short and the signal high. This is the core case for RAG. Compare the trade-offs in RAG versus long context.
  • Position matters. Put the question and the most critical facts at the start and end, never buried in the middle.
  • Prune aggressively. Every irrelevant token both costs money and dilutes attention. Compress or summarize history rather than appending it forever.
  • Measure your model's cliff. Run a multi-needle test at increasing lengths for the specific model you use and stop well before the drop-off.

| Strategy | When it wins |
| --- | --- |
| RAG with tight chunks | Large corpora, precise facts needed |
| Long context, well organized | Cohesive single document, edges matter |
| Summarize-then-reason | Long conversations or histories |
| Hybrid (retrieve into a modest window) | Most production systems |
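
The "retrieve into a modest window" idea can be sketched as a token budget applied to ranked chunks. This assumes your retriever already returns a relevance score per chunk. The helper names are hypothetical, and `estimate_tokens` uses a crude rule of thumb of about 4 characters per token, which is an assumption; swap in your model's real tokenizer for anything serious.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def pack_chunks(ranked_chunks: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """Keep the highest-scoring chunks that fit inside the token budget."""
    selected = []
    used = 0
    for _score, chunk in sorted(ranked_chunks, key=lambda item: item[0], reverse=True):
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # this chunk would overflow; a smaller one may still fit
        selected.append(chunk)
        used += cost
    return selected


# Hypothetical retrieval results: (relevance score, chunk text).
chunks = [(0.91, "A" * 400), (0.40, "B" * 2000), (0.75, "C" * 400)]
context = pack_chunks(chunks, budget_tokens=250)
```

Setting the budget well below the length where your model starts to degrade keeps the window modest even when the retriever returns many candidates.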

What to do right now

  • Benchmark effective context, not advertised context. Run a multi-needle test on your model and find where quality drops.
  • Keep prompts lean. Remove boilerplate, dedupe, and summarize history instead of appending it.
  • Anchor critical instructions at the very start and the very end of the prompt.
  • Switch to retrieval when context grows large. See adaptive RAG for retrieving only when needed.
  • Manage agent memory explicitly rather than letting history balloon; read context engineering patterns for agents.
  • Fix your chunking so retrieved context is dense with signal. See RAG chunking strategies.
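
As a rough illustration of keeping history lean, the sketch below drops exact duplicate turns, keeps the newest turns that fit a character budget, and hands everything older to a summarizer. `compact_history` and the `summarize` callable are hypothetical; in practice `summarize` would likely call a model, and here it is a placeholder lambda.

```python
from typing import Callable


def compact_history(
    turns: list[str],
    budget_chars: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Dedupe turns, keep the most recent ones within budget, summarize the rest."""
    unique = list(dict.fromkeys(turns))  # drops exact duplicates, keeps order
    recent = []
    used = 0
    for turn in reversed(unique):  # walk from newest to oldest
        if used + len(turn) > budget_chars:
            break
        recent.append(turn)
        used += len(turn)
    recent.reverse()
    older = unique[: len(unique) - len(recent)]
    if not older:
        return recent
    # Note: the summary itself is not counted against the budget in this sketch.
    return [summarize(older)] + recent


# Hypothetical conversation and a stand-in summarizer.
history = ["hi", "hi", "a" * 50, "b" * 30, "c" * 30]
compacted = compact_history(
    history,
    budget_chars=70,
    summarize=lambda older: f"[summary of {len(older)} earlier turns]",
)
```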

Frequently asked questions

If the window is 1M tokens, why not just use it all?

Because effective context is far smaller than the advertised window. Filling it triggers rot: mid-context facts get ignored and quality drops. The window size is a ceiling, not a recommended load.

Does context rot affect every model equally?

No. Some frontier models hold quality much further into their window than others, and the cliff location varies. That is why you should measure the specific model you deploy rather than trusting a spec sheet.

Is RAG immune to context rot?

RAG mitigates it by keeping the prompt short and relevant, but it is not immune. If you retrieve too many chunks and stuff them all into a long context, rot returns. Retrieve tightly and rerank.

How do I test for it myself?

Run a multi-needle retrieval test: hide several facts at varying depths across increasing context lengths and measure how many the model recovers. The length where accuracy drops is your practical limit.
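
A minimal harness for that test might look like the sketch below. All names are hypothetical, and `ask_model` stands for whatever function sends a prompt to your model and returns its text. The `edge_only_model` stub is not a real LLM; it only reads the first and last five lines so the harness can be exercised offline, and context length is counted in filler lines rather than tokens for simplicity.

```python
from typing import Callable

# (relative depth 0.0-1.0, sentence to hide, string expected in the answer)
Needle = tuple[float, str, str]


def build_haystack(filler: list[str], needles: list[Needle]) -> str:
    """Insert each needle sentence at its relative depth in the filler text."""
    lines = list(filler)
    # Insert deepest first so earlier insert positions are not shifted.
    for depth, sentence, _ in sorted(needles, key=lambda n: n[0], reverse=True):
        lines.insert(round(depth * len(filler)), sentence)
    return "\n".join(lines)


def needle_recall(ask_model: Callable[[str], str], filler: list[str], needles: list[Needle]) -> float:
    """Fraction of needles whose expected string appears in the model's answer."""
    prompt = build_haystack(filler, needles) + "\n\nList every secret code mentioned above."
    answer = ask_model(prompt)
    found = sum(1 for _, _, expected in needles if expected in answer)
    return found / len(needles)


def sweep(ask_model: Callable[[str], str], filler_line: str, lengths: list[int], needles: list[Needle]) -> dict[int, float]:
    """Measure recall at each haystack length (counted in filler lines)."""
    return {n: needle_recall(ask_model, [filler_line] * n, needles) for n in lengths}


def edge_only_model(prompt: str) -> str:
    """Stand-in for a model (NOT a real LLM): only 'sees' the first and last 5 lines."""
    lines = prompt.split("\n")
    return "\n".join(lines[:5] + lines[-5:])


needles = [
    (0.0, "The secret code for alpha is 1111.", "1111"),
    (0.5, "The secret code for beta is 2222.", "2222"),
    (1.0, "The secret code for gamma is 3333.", "3333"),
]
results = sweep(edge_only_model, "The sky was grey that day.", [4, 100], needles)
```

With a real model, replace `edge_only_model` with your own client call, use realistic filler text, and check answers more carefully than a substring match.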

