What is context rot in LLM agents?

Context rot is the measured decline in an LLM's accuracy, instruction-following, and reasoning as its active context grows, even when nothing is truncated and the added tokens are irrelevant to the task. Chroma Research's 2025 study found every one of 18 frontier models tested degraded as input tokens grew. It begins well inside the advertised context window.

What is the dumb zone for coding agents?

The dumb zone is the practitioner term, introduced via HumanLayer's analysis of roughly 100k coding-agent sessions, for the 50k-100k token range where agents start producing more errors, tool-call loops, and premature terminations. The model still technically supports 200k or more tokens, but agent-loop reliability drops well before the window fills.

Do 1M-token context windows fix context rot?

No. Improvements at the frontier are concentrated in literal-matching retrieval, raw window size, and cost via prompt caching. Reasoning over long contexts, instruction-following, and agent-loop reliability still degrade, which is why benchmarks like NoLiMa and RULER show effective context far below advertised context.

How do you engineer around long context degradation?

Externalize state to files and git (the outer loop), and reconstruct a small fresh context for each model call (the inner loop). Combine compaction, structured note-taking in progress files, sub-agent delegation with narrow contexts, prompt-cached stable prefixes, and position-aware prompt design that keeps critical information at the end of the prompt.

Is lost in the middle still a problem in newer models?

It's reduced but not eliminated. Liu et al. documented 20-40 point accuracy drops for mid-context information in 2023, and replications show GPT-4o, Claude 3.5/4, and Gemini 1.5 have shallower U-shaped curves than the models originally tested. The bias remains measurable and still matters for agent prompt design.

Context Rot and the Dumb Zone: Engineering Past 100k Tokens

In June 2025, Chroma Research tested 18 frontier models and found that every single one degraded as input tokens grew. On one JSON-copy task at 30,000 input tokens, Claude Sonnet 4 reportedly fell from near-perfect accuracy to roughly zero.

Thirty thousand tokens. Not a million. Not even close to the advertised window.

This is context rot, and if you're building agents, it's the constraint that actually governs your architecture. The vendors sell you a 200k, 1M, or 10M token context window. The model gives you reliable attention over a fraction of it.

Context rot is the empirical decline in an LLM's accuracy, instruction-following, and reasoning quality as its active context grows, even when no information is truncated and the added tokens have nothing to do with the task. It begins well inside the advertised context window, and it affects every model tested.

TL;DR: Long context degradation is real, documented across academic benchmarks (Lost in the Middle, RULER, NoLiMa), vendor research, and production telemetry. Practitioners report a "dumb zone" starting around 50k-100k tokens where agents loop, hallucinate, and quit early. The fix isn't a bigger window. It's separating the inner loop (a small, reconstructed working context) from the outer loop (durable state in files, git, and memory stores).

Key takeaways

Chroma's study found all 18 tested models degrade as input grows, including 1M-window models like Gemini 2.5 Pro and GPT-4.1.
NVIDIA's RULER benchmark puts effective context at roughly one third of advertised context on harder tasks.
HumanLayer's analysis of ~100k coding-agent sessions found error rates and premature terminations spike past 50k-100k working tokens.
The "lost in the middle" U-shape is shallower in newer models but still measurable.
The architectural answer: keep the inner loop small, externalize state to the filesystem and git, delegate to sub-agents, and cache stable prefixes.

What is context rot, exactly?

The term was coined on Hacker News by user Workaccount2 and surfaced the same day by Simon Willison in June 2025. The original observation: "performance degrades as context size grows, even on tasks that have nothing to do with the content of the context."

That last clause is the important one. This isn't running out of room. It's a steady decline in attention quality while the window still has plenty of space.

The Chroma report made it rigorous. Across needle-in-haystack with distractors, JSON copy and corruption tasks, and multi-document reasoning, degradation was universal. The specific symptoms recur across models: verbatim repetition of phrases and tool calls, premature termination with empty or truncated responses, hallucinated "quotes" that don't exist in the context, and format collapse on structured output.

Three mechanisms appear to compound. Attention dilution spreads a fixed attention budget over more tokens. Position bias starves the middle of the context. And distractor interference means plausible-but-irrelevant content actively drags accuracy down, even when it's not part of the question.
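
A toy calculation can make the dilution mechanism concrete. The sketch below assumes a single softmax over one relevant token and a number of distractor tokens whose logits are all zero. This is a deliberate simplification, not a model of any real transformer, and the function name and the logit value are hypothetical.

```python
import math

def relevant_token_weight(relevant_logit: float, n_distractors: int) -> float:
    """Softmax weight on one relevant token competing with n_distractors
    tokens whose logits are all 0 (a simplifying assumption)."""
    signal = math.exp(relevant_logit)
    # Each distractor contributes exp(0) = 1 to the softmax denominator.
    return signal / (signal + n_distractors)

# The relevant token's score is unchanged; only the amount of competition grows.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} distractors -> weight {relevant_token_weight(8.0, n):.3f}")
```

In this toy setting the relevant token's share shrinks as the context grows even though its own score never changes, which is the intuition behind "a fixed attention budget over more tokens."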

One caveat for honesty's sake: the headline "near-zero at 30k" figure circulates widely from the Chroma study, but it's a single task on a single model. Treat it as an existence proof of how steep the cliff can get, not a universal constant.

Lost in the middle: the U-shaped curve

The foundational academic result is Liu et al.'s "Lost in the Middle" (2023, published in TACL 2024). Place a relevant fact at the start or end of a long context and models retrieve it reliably. Place it in the middle and accuracy drops 20 to 40 percentage points depending on model and task.

The mechanism is well understood. Decoder-only transformers over-attend to recent tokens (recency bias) and initial tokens (primacy, often attention-sink behavior). The middle gets the least attention.

Replications through 2024-2025 confirmed the shape while refining it. Newer models like GPT-4o, Claude 3.5/4, and Gemini 1.5 show a shallower U than the LLaMA-2 and GPT-3.5 era, but the curve hasn't disappeared.

And a 2025 paper argues the bias is partly a training artifact rather than purely architectural, which suggests it may keep shrinking but won't vanish on its own.

Two benchmarks sharpen the picture. NoLiMa rephrases retrieval tasks so they require reasoning over evidence instead of literal string matching; per the paper, most tested models fell below 50% accuracy even at modest context lengths, and very few crossed 50% at 64k and beyond. NVIDIA's RULER adds multi-hop tracing and aggregation, and shows effective context size running at roughly a third of the advertised window.

The RULER 128k leaderboard still shows no model reliably achieving its advertised length on the harder subtasks.

The lesson: needle-in-haystack scores systematically overstate real capability, because real agent work is never literal-matching retrieval.

Where does the 100k dumb zone come from?

The benchmark numbers are corroborated by production data. The most influential practitioner evidence comes from HumanLayer's analysis of roughly 100k coding-agent sessions, which introduced the term "dumb zone."

Once working context passes about 50k-100k tokens, Claude Code and similar tools produce dramatically more errors, more tool-call loops, and more premature terminations. That's well inside the 200k window the model technically supports.

Drew Breunig's "How Long Contexts Fail" taxonomy, now canonical in the agent-engineering community, names four failure modes:

Failure mode	What happens
Poisoning	A hallucination from earlier in the session gets treated as fact later
Distraction	Plausible-looking irrelevant content pulls outputs off-task
Confusion	The model loses track of which sub-task it's on and mixes instructions
Clash	Contradictory context (a stale variable name vs. a fresh one) causes oscillation

Poisoning is the nastiest one for long-running agents, because it compounds. Every turn that builds on a poisoned fact deepens the error, and compaction can bake the poison into the summary.

Doesn't a 1M-token context window fix this?

Partially, and it's worth being precise about which part. Google's Gemini 1.5 technical report demonstrated near-perfect needle-in-haystack retrieval up to 10M tokens, the strongest single piece of counter-evidence around. But that's the exact benchmark NoLiMa and RULER show to be the most flattering, and it was a research result rather than a production capacity.

Anthropic shipped a 1M-token Claude Sonnet 4.5 with prompt caching, and the RULER and NoLiMa trend lines through 2026 show genuine improvement at the frontier.

The honest synthesis, as of mid-2026: the gains are concentrated in literal-matching retrieval, the raw upper bound of the window, and cost and latency via caching. They are not concentrated in reasoning over long contexts, instruction-following at length, or agent-loop reliability.

Those are precisely the things production agents need. The practitioner consensus, reflected across Willison's writing, is that context is now cheap to send, but attention is still scarce.

Inner loop, outer loop: the architecture that actually works

The architectural response is to treat in-context state as a scarce, lossy resource and push durable state outside the model entirely.

Anthropic's "Building Effective AI Agents" (December 2024) is the most influential articulation. The inner loop is the LLM call: a constructed context, rebuilt each turn, kept small. The outer loop is everything durable: the filesystem, version control, structured memory, the record of decisions made and artifacts produced.

The follow-up "Scaling Managed Agents" post sharpens it into "decoupling the brain from the hands": the LLM decides in-context, while file edits and commands run in a checkpointed, resumable outer loop. Lose the context, resume from the checkpoint, never re-run the world.

In practice the outer loop is mostly files and git:

Canonical state files. Claude Code reads and updates a project-level CLAUDE.md; Aider uses conventions files; Cursor uses project rules. One human-readable file the agent treats as ground truth and edits incrementally.
Progress files. Anthropic's harness guidance and Manus's context-engineering writeup both describe a progress.md recording the current goal, what was tried, what failed, and the next action. Crucially it's not a transcript. It's a compressed record the agent writes itself, and the next turn reconstructs context from it. Manus credits this pattern with maintaining coherence across 50+ tool-call sessions.
Git as memory. Aider commits after every successful edit. Git worktrees give each parallel sub-agent an isolated working directory, so its in-context state never bleeds into another's, and the commit history is the durable record.
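
The progress-file pattern above can be sketched in a few lines. This is a minimal illustration, not Anthropic's or Manus's implementation. The `Progress` fields mirror the four items named in the text, while the helper names, the `progress.json` filename, and the use of JSON rather than Markdown (chosen only so the record round-trips without a parser) are assumptions.

```python
import json
import tempfile
from dataclasses import dataclass, asdict, field
from pathlib import Path
from typing import List

@dataclass
class Progress:
    goal: str
    tried: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)
    next_action: str = ""

def save_progress(path: Path, progress: Progress) -> None:
    # Write to a sibling temp file, then rename, so a crash mid-write
    # never leaves a half-written record behind.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(asdict(progress), indent=2), encoding="utf-8")
    tmp.replace(path)

def load_progress(path: Path) -> Progress:
    return Progress(**json.loads(path.read_text(encoding="utf-8")))

def context_from_progress(p: Progress) -> str:
    """Rebuild a small working context from the record, not from the transcript."""
    lines = [f"Goal: {p.goal}"]
    lines += [f"Tried: {t}" for t in p.tried]
    lines += [f"Failed: {f}" for f in p.failed]
    lines.append(f"Next action: {p.next_action}")  # most actionable item last
    return "\n".join(lines)

# Usage: one turn writes the record, the next turn starts from it.
with tempfile.TemporaryDirectory() as workdir:
    record = Path(workdir) / "progress.json"  # hypothetical filename
    save_progress(record, Progress("fix flaky test", ["rerun CI"], ["rerun CI"], "inspect fixture"))
    print(context_from_progress(load_progress(record)))
```

The next turn's context is a handful of lines regardless of how many tool calls preceded it, which is the point of keeping a compressed record instead of a transcript.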

Sub-agents are the inner-loop isolation pattern taken to its conclusion. Anthropic's multi-agent research system reported a 90.2% improvement over a single-agent baseline on a research task, at roughly 15x the token cost.

That cost is the price of keeping each worker's context narrow enough to stay reliable, and the orchestrator only ever sees summaries, never full transcripts.
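
A rough sketch of that isolation boundary follows. It assumes two caller-supplied functions, `run_worker` (runs one sub-agent in a fresh context and returns its full transcript) and `summarize` (compresses a transcript). Both are hypothetical stand-ins for real model calls, and none of the names come from Anthropic's system.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class WorkerResult:
    task: str
    summary: str            # the only text the orchestrator keeps
    transcript_chars: int   # size of what was discarded, kept for telemetry

def delegate(tasks: List[str],
             run_worker: Callable[[str], str],
             summarize: Callable[[str], str]) -> List[WorkerResult]:
    results = []
    for task in tasks:
        transcript = run_worker(task)  # each worker starts from a fresh, narrow context
        results.append(WorkerResult(task, summarize(transcript), len(transcript)))
        # The transcript goes out of scope here; it never reaches the orchestrator.
    return results

def orchestrator_context(goal: str, results: List[WorkerResult]) -> str:
    lines = [f"Goal: {goal}"]
    lines += [f"- {r.task}: {r.summary}" for r in results]
    return "\n".join(lines)

# Usage with stub functions standing in for real model calls.
results = delegate(
    ["survey benchmarks", "collect telemetry"],
    run_worker=lambda task: f"[long transcript for {task}] " + "step " * 500,
    summarize=lambda transcript: transcript[:40].strip(),
)
print(orchestrator_context("write the report", results))
```

The orchestrator's context grows by one short line per worker rather than by a full transcript, which is what the extra token spend buys.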

The tactical checklist

Production agents combine all of these, not one:

Compaction, but not alone. Summarize and prune the working context past a threshold, as Cline's Auto Compact does. Anthropic's harness post is blunt that compaction "isn't sufficient on its own" without state externalization.
Structured note-taking. End every turn by writing a recap (goal, last action, last result, next action) to a file. Start every turn from that file plus fresh references, not the raw history.
Sub-agent delegation. Workers get narrow, fresh contexts; Anthropic's widely adopted heuristic is about 5-7 tool calls' worth per worker.
Prompt-cached stable prefixes. Pin system prompts, reference docs, and tool definitions in the cache. The inner loop re-processes only the delta, per Anthropic's context-engineering guidance.
Position-aware layout. Stable context at the start, critical current-state information (the goal, the latest error) at the very end. Never park must-attend facts in the middle.
Termination discipline. A max-iteration cap plus an explicit "am I done?" check, because at long context the model will sometimes return an empty response and your loop will read it as success.
Hybrid retrieval. Use RAG to inject relevant slices of large documents; reserve the window for working memory. Hamel Husain's notes on context rot make the case that retrieval and long context are complements, not rivals. For cross-session memory, layers like mem0 and Letta's tiered core/recall/archival model externalize facts beyond any single window.
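
Two of these items, position-aware layout and termination discipline, fit in one short sketch. It assumes a caller-supplied `call_model` function and an `is_done` check. Both are hypothetical placeholders, and the prompt labels are illustrative rather than any vendor's format.

```python
from typing import Callable, List, Optional, Tuple

def build_prompt(stable_prefix: str, references: List[str],
                 goal: str, latest_error: Optional[str] = None) -> str:
    """Stable material first, must-attend state last, nothing critical in the middle."""
    tail = [f"Current goal: {goal}"]
    if latest_error:
        tail.append(f"Latest error: {latest_error}")
    return "\n\n".join([stable_prefix, *references, *tail])

def run_loop(call_model: Callable[[str], str],
             is_done: Callable[[str], bool],
             stable_prefix: str, goal: str,
             max_iterations: int = 8) -> Tuple[str, Optional[str]]:
    latest_error: Optional[str] = None
    for _ in range(max_iterations):  # hard cap: the loop cannot run forever
        reply = call_model(build_prompt(stable_prefix, [], goal, latest_error))
        if not reply.strip():
            # An empty reply is a failure signal, never an implicit success.
            latest_error = "previous reply was empty"
            continue
        if is_done(reply):  # explicit "am I done?" check
            return "done", reply
        latest_error = None
    return "max_iterations", None

# Usage with a stub model that returns an empty reply first.
replies = iter(["", "still working", "DONE: tests pass"])
status, final = run_loop(lambda prompt: next(replies),
                         lambda reply: reply.startswith("DONE"),
                         stable_prefix="You are a coding agent.",
                         goal="make the test suite pass")
print(status, final)
```

The design choice worth noting is that an empty response and a finished task are distinct outcomes: the loop returns "done" only when the explicit check passes, and it reports hitting the cap separately.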

What this means for you

Budget for usable context, not advertised context. If RULER says a third and your own telemetry says the dumb zone starts near 50k, design your inner loop to live comfortably under that, and treat anything beyond as cache territory for stable prefixes.
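
As a back-of-the-envelope version of that budgeting rule, the sketch below takes the smaller of the benchmark-implied ceiling and your own observed threshold, then applies a safety margin. The one-third default comes from the RULER figure cited above. The function name and the 0.8 margin are assumptions to tune, not recommendations from any source.

```python
from typing import Optional

def inner_loop_budget(advertised_window: int,
                      effective_fraction: float = 1 / 3,
                      observed_threshold: Optional[int] = None,
                      safety_margin: float = 0.8) -> int:
    """Token budget for the working context (all defaults are illustrative)."""
    ceiling = advertised_window * effective_fraction
    if observed_threshold is not None:
        # Your own telemetry wins when it is stricter than the benchmark estimate.
        ceiling = min(ceiling, observed_threshold)
    return int(ceiling * safety_margin)

# A 200k window with a dumb zone observed near 50k tokens.
print(inner_loop_budget(200_000, observed_threshold=50_000))
```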

Instrument it. Track working-context size per turn and correlate it with retries, loops, and empty responses. You'll likely find your own dumb-zone threshold within a week of production traffic.
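
One way to do that correlation is sketched below: bucket turns by working-context size, compute a failure rate per bucket, and report the first bucket whose rate clearly exceeds the small-context baseline. The `Turn` record, the 10k bucket size, and the "twice the baseline" rule are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

@dataclass
class Turn:
    context_tokens: int
    failed: bool  # a retry, a tool-call loop, or an empty response

def failure_rate_by_bucket(turns: Iterable[Turn], bucket_size: int = 10_000) -> Dict[int, float]:
    totals: Dict[int, int] = defaultdict(int)
    failures: Dict[int, int] = defaultdict(int)
    for turn in turns:
        bucket = (turn.context_tokens // bucket_size) * bucket_size
        totals[bucket] += 1
        failures[bucket] += int(turn.failed)
    return {b: failures[b] / totals[b] for b in sorted(totals)}

def dumb_zone_threshold(rates: Dict[int, float],
                        baseline_multiple: float = 2.0,
                        min_baseline: float = 0.01) -> Optional[int]:
    """First bucket whose failure rate exceeds a multiple of the smallest bucket's rate."""
    if not rates:
        return None
    buckets = sorted(rates)
    # Floor the baseline so a zero-failure first bucket does not flag every later one.
    baseline = max(rates[buckets[0]], min_baseline)
    for bucket in buckets[1:]:
        if rates[bucket] > baseline_multiple * baseline:
            return bucket
    return None

# Usage with synthetic turns: 10% failures at ~5k tokens, 50% at ~55k tokens.
turns = [Turn(5_000, i == 0) for i in range(10)] + [Turn(55_000, i < 5) for i in range(10)]
rates = failure_rate_by_bucket(turns)
print(rates, dumb_zone_threshold(rates))
```

With real traffic the buckets will be noisier than this synthetic example, so it is worth requiring a minimum number of turns per bucket before trusting the threshold.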

And stop reaching for the bigger window as the fix. The evidence from 2023 through 2026 is consistent: windows grew 50x, and the failure modes that matter for agents moved far less.

The teams shipping reliable long-running agents are the ones treating context as a cache to be managed, with the filesystem and git as the real memory. That's an engineering discipline, and it compounds, which is more than you can say for tokens in the middle of a million-token prompt.