Context Rot in LLMs — Performance vs Context Size

MRCR v2, NIAH, NoLiMa, and LongCodeBench benchmarks across Claude, GPT, and Gemini models · March 2026

32K — context size at which 11/12 models drop below 50% of baseline (NoLiMa)
Opus 4.6 improvement over Sonnet 4.5 on MRCR @ 1M
90% — coding quality loss from 32K → 256K (Sonnet 3.5)
60% — recommended max context utilization before compaction
Key insight: standard Needle-in-a-Haystack scores are misleading. GPT-4.1 scores 100% and Claude 3.5 Sonnet ~99% on lexical NIAH, but when NoLiMa removes the keyword overlap between needle and question, 11/12 models drop below 50% of their baseline at just 32K tokens. In practice, the effective context window is typically only 60–70% of the advertised maximum.
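
A practical consequence is to budget prompts against the effective window rather than the advertised one. Below is a minimal sketch of that arithmetic, assuming the conservative end of the 60–70% estimate above; the 0.6 multiplier, constants, and function name are illustrative, not from any vendor API.

```python
# Minimal sketch: derive a usable token budget from an advertised context
# window, using the conservative end of the 60-70% effective-window estimate.

EFFECTIVE_FRACTION = 0.6  # assumption: tune per model and task


def effective_budget(advertised_window: int,
                     fraction: float = EFFECTIVE_FRACTION) -> int:
    """Token count to stay under before degradation becomes likely."""
    return int(advertised_window * fraction)


# Example: a hypothetical 200K-token model yields a ~120K working budget.
print(effective_budget(200_000))  # -> 120000
```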

[Chart: MRCR v2 (8-needle) — Score vs Context Size]

[Chart: Task-Specific Degradation Rate]

[Chart: Model Accuracy at Key Context Sizes (MRCR + NoLiMa)]

Context Danger Zones — When to Compact

Zone | Token Range | Risk | Action
Safe | 0 – 8K | All models at peak performance | No action needed
Caution | 8K – 32K | Lost-in-the-middle effect begins (30%+ drop for mid-context info) | Place key info at start/end
Degrade | 32K – 64K | Most models below 50% of baseline on non-literal tasks | Compact history, use RAG
High Risk | 64K – 128K | Coding accuracy falls 29% → 3%; response times spike | Aggressively compact or /clear
Critical | 128K+ | Only Opus 4.6 & Gemini 2.5 Pro remain viable | Retrieval-only tasks; no complex reasoning
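
The zone boundaries above translate directly into a compaction policy. Here is a minimal sketch that maps a running token count to the table's zones and recommended actions; the thresholds are copied from the table, and the function name is illustrative.

```python
# Minimal sketch: classify a context size into the danger zones above.

def context_zone(tokens: int) -> tuple[str, str]:
    """Return (zone, recommended action) for a context of `tokens` tokens."""
    if tokens < 8_000:
        return "Safe", "No action needed"
    if tokens < 32_000:
        return "Caution", "Place key info at start/end"
    if tokens < 64_000:
        return "Degrade", "Compact history, use RAG"
    if tokens < 128_000:
        return "High Risk", "Aggressively compact or /clear"
    return "Critical", "Retrieval-only tasks; no complex reasoning"


print(context_zone(45_000))  # -> ('Degrade', 'Compact history, use RAG')
```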

[Chart: Standard NIAH vs NoLiMa (Real-World Retrieval)]

[Chart: Lost-in-the-Middle — Recall by Needle Position (128K)]
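
The U-shaped recall curve reported by Liu et al. (2023) suggests a simple mitigation: order retrieved chunks so the highest-ranked ones sit at the start and end of the prompt, where recall is strongest, pushing weaker chunks toward the middle. A minimal sketch of that reordering follows; the function name and example data are illustrative assumptions, not from the paper.

```python
# Minimal sketch: place the most relevant chunks at the edges of the prompt,
# pushing weaker chunks toward the middle where recall is worst.

def edge_order(chunks_by_relevance: list[str]) -> list[str]:
    """chunks_by_relevance is sorted best-first; returns prompt order."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]  # best chunk at start, second-best at end


print(edge_order(["d1", "d2", "d3", "d4", "d5"]))
# -> ['d1', 'd3', 'd5', 'd4', 'd2']
```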

Sources

Michelangelo Benchmark (DeepMind) — arXiv:2409.12640
OpenAI MRCR v2 Dataset — Hugging Face
MRCR v2 (8-needle) Leaderboard — llm-stats.com
Introducing Claude Opus 4.6 — anthropic.com
Claude Sonnet 4.6 System Card — anthropic.com
Context Rot: Chroma Research — trychroma.com
NoLiMa: Beyond Literal Matching — arXiv:2502.05167
Context Length Hurts Despite Perfect Retrieval — arXiv:2510.05381
LongCodeBench (1M coding eval) — arXiv:2505.07897
Effective Context Engineering for AI Agents — anthropic.com
Lost in the Middle (Liu et al. 2023) — arXiv:2307.03172
Long-Context RAG: OpenAI o1 & Gemini — Databricks
RULER Benchmark (NVIDIA, COLM 2024) — arXiv:2404.06654
Greg Kamradt's NIAH Test — GitHub
Gemini 1.5 Technical Report — arXiv:2403.05530
Maximum Effective Context Window — arXiv:2509.21361