Have you ever had a long conversation with an AI where it started out brilliant — giving precise, thoughtful answers — and by message 30, it was repeating itself, contradicting earlier statements, or ignoring your instructions?
You’re not imagining it. And it’s not because the AI is “getting tired.”
Research has demonstrated this empirically: when scientists tested 18 frontier models (including GPT-4, Claude 3, Gemini, and Llama), every single one showed measurable quality degradation as context length increased. No exceptions.
Let’s understand why.
Think of the AI’s attention like a teacher in a classroom. With 5 students, the teacher can give each student significant individual attention — about 20% each.

With 30 students, each student gets only about 3% of the teacher's attention.

With 200 students, each gets just 0.5% — barely a glance.
Each token in the context window is like a student demanding the AI’s attention. The more tokens there are, the less attention each one gets. This isn’t about the AI being lazy — it’s a mathematical constraint of how transformer models work.
Inside every transformer model, attention weights are computed using the softmax function:

$$a_i = \mathrm{softmax}(s)_i = \frac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}, \qquad s_i = \frac{q \cdot k_i}{\sqrt{d}}$$

Where:

- $a_i$ is the attention weight given to token $i$
- $s_i$ is token $i$'s relevance score (the scaled dot product of the query $q$ with that token's key $k_i$)
- $d$ is the key dimension, and $n$ is the number of tokens in the context

The crucial property of softmax is that all weights must sum to 1:

$$\sum_{i=1}^{n} a_i = 1$$

This means attention is a zero-sum game. If a new token gains attention, some other token must lose attention. As $n$ grows, even the most relevant token’s attention weight decreases, because the denominator keeps growing.
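The zero-sum property is easy to check numerically. A minimal sketch (the scores here are made-up illustrations, not taken from a real model):

```python
import numpy as np

def softmax(scores):
    """Convert raw relevance scores into attention weights."""
    exp = np.exp(scores - np.max(scores))  # shift for numerical stability
    return exp / exp.sum()

# Three tokens: one highly relevant (score 5.0), two background tokens
weights = softmax(np.array([5.0, 1.0, 1.0]))
print(weights.sum())  # the weights always sum to 1: a fixed attention budget

# Append two more background tokens: the relevant token's share shrinks,
# even though its own score never changed
weights_bigger = softmax(np.array([5.0, 1.0, 1.0, 1.0, 1.0]))
print(weights[0], weights_bigger[0])
```

Nothing about token 1 changed between the two calls; it lost attention purely because the denominator grew.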
Let’s say you have 3 tokens, with scores $s_1 = 5.0$ (the relevant token) and $s_2 = s_3 = 1.0$ for simplicity:

$$a_1 = \frac{e^{5}}{e^{5} + 2e^{1}} \approx 0.965$$

Token 1 gets 96.5% of the attention. Great — the model finds the relevant token easily.

Now add 997 irrelevant tokens (each with score 1.0), making $n = 1{,}000$:

$$a_1 = \frac{e^{5}}{e^{5} + 999\,e^{1}} \approx 0.052$$

Token 1’s attention dropped from 96.5% to 5.2% — simply by adding irrelevant tokens to the context.

With $n = 1{,}000{,}000$:

$$a_1 = \frac{e^{5}}{e^{5} + 999{,}999\,e^{1}} \approx 0.000055$$

Down to 0.0055%. The model can barely find the relevant token anymore.
Here’s a Python simulation that demonstrates attention dilution. The scores are used directly as final logits, matching the worked example above:

import numpy as np

def attention_weight(relevance_score, noise_score, n_total):
    """
    Compute the softmax attention weight of a single relevant token
    among (n_total - 1) identical noise tokens. Scores are used as
    final logits (any scaling by sqrt(d) is assumed already applied).
    """
    relevant = np.exp(relevance_score)
    noise = (n_total - 1) * np.exp(noise_score)
    return relevant / (relevant + noise)

# Simulate attention dilution as context grows
context_sizes = [10, 100, 1_000, 10_000, 100_000, 1_000_000]

print(f"{'Context Size':>15} {'Attention Weight':>20} {'Percent':>10}")
print("-" * 50)
for n in context_sizes:
    weight = attention_weight(
        relevance_score=5.0,  # the important token
        noise_score=1.0,      # background noise
        n_total=n,
    )
    print(f"{n:>15,} {weight:>20.6f} {weight * 100:>9.4f}%")

Output:

   Context Size     Attention Weight    Percent
--------------------------------------------------
             10             0.858486   85.8486%
            100             0.355461   35.5461%
          1,000             0.051821    5.1821%
         10,000             0.005431    0.5431%
        100,000             0.000546    0.0546%
      1,000,000             0.000055    0.0055%

With a million tokens, the model can devote only about 0.005% of its attention to the most relevant token. That’s why it “forgets” your instructions.
The attention dilution problem gets worse due to a phenomenon called “Lost in the Middle” (Liu et al., 2023). Researchers found that LLMs don’t distribute their diminished attention uniformly — they show a strong U-shaped bias:
import numpy as np

def simulate_positional_attention(n_tokens: int) -> np.ndarray:
    """
    Simulate the U-shaped attention curve:
    tokens at the start and end get more attention than the middle.
    """
    positions = np.arange(n_tokens)
    normalized = positions / (n_tokens - 1)  # 0 to 1
    # U-shaped curve: high at the edges, low in the middle,
    # using a simple quadratic model: f(x) = 4(x - 0.5)^2
    positional_bias = 4 * (normalized - 0.5) ** 2
    # Add base attention and normalize to a probability distribution
    attention = positional_bias + 0.2
    attention = attention / attention.sum()
    return attention

# Show attention at key positions for a 1000-token context
n = 1000
attention = simulate_positional_attention(n)
positions = [0, 100, 250, 500, 750, 900, 999]

print(f"{'Position':>10} {'Relative Attention':>20}")
print("-" * 35)
for pos in positions:
    print(f"{pos:>10} {attention[pos] / attention.max():>20.4f}")

This means:

- Tokens at the very start of the context (e.g., the system prompt) get disproportionately high attention (primacy bias).
- Tokens at the very end (the most recent messages) also get high attention (recency bias).
- Tokens in the middle get the least attention, which is exactly where long conversations bury important details.
As conversations grow longer, multiple degradation effects compound:

- Attention dilution: the softmax budget is split across ever more tokens.
- Positional bias: mid-context information is systematically under-attended (“lost in the middle”).
- Distractor accumulation: irrelevant tangents, stale error logs, and outdated code compete with the tokens that matter.
The combined effect is what engineers call “context rot” — a gradual decay in response quality that worsens monotonically with context length.
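To get a feel for how the compounding plays out, here is a toy sketch (my own illustration, not a measurement of any real model) that multiplies the softmax dilution from earlier by the U-shaped positional factor:

```python
import numpy as np

def combined_attention(n_tokens: int, position: int,
                       relevant_score: float = 5.0,
                       noise_score: float = 1.0) -> float:
    """
    Toy model of context rot: softmax dilution multiplied by a
    U-shaped positional penalty (worst in the middle of the context).
    """
    # Softmax dilution: one relevant token among uniform noise
    dilution = np.exp(relevant_score) / (
        np.exp(relevant_score) + (n_tokens - 1) * np.exp(noise_score)
    )
    # Positional factor in (0.16, 1.0]: lowest mid-context
    x = position / (n_tokens - 1)
    positional = (4 * (x - 0.5) ** 2 + 0.2) / 1.2
    return dilution * positional

# The same fact, placed at the start vs. buried in the middle
for n in (1_000, 100_000):
    start = combined_attention(n, position=0)
    middle = combined_attention(n, position=n // 2)
    print(f"n={n:>7,}  start={start:.6f}  middle={middle:.6f}")
```

Burying a fact mid-context in a long conversation loses attention twice: once to dilution, once to positional bias.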
Research has shown retrieval accuracy decays roughly exponentially with context length. A simple way to model the pattern:

$$\text{Accuracy}(n) \approx A_0 \, e^{-\lambda (n - n_0)}$$

Where:

- $A_0$ is the accuracy at a reference context length $n_0$
- $n$ is the current context length in tokens
- $\lambda$ is a model-specific decay rate

For a model with 80% accuracy at 128K tokens and $\lambda \approx 4.7 \times 10^{-7}$ per token, stretching the context to 1M tokens gives:

$$\text{Accuracy}(10^6) \approx 0.8 \cdot e^{-4.7 \times 10^{-7} \times 872{,}000} \approx 0.53$$

Accuracy drops from 80% to 53% — a massive degradation.
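As a sanity check, the decay curve can be traced in a few lines of Python. The exponential form and the decay constant below are assumptions back-derived from the 80%-at-128K and 53%-at-1M figures, not measured values:

```python
import numpy as np

# Illustrative exponential decay: Accuracy(n) = A0 * exp(-lam * (n - n0)),
# calibrated (as an assumption) so that 80% at 128K tokens falls to 53% at 1M.
A0, n0 = 0.80, 128_000
n1, a1 = 1_000_000, 0.53
lam = -np.log(a1 / A0) / (n1 - n0)

def accuracy(n_tokens: float) -> float:
    """Modelled retrieval accuracy at a given context length."""
    return A0 * np.exp(-lam * (n_tokens - n0))

for n in (128_000, 256_000, 512_000, 1_000_000):
    print(f"{n:>9,} tokens -> {accuracy(n):.1%}")
```

The shape matters more than the exact constant: every doubling of context chips away a roughly constant fraction of accuracy.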
The simplest fix: don’t let conversations get too long. When you notice quality degrading, start a new conversation and re-state your key context.
Due to the U-shaped attention curve, place your most critical instructions at the very beginning (system prompt) and repeat key details near the end of your context.
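A sketch of that placement strategy; `build_prompt` and its arguments are hypothetical names, not a real API:

```python
def build_prompt(system_rules: str, history: list[str], user_query: str) -> str:
    """
    Exploit the U-shaped attention curve: critical instructions go
    first (primacy) and are restated just before the query (recency).
    """
    parts = [system_rules]            # start of context: high attention
    parts.extend(history)             # middle: lowest attention, keep it lean
    parts.append(f"Reminder of the key rules:\n{system_rules}")  # end: high attention
    parts.append(user_query)
    return "\n\n".join(parts)

prompt = build_prompt(
    system_rules="Answer in JSON. Cite a source for every claim.",
    history=["...earlier turns..."],
    user_query="Summarize the findings.",
)
print(prompt)
```

The duplication costs a few tokens, but it places the rules in both high-attention regions instead of letting them sink into the middle.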
If your conversation includes irrelevant tangents, long error logs, or outdated code snippets that are no longer relevant — remove them. Every unnecessary token dilutes attention from the tokens that matter.
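A minimal pruning pass might look like the sketch below; the `stale` flag stands in for whatever relevance heuristic fits your application:

```python
def prune_context(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """
    Drop messages flagged as noise, then keep only the most recent turns.
    Each message is a dict like {"role": ..., "content": ..., "stale": bool}.
    """
    def is_stale(msg: dict) -> bool:
        # Placeholder heuristic: e.g., superseded code, resolved error logs
        return msg.get("stale", False)

    kept = [m for m in messages if not is_stale(m)]
    return kept[-max_messages:]  # recency matters: keep the latest turns

history = [
    {"role": "user", "content": "old traceback...", "stale": True},
    {"role": "assistant", "content": "fixed version of the function"},
    {"role": "user", "content": "now add tests"},
]
print(prune_context(history))  # the stale traceback is gone
```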
Instead of carrying everything in context, use a retrieval system that pulls in only the relevant information for each query. This keeps context small and focused.
# Bad: Stuff everything in context
context = entire_codebase + all_docs + full_conversation_history
# Good: Retrieve only what's relevant
relevant_code = retriever.search(user_query, top_k=5)
relevant_docs = doc_search.search(user_query, top_k=3)
recent_history = conversation[-5:] # Last 5 turns only
context = relevant_code + relevant_docs + recent_history

There’s an inherent tension in AI systems:

- More context gives the model more information to ground its answers in.

But also:

- More context dilutes attention, slows responses, and increases cost.
The optimal point isn’t “maximum context” — it’s “just enough context.” That’s the point where you have all the information the model needs, but nothing more.
Finding that optimal point is the fundamental challenge of context management.
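One way to operationalize “just enough context” is a token budget: rank candidate snippets by relevance and pack them until the budget is spent. The relevance scores below are hypothetical stand-ins for a real retriever’s output:

```python
def fit_to_budget(snippets: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """
    Greedily pack the highest-relevance snippets into a fixed token budget.
    `snippets` is a list of (relevance_score, text); tokens are approximated
    as whitespace-separated words for illustration.
    """
    chosen, used = [], 0
    for score, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = len(text.split())  # crude token estimate
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen

snippets = [
    (0.9, "def retry(fn): ..."),
    (0.2, "changelog entry from 2019 about an unrelated module"),
    (0.7, "docs: retry uses exponential backoff"),
]
print(fit_to_budget(snippets, budget_tokens=12))
```

The budget forces an explicit trade-off: low-relevance material is the first thing to go, which is exactly what attention dilution punishes.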
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai