As context grows, softmax normalization spreads a fixed attention budget across more tokens, so each relevant token gets less attention. This mathematical property is why AI accuracy drops with length.
Imagine you have one pizza to share. With 4 friends, everyone gets a big slice — 25% each. With 100 friends, everyone gets a sliver — 1% each. The pizza hasn’t gotten worse. There’s just not enough to go around.
In transformer models, attention is the pizza. The softmax function ensures all attention weights sum to exactly 1 (100%). As you add more tokens to the context, each token’s slice of attention gets thinner. This is called attention dilution, and it’s the mathematical root cause of why AI gets worse with longer context.
Softmax converts a vector of raw scores $x = (x_1, \dots, x_n)$ into a probability distribution:

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

Properties:

- Every output lies in $(0, 1)$.
- The outputs sum to exactly 1.
- The function is order-preserving: higher scores always receive higher weights.

In the context of attention, $x_i = \frac{q \cdot k_i}{\sqrt{d}}$ is the relevance score between the current query $q$ and token $i$.
Consider a query and $n$ key vectors. One key is highly relevant (score $= s$), and the other $n - 1$ keys are noise (score $= b$ each, with $s > b$).

The attention weight on the relevant token:

$$\alpha = \frac{e^{s}}{e^{s} + (n-1)\,e^{b}}$$

Let $r = e^{s-b}$ be the “relevance ratio.” Then:

$$\alpha = \frac{r}{r + (n-1)}$$

As $n$ grows:

$$\lim_{n \to \infty} \alpha = \lim_{n \to \infty} \frac{r}{r + (n-1)} = 0$$
No matter how relevant a token is, its attention weight goes to zero as context grows. This is not a flaw — it’s a mathematical consequence of normalization.
For large $n$, the attention weight on the relevant token scales as:

$$\alpha \approx \frac{r}{n}$$
This means doubling the context roughly halves the attention on any given relevant token.
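The closed form can be sanity-checked against a direct softmax computation. A quick sketch; the scores $s = 5$, $b = 1$ are arbitrary illustrative values:

```python
import numpy as np

def alpha_closed_form(s, b, n):
    """alpha = r / (r + n - 1), where r = exp(s - b)."""
    r = np.exp(s - b)
    return r / (r + n - 1)

def alpha_softmax(s, b, n):
    """Weight on the relevant token via an explicit softmax."""
    scores = np.full(n, float(b))
    scores[0] = s
    e = np.exp(scores - scores.max())
    return (e / e.sum())[0]

for n in [10, 100, 1000, 10000]:
    a1 = alpha_closed_form(5.0, 1.0, n)
    a2 = alpha_softmax(5.0, 1.0, n)
    assert abs(a1 - a2) < 1e-12
    print(f"n={n:>5}: alpha = {a1:.6f}  (r/n approx = {np.exp(4.0)/n:.6f})")
```

The `r/n` column shows the large-$n$ approximation closing in on the exact value as $n$ grows.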
```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def demonstrate_dilution():
    """
    Show how attention on a relevant token decreases
    as irrelevant tokens are added.
    """
    d = 64  # Head dimension
    sqrt_d = np.sqrt(d)

    # The relevant token has a high score
    relevant_score = 5.0
    # Noise tokens share a lower baseline score
    noise_mean = 1.0

    context_sizes = [5, 10, 50, 100, 500, 1000, 5000, 10000, 100000]

    print(f"{'Context Size':>15} {'α (relevant)':>15} {'Entropy':>10} {'Bits lost':>10}")
    print("=" * 55)

    for n in context_sizes:
        # Create scores: one relevant + (n-1) noise
        scores = np.full(n, noise_mean / sqrt_d)
        scores[0] = relevant_score / sqrt_d

        # Compute attention weights
        weights = softmax(scores)

        # Entropy measures how "spread out" the attention is
        entropy = -np.sum(weights * np.log2(weights + 1e-12))
        max_entropy = np.log2(n)

        # Bits lost = how much info capacity is wasted
        bits_lost = max_entropy - (-np.log2(weights[0] + 1e-12))

        print(f"{n:>15,} {weights[0]:>15.8f} {entropy:>10.2f} {bits_lost:>10.2f}")

demonstrate_dilution()
```

Output:
```
   Context Size    α (relevant)    Entropy  Bits lost
=======================================================
              5      0.86224699       1.02       0.84
             10      0.76254272       1.59       1.22
             50      0.43876340       3.64       2.50
            100      0.28923680       4.74       3.24
            500      0.07561847       7.11       5.39
          1,000      0.03950614       8.17       6.35
          5,000      0.00812033      10.50       8.61
         10,000      0.00407753      11.57       9.64
        100,000      0.00040958      14.40      12.56
```

At 100K tokens, the relevant token receives only 0.04% of attention — essentially invisible.
Entropy measures the “spread” of the attention distribution:

$$H = -\sum_{i=1}^{n} \alpha_i \log_2 \alpha_i$$

- Focused attention (one token dominates): $H \approx 0$ bits
- Uniform attention (all tokens equal): $H = \log_2 n$ bits
As context grows, entropy increases toward maximum, meaning attention becomes more uniform — which means less useful.
The effective number of tokens the model is attending to:

$$n_{\text{eff}} = 2^{H}$$

For a context of 100K tokens with entropy of 14.4 bits:

$$n_{\text{eff}} = 2^{14.4} \approx 21{,}600 \text{ tokens}$$
The model is effectively spreading its attention across ~21K tokens instead of focusing on the few that matter.
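This “perplexity of attention” is easy to compute for any weight vector. A small sketch; the two synthetic distributions below are illustrative extremes:

```python
import numpy as np

def effective_tokens(weights):
    """2 ** entropy: the perplexity of the attention distribution."""
    H = -np.sum(weights * np.log2(weights + 1e-12))
    return 2.0 ** H

n = 1000

# Uniform attention: n_eff equals the full context size
uniform = np.full(n, 1.0 / n)
print(f"uniform: n_eff ≈ {effective_tokens(uniform):.0f}")

# Sharply focused attention: n_eff stays near 1
focused = np.full(n, 1e-6)
focused[0] = 1.0 - (n - 1) * 1e-6
print(f"focused: n_eff ≈ {effective_tokens(focused):.2f}")
```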
One approach to combat dilution is temperature scaling — dividing scores by a temperature parameter $\tau$ before softmax:

$$\alpha = \text{softmax}\!\left(\frac{x}{\tau}\right)$$
```python
def attention_with_temperature(scores, temperatures):
    """Show how temperature affects attention distribution."""
    print(f"{'Temperature':>12} {'Max weight':>12} {'Min weight':>12} {'Entropy':>10}")
    print("=" * 50)
    for tau in temperatures:
        scaled = scores / tau
        weights = softmax(scaled)
        max_w = weights.max()
        min_w = weights.min()
        H = -np.sum(weights * np.log2(weights + 1e-12))
        print(f"{tau:>12.2f} {max_w:>12.6f} {min_w:>12.6f} {H:>10.2f}")

# 1000 tokens, one highly relevant
n = 1000
scores = np.random.randn(n) * 0.5
scores[0] = 3.0  # The relevant token

attention_with_temperature(scores, [2.0, 1.0, 0.5, 0.1, 0.01])
```

Lower temperature helps the model focus, but too low causes it to ignore potentially useful secondary information. It’s a tradeoff, not a solution.
Research has empirically validated the dilution theory. For example, in the RULER benchmark:
| Model | Accuracy at 128K | Accuracy at 1M | Drop |
|---|---|---|---|
| GPT-4.1 | 80.2% | 37.1% | -54% |
| Claude 3.5 | 84.7% | 52.3% | -38% |
| Gemini 1.5 | 82.1% | 48.6% | -41% |
The accuracy drop follows the theoretical prediction: as $n$ increases by ~8× (128K → 1M), accuracy drops roughly as the $\alpha \approx r/n$ scaling predicts, due to attention dilution.
You might think that having multiple attention heads (e.g., 64 heads) would solve dilution — each head could focus on different relevant tokens. But each head independently applies softmax:

$$\alpha_i^{(h)} = \frac{e^{x_i^{(h)}}}{\sum_{j=1}^{n} e^{x_j^{(h)}}}$$

Every head suffers the same dilution. Multiple heads allow the model to attend to multiple patterns simultaneously, but they don’t increase the total attention budget for any single head.
Total “findable” tokens across all heads:

$$n_{\text{findable}} \approx h \cdot n_{\text{eff}}^{(\text{head})}$$

where $h$ is the number of heads. But this grows much slower than $n$ itself.
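A small simulation makes the point concrete. The scores are synthetic (8 heads, each seeing the same relevant token plus its own noise), an assumption for illustration rather than any particular model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_heads = 8

for n in [100, 1_000, 10_000]:
    per_head = []
    for _ in range(n_heads):
        scores = rng.normal(loc=1.0, scale=0.5, size=n)  # noise tokens
        scores[0] = 5.0                                  # the relevant token
        per_head.append(softmax(scores)[0])
    print(f"n={n:>6}: mean weight on relevant token across heads = {np.mean(per_head):.5f}")
```

Each head’s weight on the relevant token decays like $r/n$ no matter how many heads run in parallel.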
Here’s the core mathematical insight:
The model can extract at most $\log_2 n$ bits of positional information from $n$ tokens. But the information needed to solve a task may grow linearly with $n$. The gap between what’s available and what’s needed widens as context grows:

$$\frac{\text{needed}}{\text{available}} \sim \frac{O(n)}{\log_2 n} \to \infty$$
This is why context rot is theoretically inevitable for sufficiently long contexts — not just an engineering limitation, but an information-theoretic one.
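Tabulating the two sides of that gap ($\log_2 n$ available bits versus a task demand taken to grow linearly, per the argument above) shows how fast it opens:

```python
import math

for n in [1_000, 100_000, 10_000_000]:
    available = math.log2(n)   # bits extractable from n tokens
    ratio = n / available      # linear demand vs. logarithmic supply
    print(f"n={n:>12,}: ~{available:5.1f} bits available, demand/supply ≈ {ratio:,.0f}x")
```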
Bigger context ≠ better context. Beyond a certain point, adding more tokens actively hurts performance.
The “right” tokens matter more than “more” tokens. Retrieving 5 highly relevant documents beats stuffing 500 documents into context.
This won’t be “fixed” by better training. Dilution is a property of softmax normalization, not training quality. It can be mitigated (sharper attention, better positional encodings) but not eliminated within the current architecture.
Linear attention alternatives avoid this by not using softmax — but they have their own tradeoffs (weaker retrieval, less precise attention).
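For contrast, here is a minimal (non-causal) linear-attention sketch using the common `elu(x) + 1` feature map. This is an illustrative toy, not any specific model’s implementation; the shapes and random inputs are arbitrary:

```python
import numpy as np

def elu_plus_one(x):
    # Positive feature map: elu(x) + 1
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """O(n * d^2) attention with no per-token softmax."""
    phi_q, phi_k = elu_plus_one(q), elu_plus_one(k)  # (n, d)
    kv = phi_k.T @ v         # (d, d_v) summary, size independent of n
    z = phi_k.sum(axis=0)    # (d,) normalizer
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
n, d, d_v = 16, 8, 4
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d_v))
out = linear_attention(q, k, v)
print(out.shape)  # (16, 4)
```

The outputs are still convex combinations of the value vectors, but the softmax exponential is gone: the entire context is compressed into the fixed-size `kv` summary, which is exactly where the weaker-retrieval tradeoff comes from.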
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai