There are fundamental limits on how much information a fixed-width attention mechanism can extract from n tokens. Here's the math from Shannon's channel capacity to attention bounds.
Imagine listening to a radio with a narrow bandwidth. If one person talks, you hear them clearly. If 10 people talk simultaneously on the same frequency, you hear a garbled mess. The radio’s bandwidth hasn’t changed — it simply can’t separate 10 signals at once.
A transformer’s attention mechanism is like that radio. It has a fixed “bandwidth”, determined by the model dimension $d_{\text{model}}$ and the number of heads $h$. As you pack more tokens into the context, the mechanism must extract more information through the same fixed-width channel. At some point, it physically can’t keep up.
This post derives those theoretical limits from information theory.
Claude Shannon proved in 1948 that every communication channel has a maximum rate at which information can be reliably transmitted:

$$C = B \log_2\left(1 + \frac{S}{N}\right)$$

Where:

- $C$ is the channel capacity (bits per second)
- $B$ is the bandwidth
- $S$ is the signal power and $N$ is the noise power

For a Gaussian channel with signal power $S$ and noise power $N$, the capacity per channel use is:

$$C = \frac{1}{2}\log_2\left(1 + \frac{S}{N}\right)$$
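To see how tight this is, here is a quick numeric sketch (the function name is mine): capacity grows only logarithmically, so doubling the signal-to-noise ratio buys well under one extra bit per channel use.

```python
import numpy as np

def gaussian_capacity(snr):
    """Gaussian channel capacity in bits per channel use: (1/2) * log2(1 + S/N)."""
    return 0.5 * np.log2(1.0 + snr)

c10 = gaussian_capacity(10.0)  # ≈ 1.73 bits per use
c20 = gaussian_capacity(20.0)  # ≈ 2.20 bits per use
```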
The attention mechanism can be viewed as a communication channel:
The “channel” has a fixed width $d_v$ (the value dimension). The output can carry at most $d_v \cdot b$ bits of information, where $b$ is the precision of each dimension in bits.
For a single query $q$, the information extracted about token $i$ is related to its attention weight $\alpha_i$:

$$I_i \leq d_v \cdot \alpha_i \log_2\frac{1}{\alpha_i}$$

Since $\sum_{i=1}^{n} \alpha_i = 1$, the total information extractable is bounded:

$$I_{\text{total}} \leq d_v \sum_{i=1}^{n} \alpha_i \log_2\frac{1}{\alpha_i} = d_v \cdot H(\alpha)$$

Where $H(\alpha)$ is the entropy of the attention distribution.
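A quick sanity check of this bound (a sketch; the variable names are mine): the entropy of any attention distribution over $n$ tokens is at most $\log_2 n$, so the extractable information per query is capped at $d_v \log_2 n$ bits no matter what the logits are.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(scores):
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

d_v, n = 128, 1000
alpha = softmax(rng.standard_normal(n))  # a random attention distribution
H = -np.sum(alpha * np.log2(alpha))      # entropy H(alpha) in bits
bound = d_v * H                          # extractable-information bound
ceiling = d_v * np.log2(n)               # absolute cap: d_v * log2(n)
```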
For $n$ tokens, each containing $d \cdot b$ bits of information, the total information available is $n \cdot d \cdot b$ bits.

But the attention mechanism can extract at most $d_v \cdot H(\alpha) \leq d_v \log_2 n$ bits per query.

As $n \to \infty$, the ratio $\frac{d_v \log_2 n}{n \cdot d \cdot b} \to 0$: the fraction of available information that attention can extract approaches zero. This is the fundamental limit.
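To make the vanishing fraction concrete, here is a minimal sketch assuming $b = 8$ bits per dimension and $d_v = d$ (so the dimension cancels out of the ratio):

```python
import numpy as np

b = 8  # assumed bits per dimension
# Fraction of the context's n*d*b bits one query can extract: (d*log2 n)/(n*d*b)
fracs = [np.log2(n) / (b * n) for n in [100, 10_000, 1_000_000]]
```

Each 100× increase in context cuts the extractable fraction by roughly 100×.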
Consider a query $q \in \mathbb{R}^d$ and keys $k_1, \dots, k_n$. Let $k_1$ be the single relevant key (target), and let all others be noise drawn from $\mathcal{N}(0, I)$.

The attention weight on the target:

$$\alpha_1 = \frac{e^{s_1}}{\sum_{j=1}^{n} e^{s_j}}, \qquad s_j = \frac{q^\top k_j}{\sqrt{d}}$$

If $k_1$ has signal strength $s_1 = \mu$ and noise keys have $s_j \sim \mathcal{N}(0, 1)$ (after scaling by $\sqrt{d}$):

$$\mathbb{E}[\alpha_1] \approx \frac{e^{\mu}}{e^{\mu} + (n-1)\,\mathbb{E}[e^{s_j}]}$$

Where $R = e^{\mu - 1/2}$ and $\mathbb{E}[e^{s_j}] = e^{1/2}$ (moment generating function of the standard normal):

$$\mathbb{E}[\alpha_1] \approx \frac{R}{R + n - 1}$$

For $n \gg R$:

$$\mathbb{E}[\alpha_1] \approx \frac{R}{n} \to 0$$
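This approximation is easy to check by simulation under the same assumptions (target score fixed at $\mu$, the other $n-1$ scores drawn i.i.d. from $\mathcal{N}(0, 1)$); the numbers here are a sketch, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, trials = 3.0, 1000, 2000

alphas = np.empty(trials)
for t in range(trials):
    scores = rng.standard_normal(n)   # noise scores ~ N(0, 1)
    scores[0] = mu                    # the single relevant (target) score
    e = np.exp(scores - scores.max())
    alphas[t] = (e / e.sum())[0]      # softmax weight on the target

R = np.exp(mu - 0.5)                  # R = e^{mu - 1/2}
predicted = R / (R + n - 1)
empirical = alphas.mean()
```

The empirical mean lands within a few percent of the closed-form prediction.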
The mutual information between the query and the retrieved value:

$$I(q; v) \leq d \cdot \alpha^* \log_2\frac{1}{\alpha^*}$$

When attention successfully retrieves the target:

$$\alpha^* = \mathbb{E}[\alpha_1] \approx \frac{R}{R + n - 1}$$

Substituting $\alpha^* \approx \frac{R}{n}$ (where $R = e^{\mu - 1/2}$):

$$I(q; v) \leq d \cdot \frac{R}{n} \log_2\frac{n}{R}$$

This decreases as $O\!\left(\frac{\log n}{n}\right)$, sublinearly approaching zero.
```python
import numpy as np

def information_extracted(d, n, signal_strength=3.0):
    """
    Compute the information extractable by attention
    as a function of context length n.

    Args:
        d: model dimension (value dim per head)
        n: context length (number of tokens)
        signal_strength: mu (relevance score of target key)
    """
    R = np.exp(signal_strength - 0.5)  # Signal-to-noise ratio
    alpha_star = R / (R + n - 1)       # Expected attention on target

    # Mutual information (bits)
    if alpha_star > 1e-10:
        info_bits = d * alpha_star * np.log2(1.0 / alpha_star)
    else:
        info_bits = 0

    # Available information (total bits in context)
    available_bits = n * d * 8  # 8 bits per dimension (rough)

    # Extraction efficiency
    efficiency = info_bits / available_bits if available_bits > 0 else 0

    return {
        "alpha_star": alpha_star,
        "info_bits": info_bits,
        "available_bits": available_bits,
        "efficiency": efficiency,
    }

# Demonstrate the limit
print(f"{'Context n':>12} {'α*':>12} {'Info (bits)':>12} {'Efficiency':>12}")
print("=" * 52)
for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
    result = information_extracted(d=128, n=n, signal_strength=3.0)
    print(f"{n:>12,} {result['alpha_star']:>12.8f} "
          f"{result['info_bits']:>12.4f} {result['efficiency']:>12.2e}")
```

Output:

```
   Context n           α*  Info (bits)   Efficiency
====================================================
         100   0.11004200     105.3321     8.22e-04
       1,000   0.01208915      82.5791     6.45e-05
      10,000   0.00122097      59.8261     4.67e-06
     100,000   0.00012228      37.0731     2.90e-07
   1,000,000   0.00001223      14.3201     1.12e-08
```

Efficiency drops by ~10× for every 10× increase in context length.
Multi-head attention with $h$ heads provides $h$ parallel channels, each with capacity $d_v \cdot H(\alpha^{(i)})$, where $d_v = d_{\text{model}} / h$:

$$I_{\text{multi}} \leq \sum_{i=1}^{h} d_v \, H(\alpha^{(i)}) = d_{\text{model}} \cdot \bar{H}(\alpha)$$

Where $\bar{H}(\alpha)$ is the average entropy across heads.

This increases the total information budget by a constant factor but doesn’t change the asymptotic scaling:

$$I_{\text{multi}} \leq d_{\text{model}} \log_2 n = O(\log n)$$

The gap between extractable ($O(\log n)$) and available ($O(n)$) information still grows as $O(n)$.
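A small sketch of the constant-factor point: splitting $d_{\text{model}}$ across $h$ heads leaves the total per-query budget at $d_{\text{model}} \log_2 n$ bits regardless of $h$, while the bits sitting in the context grow linearly with $n$ (the $b = 8$ bits-per-dimension figure is the same rough assumption as above).

```python
import numpy as np

d_model, n, b = 1024, 100_000, 8
totals = []
for h in [1, 8, 16]:
    d_v = d_model // h                   # per-head value dimension
    totals.append(h * d_v * np.log2(n))  # summed per-head ceilings

available = n * d_model * b              # bits stored in the context itself
```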
Based on the information-theoretic analysis, we can define the effective context length, the maximum number of tokens from which attention can reliably extract information:

$$n_{\text{eff}} = \max\left\{\, n : d \cdot \alpha^*(n) \log_2\frac{1}{\alpha^*(n)} \geq I_{\min} \right\}$$

Where $I_{\min}$ is the minimum number of bits needed per retrieval.
For typical values ($d = 128$, $\mu = 3.0$, and an $I_{\min}$ of a few dozen bits):

$$n_{\text{eff}} \approx 50$$

This suggests that for precise single-token retrieval, attention can effectively discriminate among only ~50 tokens per head. The reason models work better than this in practice is that multi-head attention adds parallel channels and multi-layer processing progressively narrows the search space.
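The effective context length can also be found numerically, reusing the $\alpha^*$ formula from the demo above. The threshold $I_{\min} = 60$ bits is an illustrative assumption, not a derived value:

```python
import numpy as np

def info_bits(d, n, mu=3.0):
    """Extractable bits for one query: d * alpha* * log2(1/alpha*)."""
    R = np.exp(mu - 0.5)
    alpha = R / (R + n - 1)
    return d * alpha * np.log2(1.0 / alpha)

I_min, d = 60.0, 128                # I_min is an assumed retrieval budget
ns = np.arange(2, 10_000)
ok = ns[info_bits(d, ns) >= I_min]  # all n that still meet the budget
n_eff = int(ok.max()) if ok.size else 0
```

Under these assumptions the search lands in the same few-dozen-token range as the estimate above.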
There is no free lunch. Making context windows larger without improving the attention mechanism just increases the information that can’t be extracted.
The optimal strategy is retrieval. Instead of stuffing $n$ tokens and hoping attention finds what it needs ($O\!\left(\frac{\log n}{n}\right)$ efficiency), use an external retrieval system to select the $k$ most relevant tokens ($O\!\left(\frac{\log k}{k}\right)$ efficiency with $k \ll n$).
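A rough comparison in the same toy model shows the gain; the retrieved-set size of 100 tokens is illustrative:

```python
import numpy as np

def efficiency(n, d=128, b=8, mu=3.0):
    """Fraction of the context's n*d*b bits that one query can extract."""
    R = np.exp(mu - 0.5)
    a = R / (R + n - 1)           # expected attention on the target
    return (d * a * np.log2(1.0 / a)) / (n * d * b)

full = efficiency(n=100_000)      # attend over the whole context
retrieved = efficiency(n=100)     # attend over a retrieved candidate set
```

Attending over a well-chosen candidate set is orders of magnitude more efficient per bit of context.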
Multi-layer processing helps. Each layer refines the previous layer’s output, effectively narrowing the search space. This is why deeper models handle longer contexts better — more sequential refinement steps.
The theoretical limit informs architecture design. Approaches like sparse attention, retrieval-augmented attention, and compressive memory are all attempts to work around the capacity bound.
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai