The Information-Theoretic Limits of Context Windows

There are fundamental limits on how much information a fixed-width attention mechanism can extract from n tokens. Here's the math from Shannon's channel capacity to attention bounds.

The Radio Analogy

Imagine listening to a radio with a narrow bandwidth. If one person talks, you hear them clearly. If 10 people talk simultaneously on the same frequency, you hear a garbled mess. The radio’s bandwidth hasn’t changed — it simply can’t separate 10 signals at once.

A transformer’s attention mechanism is like that radio. It has a fixed “bandwidth”, determined by the model dimension $d$ and the number of heads $h$. As you pack more tokens into the context, the mechanism must extract more information through the same fixed-width channel. At some point, it physically can’t keep up.

This post derives these theoretical limits from information theory.

Shannon’s Channel Capacity

Claude Shannon proved in 1948 that every communication channel has a maximum rate at which information can be reliably transmitted:

$$C = \max_{p(x)} I(X; Y)$$

Where:

- $C$ is the channel capacity, in bits per channel use
- $I(X; Y)$ is the mutual information between the channel input $X$ and output $Y$
- the maximum is taken over all input distributions $p(x)$

For a Gaussian channel with signal power $S$ and noise power $N$:

$$C = \frac{1}{2} \log_2\left(1 + \frac{S}{N}\right) \text{ bits per channel use}$$
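To make the formula concrete, here is a small sketch (not from the original post) that evaluates the Gaussian capacity at a few illustrative signal-to-noise ratios:

```python
import numpy as np

def gaussian_capacity(snr):
    """Capacity in bits per channel use for a Gaussian channel with S/N = snr."""
    return 0.5 * np.log2(1.0 + snr)

# Capacity grows only logarithmically with the signal-to-noise ratio.
for snr in [1, 10, 100, 1000]:
    print(f"S/N = {snr:>4}: C = {gaussian_capacity(snr):.3f} bits/use")
```

Note the logarithm: multiplying the signal power by 10 adds a fixed increment of capacity rather than multiplying it.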

Attention as a Communication Channel

The attention mechanism can be viewed as a communication channel:

- the query $q$ is the request for information
- the keys and values are the source messages
- the attention output is the received signal

The “channel” has fixed width $d$ (the value dimension). The output $o \in \mathbb{R}^d$ can carry at most $d \times \log_2(1/\epsilon)$ bits of information, where $\epsilon$ is the precision of each dimension.
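As a quick sanity check on that budget, under an assumed precision of 8 bits per dimension ($\epsilon = 2^{-8}$, an illustrative value, not from the post):

```python
import math

d = 128          # value dimension
eps = 2.0 ** -8  # assumed per-dimension precision (illustrative)

# Maximum bits the d-dimensional output can carry: d * log2(1/eps)
capacity_bits = d * math.log2(1.0 / eps)
print(capacity_bits)  # 1024.0
```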

Information Per Token

For a single query $q$, the information extracted about token $i$ is related to its attention weight $\alpha_i$:

$$I(q; v_i) \propto \alpha_i \times d$$

Since $\sum_i \alpha_i = 1$, the total information extractable is bounded:

$$I_{\text{total}} = d \times H(\alpha) \leq d \times \log_2(n)$$

Where $H(\alpha)$ is the entropy of the attention distribution.
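A short check that $H(\alpha)$ is maximized, at $\log_2 n$ bits, by uniform attention and collapses to zero for one-hot attention (a sketch, not from the post):

```python
import numpy as np

def attention_entropy(alpha):
    """Shannon entropy in bits of an attention distribution."""
    alpha = alpha[alpha > 0]
    return float(-np.sum(alpha * np.log2(alpha)))

n = 1024
uniform = np.full(n, 1.0 / n)   # attention spread evenly over n tokens
one_hot = np.zeros(n)
one_hot[0] = 1.0                # all attention on a single token

print(attention_entropy(uniform))  # 10.0 == log2(1024), the upper bound
print(attention_entropy(one_hot))  # 0.0
```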

The Critical Bound

For $n$ tokens, each containing $d$ bits of information, the total information available is $O(n \cdot d)$.

But the attention mechanism can extract at most $O(d \cdot \log n)$ bits per query.

$$\frac{\text{Extractable information}}{\text{Available information}} = \frac{d \log n}{n \cdot d} = \frac{\log n}{n} \to 0$$

As $n \to \infty$, the fraction of available information that attention can extract approaches zero. This is the fundamental limit.
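The vanishing fraction is easy to tabulate (a quick sketch, constants dropped):

```python
import math

# Fraction of available information that is extractable, up to constants: log2(n) / n
for n in [10**2, 10**4, 10**6, 10**8]:
    print(f"n = {n:>11,}: log2(n)/n = {math.log2(n) / n:.2e}")
```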

Mathematical Derivation

Setup

Consider a query $q$ and keys $K = \{k_1, \ldots, k_n\}$. Let $k_*$ be the single relevant key (target), and let all others be noise drawn from $\mathcal{N}(0, I_d)$.

The attention weight on the target:

$$\alpha_* = \frac{\exp(q \cdot k_* / \sqrt{d})}{\exp(q \cdot k_* / \sqrt{d}) + \sum_{j \neq *} \exp(q \cdot k_j / \sqrt{d})}$$

Expected Attention Weight

If $q \cdot k_*$ has signal strength $\mu$ and noise keys have $q \cdot k_j \sim \mathcal{N}(0, 1)$ (after scaling by $\sqrt{d}$):

$$\mathbb{E}[\alpha_*] \approx \frac{e^\mu}{e^\mu + (n-1) \cdot \mathbb{E}[e^{Z}]}$$

Where $Z \sim \mathcal{N}(0, 1)$ and $\mathbb{E}[e^Z] = e^{1/2}$ (the moment generating function of a standard normal):

$$\mathbb{E}[\alpha_*] \approx \frac{e^\mu}{e^\mu + (n-1) \cdot e^{1/2}}$$

For $n \gg e^{\mu - 1/2}$:

$$\mathbb{E}[\alpha_*] \approx \frac{e^{\mu - 1/2}}{n}$$
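A Monte Carlo check of this approximation (a sketch; the target logit $\mu = 3$ and context size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_target_weight(mu, n, trials=500):
    """Estimate E[alpha_*]: softmax weight of one logit mu among n-1 N(0,1) noise logits."""
    noise_logits = rng.standard_normal((trials, n - 1))
    alpha = np.exp(mu) / (np.exp(mu) + np.exp(noise_logits).sum(axis=1))
    return float(alpha.mean())

mu, n = 3.0, 10_000
simulated = mean_target_weight(mu, n)
predicted = np.exp(mu - 0.5) / n   # the e^(mu - 1/2) / n approximation
print(simulated, predicted)        # both close to 1.2e-3
```

The simulated mean lands within a fraction of a percent of the closed-form approximation at this context length.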

Mutual Information

The mutual information between the query and the retrieved value:

$$I(q; \hat{v}) = H(\hat{v}) - H(\hat{v} \mid q)$$

When attention successfully retrieves the target:

$$I \approx d \cdot \alpha_* \cdot \log_2\left(\frac{1}{\alpha_*}\right)$$

Substituting $\alpha_* \approx R/n$ (where $R = e^{\mu - 1/2}$):

$$I \approx d \cdot \frac{R}{n} \cdot \log_2\left(\frac{n}{R}\right)$$

$$= \frac{d \cdot R}{n} \cdot (\log_2 n - \log_2 R)$$

This decreases as $O(\log n / n)$, sublinearly approaching zero. We can check the numbers:

import numpy as np

def information_extracted(d, n, signal_strength=3.0):
    """
    Compute the information extractable by attention
    as a function of context length n.

    Args:
        d: model dimension (value dim per head)
        n: context length (number of tokens)
        signal_strength: mu (relevance score of target key)
    """
    R = np.exp(signal_strength - 0.5)  # Signal-to-noise ratio
    alpha_star = R / (R + n - 1)       # Expected attention on target

    # Mutual information (bits)
    if alpha_star > 1e-10:
        info_bits = d * alpha_star * np.log2(1.0 / alpha_star)
    else:
        info_bits = 0

    # Available information (total bits in context)
    available_bits = n * d * 8  # 8 bits per dimension (rough)

    # Extraction efficiency
    efficiency = info_bits / available_bits if available_bits > 0 else 0

    return {
        "alpha_star": alpha_star,
        "info_bits": info_bits,
        "available_bits": available_bits,
        "efficiency": efficiency,
    }


# Demonstrate the limit
print(f"{'Context n':>12} {'α*':>12} {'Info (bits)':>12} {'Efficiency':>12}")
print("=" * 52)

for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
    result = information_extracted(d=128, n=n, signal_strength=3.0)
    print(f"{n:>12,} {result['alpha_star']:>12.8f} "
          f"{result['info_bits']:>12.4f} {result['efficiency']:>12.2e}")

Output:

   Context n           α*  Info (bits)   Efficiency
====================================================
         100   0.10957205      44.7411     4.37e-04
       1,000   0.01204777       9.8311     9.60e-06
      10,000   0.00121689       1.5082     1.47e-07
     100,000   0.00012181       0.2027     1.98e-09
   1,000,000   0.00001218       0.0255     2.49e-11

As $n$ grows, the extracted information falls by nearly 10× for every 10× increase in context length ($\propto \log n / n$), and the extraction efficiency by nearly 100× ($\propto \log n / n^2$, since the pool of available bits also grows with $n$).

Multi-Head Attention: Parallel Channels

Multi-head attention provides $h$ parallel channels, each with capacity $d/h$:

$$C_{\text{total}} = h \times C_{\text{per head}} = h \times \frac{d}{h} \cdot H(\alpha^{(h)}) = d \cdot \bar{H}$$

Where $\bar{H}$ is the average entropy across heads.

This increases the total information budget by a constant factor but doesn’t change the asymptotic scaling:

$$C_{\text{total}} = O(d \cdot \log n)$$

The gap between extractable ($O(d \log n)$) and available ($O(nd)$) information still grows as $O(n / \log n)$.
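A small sketch of the head-splitting arithmetic, using random softmax distributions purely for illustration (the sizes $n$, $d$, $h$ are assumed, not from the post):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, h = 4096, 128, 8   # context length, model dim, heads (illustrative)

def entropy_bits(p):
    """Shannon entropy in bits of a probability distribution."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# One random softmax attention distribution per head
logits = rng.standard_normal((h, n))
alphas = np.exp(logits)
alphas /= alphas.sum(axis=1, keepdims=True)

H_bar = float(np.mean([entropy_bits(a) for a in alphas]))
print(f"average head entropy H_bar = {H_bar:.2f} bits (bound: log2 n = {np.log2(n):.0f})")
print(f"total budget d * H_bar     = {d * H_bar:.0f} bits, still O(d log n)")
```

More heads shuffle the budget around, but each head's entropy is still capped at $\log_2 n$, so the total stays $O(d \log n)$.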

The “Effective Context” Theorem

Based on the information-theoretic analysis, we can define the effective context length — the maximum number of tokens from which attention can reliably extract information:

$$n_{\text{eff}} \approx e^{\mu - 1/2} \times \frac{d}{d_{\min}}$$

Where $d_{\min}$ is the minimum number of bits needed per retrieval.

For typical values ($\mu = 3$, $d = 128$, $d_{\min} = 32$):

$$n_{\text{eff}} \approx e^{2.5} \times 4 = 12.2 \times 4 \approx 49$$

This suggests that for precise single-token retrieval, attention can effectively discriminate among only ~50 tokens per head. The reason models work better than this in practice is:

  1. Multiple heads provide parallel channels
  2. Multiple layers provide sequential refinement
  3. Most tasks don’t require exact single-token retrieval
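Plugging the text's numbers into the effective-context formula (a one-line sketch; the values of $\mu$, $d$, and $d_{\min}$ are the ones stated above):

```python
import math

mu, d, d_min = 3.0, 128, 32       # values from the text
n_eff = math.exp(mu - 0.5) * (d / d_min)
print(round(n_eff))  # 49
```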

Implications for Practice

  1. There is no free lunch. Making context windows larger without improving the attention mechanism just increases the information that can’t be extracted.

  2. The optimal strategy is retrieval. Instead of stuffing $n$ tokens and hoping attention finds what it needs ($O(\log n / n)$ efficiency), use an external retrieval system to select the $k \ll n$ most relevant tokens ($O(\log k / k)$ efficiency with $k$ much smaller).

  3. Multi-layer processing helps. Each layer refines the previous layer’s output, effectively narrowing the search space. This is why deeper models handle longer contexts better — more sequential refinement steps.

  4. The theoretical limit informs architecture design. Approaches like sparse attention, retrieval-augmented attention, and compressive memory are all attempts to work around the $O(d \log n)$ capacity bound.
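The retrieval argument in point 2 can be quantified with the same $\log n / n$ proxy (a sketch with constants dropped; the 1M-token window and 1k-token retrieval budget are illustrative):

```python
import math

def efficiency(n):
    """Extraction-efficiency proxy from the bound above: log2(n) / n."""
    return math.log2(n) / n

full_context = 1_000_000   # stuff the whole corpus into the window
retrieved = 1_000          # retrieve only the top-k relevant tokens first

gain = efficiency(retrieved) / efficiency(full_context)
print(f"retrieval improves extraction efficiency by ~{gain:,.0f}x")
```

Under this proxy, narrowing a million-token context to the thousand most relevant tokens improves extraction efficiency by a factor of several hundred.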


ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai