The Information-Theoretic Limits of Context Windows

There are fundamental limits on how much information a fixed-width attention mechanism can extract from n tokens. Here's the math from Shannon's channel capacity to attention bounds.

The Radio Analogy

Imagine listening to a radio with a narrow bandwidth. If one person talks, you hear them clearly. If 10 people talk simultaneously on the same frequency, you hear a garbled mess. The radio’s bandwidth hasn’t changed — it simply can’t separate 10 signals at once.

A transformer’s attention mechanism is like that radio. It has a fixed “bandwidth”, determined by the model dimension $d$ and the number of heads $h$. As you pack more tokens into the context, the mechanism must extract more information through the same fixed-width channel. At some point, it physically can’t keep up.

This post derives these theoretical limits from information theory.

Shannon’s Channel Capacity

Claude Shannon proved in 1948 that every communication channel has a maximum rate at which information can be reliably transmitted:

$$C = \max_{p(x)} I(X; Y)$$

Where:

- $C$ is the channel capacity, in bits per channel use
- $I(X; Y)$ is the mutual information between the channel input $X$ and output $Y$
- the maximum is taken over all input distributions $p(x)$

For a Gaussian channel with signal power $S$ and noise power $N$:

$$C = \frac{1}{2} \log_2\left(1 + \frac{S}{N}\right) \text{ bits per channel use}$$
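To make the formula concrete, here is a small sketch (not from the original post) that evaluates the Gaussian capacity at a few illustrative signal-to-noise ratios:

```python
import numpy as np

def gaussian_capacity(snr):
    """Capacity in bits per channel use for a Gaussian channel with S/N = snr."""
    return 0.5 * np.log2(1.0 + snr)

# Capacity grows only logarithmically with the signal-to-noise ratio.
for snr in [1, 10, 100, 1000]:
    print(f"S/N = {snr:>4}: C = {gaussian_capacity(snr):.3f} bits/use")
```

Note the logarithm: multiplying the signal power by 10 adds a fixed increment of capacity rather than multiplying it.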

Attention as a Communication Channel

The attention mechanism can be viewed as a communication channel:

- the query $q$ is the request for information
- the keys and values are the source messages
- the attention output is the received signal

The “channel” has fixed width $d$ (the value dimension). The output $o \in \mathbb{R}^d$ can carry at most $d \times \log_2(1/\epsilon)$ bits of information, where $\epsilon$ is the precision of each dimension.
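As a quick sanity check on that budget, under an assumed precision of 8 bits per dimension ($\epsilon = 2^{-8}$, an illustrative value, not from the post):

```python
import math

d = 128          # value dimension
eps = 2.0 ** -8  # assumed per-dimension precision (illustrative)

# Maximum bits the d-dimensional output can carry: d * log2(1/eps)
capacity_bits = d * math.log2(1.0 / eps)
print(capacity_bits)  # 1024.0
```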

Information Per Token

For a single query $q$, the information extracted about token $i$ is related to its attention weight $\alpha_i$:

$$I(q; v_i) \propto \alpha_i \times d$$

Since $\sum_i \alpha_i = 1$, the total information extractable is bounded:

$$I_{\text{total}} = d \times H(\alpha) \leq d \times \log_2(n)$$

Where $H(\alpha)$ is the entropy of the attention distribution.
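A short check that $H(\alpha)$ is maximized, at $\log_2 n$ bits, by uniform attention and collapses to zero for one-hot attention (a sketch, not from the post):

```python
import numpy as np

def attention_entropy(alpha):
    """Shannon entropy in bits of an attention distribution."""
    alpha = alpha[alpha > 0]
    return float(-np.sum(alpha * np.log2(alpha)))

n = 1024
uniform = np.full(n, 1.0 / n)   # attention spread evenly over n tokens
one_hot = np.zeros(n)
one_hot[0] = 1.0                # all attention on a single token

print(attention_entropy(uniform))  # 10.0 == log2(1024), the upper bound
print(attention_entropy(one_hot))  # 0.0
```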

The Critical Bound

For $n$ tokens, each containing $d$ bits of information, the total information available is $O(n \cdot d)$.

But the attention mechanism can extract at most $O(d \cdot \log n)$ bits per query.

$$\frac{\text{Extractable information}}{\text{Available information}} = \frac{d \log n}{n \cdot d} = \frac{\log n}{n} \to 0$$

As $n \to \infty$, the fraction of available information that attention can extract approaches zero. This is the fundamental limit.
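The vanishing fraction is easy to tabulate (a quick sketch, constants dropped):

```python
import math

# Fraction of available information that is extractable, up to constants: log2(n) / n
for n in [10**2, 10**4, 10**6, 10**8]:
    print(f"n = {n:>11,}: log2(n)/n = {math.log2(n) / n:.2e}")
```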

Mathematical Derivation

Setup

Consider a query $q$ and keys $K = \{k_1, \ldots, k_n\}$. Let $k_*$ be the single relevant key (target), and let all others be noise drawn from $\mathcal{N}(0, I_d)$.

The attention weight on the target:

$$\alpha_* = \frac{\exp(q \cdot k_* / \sqrt{d})}{\exp(q \cdot k_* / \sqrt{d}) + \sum_{j \neq *} \exp(q \cdot k_j / \sqrt{d})}$$

Expected Attention Weight

If $q \cdot k_*$ has signal strength $\mu$ and noise keys have $q \cdot k_j \sim \mathcal{N}(0, 1)$ (after scaling by $\sqrt{d}$):

$$\mathbb{E}[\alpha_*] \approx \frac{e^\mu}{e^\mu + (n-1) \cdot \mathbb{E}[e^{Z}]}$$

Where $Z \sim \mathcal{N}(0, 1)$ and $\mathbb{E}[e^Z] = e^{1/2}$ (the moment generating function of a standard normal):

$$\mathbb{E}[\alpha_*] \approx \frac{e^\mu}{e^\mu + (n-1) \cdot e^{1/2}}$$

For $n \gg e^{\mu - 1/2}$:

$$\mathbb{E}[\alpha_*] \approx \frac{e^{\mu - 1/2}}{n}$$
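A Monte Carlo check of this approximation (a sketch; the target logit $\mu = 3$ and context size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_target_weight(mu, n, trials=500):
    """Estimate E[alpha_*]: softmax weight of one logit mu among n-1 N(0,1) noise logits."""
    noise_logits = rng.standard_normal((trials, n - 1))
    alpha = np.exp(mu) / (np.exp(mu) + np.exp(noise_logits).sum(axis=1))
    return float(alpha.mean())

mu, n = 3.0, 10_000
simulated = mean_target_weight(mu, n)
predicted = np.exp(mu - 0.5) / n   # the e^(mu - 1/2) / n approximation
print(simulated, predicted)        # both close to 1.2e-3
```

The simulated mean lands within a fraction of a percent of the closed-form approximation at this context length.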

Mutual Information

The mutual information between the query and the retrieved value:

$$I(q; \hat{v}) = H(\hat{v}) - H(\hat{v} \mid q)$$

When attention successfully retrieves the target:

$$I \approx d \cdot \alpha_* \cdot \log_2\left(\frac{1}{\alpha_*}\right)$$

Substituting $\alpha_* \approx R/n$ (where $R = e^{\mu - 1/2}$):

$$I \approx d \cdot \frac{R}{n} \cdot \log_2\left(\frac{n}{R}\right)$$

$$= \frac{d \cdot R}{n} \cdot (\log_2 n - \log_2 R)$$

This decreases as $O(\log n / n)$, sublinearly approaching zero. We can check the numbers:

import numpy as np

def information_extracted(d, n, signal_strength=3.0):
    """
    Compute the information extractable by attention
    as a function of context length n.

    Args:
        d: model dimension (value dim per head)
        n: context length (number of tokens)
        signal_strength: mu (relevance score of target key)
    """
    R = np.exp(signal_strength - 0.5)  # Signal-to-noise ratio
    alpha_star = R / (R + n - 1)       # Expected attention on target

    # Mutual information (bits)
    if alpha_star > 1e-10:
        info_bits = d * alpha_star * np.log2(1.0 / alpha_star)
    else:
        info_bits = 0

    # Available information (total bits in context)
    available_bits = n * d * 8  # 8 bits per dimension (rough)

    # Extraction efficiency
    efficiency = info_bits / available_bits if available_bits > 0 else 0

    return {
        "alpha_star": alpha_star,
        "info_bits": info_bits,
        "available_bits": available_bits,
        "efficiency": efficiency,
    }


# Demonstrate the limit
print(f"{'Context n':>12} {'α*':>12} {'Info (bits)':>12} {'Efficiency':>12}")
print("=" * 52)

for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
    result = information_extracted(d=128, n=n, signal_strength=3.0)
    print(f"{n:>12,} {result['alpha_star']:>12.8f} "
          f"{result['info_bits']:>12.4f} {result['efficiency']:>12.2e}")

Output:

   Context n           α*  Info (bits)   Efficiency
====================================================
         100   0.10957205      44.7411     4.37e-04
       1,000   0.01204777       9.8311     9.60e-06
      10,000   0.00121689       1.5082     1.47e-07
     100,000   0.00012181       0.2027     1.98e-09
   1,000,000   0.00001218       0.0255     2.49e-11

As $n$ grows, the extracted information falls by nearly 10× for every 10× increase in context length ($\propto \log n / n$), and the extraction efficiency by nearly 100× ($\propto \log n / n^2$, since the pool of available bits also grows with $n$).

Multi-Head Attention: Parallel Channels

Multi-head attention provides $h$ parallel channels, each with capacity $d/h$:

$$C_{\text{total}} = h \times C_{\text{per head}} = h \times \frac{d}{h} \cdot H(\alpha^{(h)}) = d \cdot \bar{H}$$

Where $\bar{H}$ is the average entropy across heads.

This increases the total information budget by a constant factor but doesn’t change the asymptotic scaling:

$$C_{\text{total}} = O(d \cdot \log n)$$

The gap between extractable ($O(d \log n)$) and available ($O(nd)$) information still grows as $O(n / \log n)$.
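A small sketch of the head-splitting arithmetic, using random softmax distributions purely for illustration (the sizes $n$, $d$, $h$ are assumed, not from the post):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, h = 4096, 128, 8   # context length, model dim, heads (illustrative)

def entropy_bits(p):
    """Shannon entropy in bits of a probability distribution."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# One random softmax attention distribution per head
logits = rng.standard_normal((h, n))
alphas = np.exp(logits)
alphas /= alphas.sum(axis=1, keepdims=True)

H_bar = float(np.mean([entropy_bits(a) for a in alphas]))
print(f"average head entropy H_bar = {H_bar:.2f} bits (bound: log2 n = {np.log2(n):.0f})")
print(f"total budget d * H_bar     = {d * H_bar:.0f} bits, still O(d log n)")
```

More heads shuffle the budget around, but each head's entropy is still capped at $\log_2 n$, so the total stays $O(d \log n)$.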

The “Effective Context” Theorem

Based on the information-theoretic analysis, we can define the effective context length — the maximum number of tokens from which attention can reliably extract information:

$$n_{\text{eff}} \approx e^{\mu - 1/2} \times \frac{d}{d_{\min}}$$

Where $d_{\min}$ is the minimum number of bits needed per retrieval.

For typical values ($\mu = 3$, $d = 128$, $d_{\min} = 32$):

$$n_{\text{eff}} \approx e^{2.5} \times 4 = 12.2 \times 4 \approx 49$$

This suggests that for precise single-token retrieval, attention can effectively discriminate among only ~50 tokens per head. The reason models work better than this in practice is:

  1. Multiple heads provide parallel channels
  2. Multiple layers provide sequential refinement
  3. Most tasks don’t require exact single-token retrieval
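Plugging the text's numbers into the effective-context formula (a one-line sketch; the values of $\mu$, $d$, and $d_{\min}$ are the ones stated above):

```python
import math

mu, d, d_min = 3.0, 128, 32       # values from the text
n_eff = math.exp(mu - 0.5) * (d / d_min)
print(round(n_eff))  # 49
```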

Implications for Practice

  1. There is no free lunch. Making context windows larger without improving the attention mechanism just increases the information that can’t be extracted.

  2. The optimal strategy is retrieval. Instead of stuffing $n$ tokens and hoping attention finds what it needs ($O(\log n / n)$ efficiency), use an external retrieval system to select the $k \ll n$ most relevant tokens ($O(\log k / k)$ efficiency with $k$ much smaller).

  3. Multi-layer processing helps. Each layer refines the previous layer’s output, effectively narrowing the search space. This is why deeper models handle longer contexts better — more sequential refinement steps.

  4. The theoretical limit informs architecture design. Approaches like sparse attention, retrieval-augmented attention, and compressive memory are all attempts to work around the $O(d \log n)$ capacity bound.
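The retrieval argument in point 2 can be quantified with the same $\log n / n$ proxy (a sketch with constants dropped; the 1M-token window and 1k-token retrieval budget are illustrative):

```python
import math

def efficiency(n):
    """Extraction-efficiency proxy from the bound above: log2(n) / n."""
    return math.log2(n) / n

full_context = 1_000_000   # stuff the whole corpus into the window
retrieved = 1_000          # retrieve only the top-k relevant tokens first

gain = efficiency(retrieved) / efficiency(full_context)
print(f"retrieval improves extraction efficiency by ~{gain:,.0f}x")
```

Under this proxy, narrowing a million-token context to the thousand most relevant tokens improves extraction efficiency by a factor of several hundred.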


ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai