Multiple research approaches attack the quadratic bottleneck: Longformer, Reformer, Linformer, and linear attention. Here's the math behind each one.
At a party of 100 people, having everyone talk to everyone requires 4,950 conversations. But do you really need all of them? What if each guest only chatted with the people standing nearby, and a few hosts circulated to keep the whole room in sync?
This is the core idea behind sparse attention: don’t compute all interactions. Only compute the ones that matter.
Full self-attention:

Attention(Q, K, V) = softmax(QKᵀ / √d) · V

Complexity: O(n²·d) compute, O(n²) memory for the attention matrix.
Every sparse attention method asks: “Which entries of the n × n attention matrix can we skip?”
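As a baseline, here is what every method below is competing against, as a minimal single-head NumPy sketch (no batching, no masking):

```python
import numpy as np

def full_attention(Q, K, V):
    """Vanilla softmax attention: materializes the full n x n score matrix."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)  # (n, n): the O(n^2) bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V             # (n, d)
```

Each technique below is a different answer to the question of how to avoid ever forming that (n, n) `scores` array.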
Longformer (Beltagy et al., 2020) combines sliding window attention with global tokens.
Each token attends only to a window of w neighbors on each side, i.e. token i attends to positions i − w through i + w.

Complexity: O(n·w) instead of O(n²).
Selected tokens (e.g., [CLS] or other task-specific tokens) attend to ALL tokens, and all tokens attend to them.

If there are g global tokens, total complexity is O(n·w + n·g).
For w = 512 and a handful of global tokens, at n = 100,000 that is roughly 5 × 10⁷ attention entries instead of n² = 10¹⁰:
~200× reduction!
```python
import numpy as np

def longformer_attention_mask(n: int, window_size: int,
                              global_indices: list) -> np.ndarray:
    """
    Create a Longformer-style attention mask.
    1 = can attend, 0 = cannot attend
    """
    mask = np.zeros((n, n), dtype=np.float32)

    # Sliding window: each token attends to ±window_size neighbors
    for i in range(n):
        left = max(0, i - window_size)
        right = min(n, i + window_size + 1)
        mask[i, left:right] = 1.0

    # Global tokens: attend to and from everywhere
    for g in global_indices:
        mask[g, :] = 1.0  # Global token attends to all
        mask[:, g] = 1.0  # All tokens attend to global

    return mask

# Example: 20 tokens, window=3, 2 global tokens
mask = longformer_attention_mask(20, window_size=3, global_indices=[0, 19])
density = mask.sum() / (20 * 20)
print(f"Attention density: {density:.1%} (vs 100% for full attention)")
```

Linformer (Wang et al., 2020) observes that the attention matrix is approximately low-rank: most of its information can be captured by a much smaller matrix.
Instead of computing the full n × n score matrix, project K and V down to a small dimension k ≪ n:

K′ = E·K, V′ = F·V

where E, F ∈ ℝ^(k×n) are learned projection matrices.

Attention becomes:

softmax(Q·K′ᵀ / √d) · V′

Now the score matrix is n × k instead of n × n.

Complexity: O(n·k·d), linear in n!
```python
import numpy as np

def linformer_attention(Q, K, V, projection_dim=256):
    """
    Linformer: project K and V to a lower dimension k.
    Q: (n, d), K: (n, d), V: (n, d)
    Output: (n, d)
    Complexity: O(n * k * d) instead of O(n² * d)
    """
    n, d = Q.shape
    k = projection_dim

    # Learned projection matrices (initialized randomly here)
    E_K = np.random.randn(k, n) / np.sqrt(k)
    E_V = np.random.randn(k, n) / np.sqrt(k)

    # Project K and V: (k, n) @ (n, d) = (k, d)
    K_proj = E_K @ K  # (k, d)
    V_proj = E_V @ V  # (k, d)

    # Scores: Q @ K_proj^T = (n, d) @ (d, k) = (n, k), linear in n!
    scores = Q @ K_proj.T / np.sqrt(d)  # (n, k) instead of (n, n)

    # Softmax over k (much smaller than n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Output: (n, k) @ (k, d) = (n, d)
    output = weights @ V_proj
    return output
```

Reformer (Kitaev et al., 2020) uses Locality-Sensitive Hashing (LSH) to group similar queries and keys, then only computes attention within groups.
The idea: if qᵢ and kⱼ would have a high attention score, they point in similar directions, so their hash values are likely to collide.

Complexity: O(n log n) with high probability.
Tokens with the same hash are grouped into buckets. Attention is computed only within buckets.
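The bucketing step can be sketched with a single round of angular LSH (an illustrative toy, not Reformer's actual multi-round random-rotation scheme; `lsh_buckets` is a name made up for this example):

```python
import numpy as np

def lsh_buckets(X, n_buckets, seed=0):
    """Angular LSH sketch: project rows of X onto random directions and take
    the argmax over [h, -h]. Vectors with similar direction tend to collide."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], n_buckets // 2))
    h = X @ R  # (n, n_buckets // 2)
    return np.argmax(np.concatenate([h, -h], axis=-1), axis=-1)

# Group 16 token vectors into 4 buckets; attention would then be
# restricted to pairs within the same bucket.
X = np.random.default_rng(1).standard_normal((16, 8))
buckets = lsh_buckets(X, n_buckets=4)
for b in np.unique(buckets):
    members = np.flatnonzero(buckets == b)  # attend only within this bucket
```

Because the hash is deterministic given the random directions, identical vectors always land in the same bucket, and near-identical ones almost always do.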
Linear attention (Katharopoulos et al., 2020) replaces the softmax with a kernel feature map φ:

Standard: Attentionᵢ = Σⱼ exp(qᵢ·kⱼ) vⱼ / Σⱼ exp(qᵢ·kⱼ)

Linear: Attentionᵢ = Σⱼ (φ(qᵢ)ᵀφ(kⱼ)) vⱼ / Σⱼ φ(qᵢ)ᵀφ(kⱼ) = φ(qᵢ)ᵀ(Σⱼ φ(kⱼ)vⱼᵀ) / φ(qᵢ)ᵀ(Σⱼ φ(kⱼ))

The key trick is changing the order of operations:

Standard: (QKᵀ)V → compute QKᵀ first (an n × n matrix), then multiply by V

Linear: φ(Q)(φ(K)ᵀV) → compute φ(K)ᵀV first (a d × d matrix), then multiply by φ(Q)

Since d ≪ n for long contexts, this is a massive improvement!
```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """
    Linear attention using the ELU+1 kernel.
    Instead of softmax(QK^T)V, compute Q'(K'^T V).
    Complexity: O(n * d²) instead of O(n² * d)
    """
    # Kernel feature map: φ(x) = ELU(x) + 1 (always positive)
    def phi(x):
        return np.maximum(x, 0) + np.minimum(np.exp(x) - 1, 0) + 1

    Q_prime = phi(Q)  # (n, d)
    K_prime = phi(K)  # (n, d)

    # Key insight: compute K'^T V first (a d × d matrix, independent of n!)
    KV = K_prime.T @ V  # (d, d)

    # Then multiply by Q'
    numerator = Q_prime @ KV  # (n, d)

    # Normalization (replaces the softmax denominator)
    K_sum = K_prime.sum(axis=0, keepdims=True)  # (1, d)
    denominator = Q_prime @ K_sum.T  # (n, 1)

    output = numerator / (denominator + eps)
    return output

# Compare complexities
n_values = [1000, 10000, 100000, 1000000]
d = 128
print(f"{'n':>10} {'O(n²d)':>15} {'O(nd²)':>15} {'Speedup':>10}")
print("=" * 55)
for n in n_values:
    quad = n * n * d
    linear = n * d * d
    print(f"{n:>10,} {quad:>15,.0f} {linear:>15,.0f} {quad/linear:>10,.0f}×")
```

Output:
```
         n          O(n²d)          O(nd²)    Speedup
=======================================================
     1,000     128,000,000      16,384,000          8×
    10,000  12,800,000,000     163,840,000         78×
   100,000 1,280,000,000,000   1,638,400,000        781×
 1,000,000 128,000,000,000,000  16,384,000,000      7,812×
```

| Method | Complexity | Memory | Quality vs Full | Used In |
|---|---|---|---|---|
| Full attention | O(n²·d) | O(n²) | Baseline | GPT-4, Claude |
| Longformer | O(n·w·d) | O(n·w) | ~95-98% | Long document tasks |
| Linformer | O(n·k·d) | O(n·k) | ~93-96% | Research |
| Reformer | O(n·log n·d) | O(n·log n) | ~92-95% | Research |
| Linear attention | O(n·d²) | O(n·d) | ~88-93% | Mamba-hybrid models |
| Flash Attention | O(n²·d) | O(n) | 100% (exact) | Everywhere |
Note: Flash Attention doesn’t reduce compute — it reduces memory. Sparse methods reduce both but sacrifice quality.
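The memory point can be made concrete with a streaming (online) softmax: process the keys one block at a time, keeping running maxima and denominators, so the n × n score matrix never exists. This is a NumPy sketch of the algorithmic idea only; the real Flash Attention is a fused GPU kernel:

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Exact attention computed one block of keys at a time (online softmax).
    Same FLOPs as full attention, but only an (n, block) score tile in memory."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    denom = np.zeros(n)          # running softmax denominator per query
    m = np.full(n, -np.inf)      # running row-wise max (numerical stability)
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j + block].T / np.sqrt(d)  # (n, block) tile only
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)              # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        denom = denom * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ V[j:j + block]
        m = m_new
    return out / denom[:, None]
```

The result matches full attention to floating-point precision, which is exactly the "100% (exact)" row in the table above.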
Despite these innovations, most production LLMs (GPT-4, Claude, Gemini, Llama) use full attention with Flash Attention rather than sparse alternatives. Why? Because sparse methods trade a few points of quality for speed, and Flash Attention already removes the memory bottleneck, exact attention on efficient kernels usually wins in practice.
The future likely isn’t “sparse attention replaces full attention” but rather “intelligent combination of both.”
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai