Sparse Attention Architectures: Breaking O(n²) with O(n√n), O(n·log n), and O(n)

Multiple research approaches attack the quadratic bottleneck: Longformer, Reformer, Linformer, and linear attention. Here's the math behind each one.

Sparse Attention: Breaking the Quadratic Barrier

The Networking Analogy

At a party of 100 people, having everyone talk to everyone requires $100^2 = 10{,}000$ conversations. But do you really need all of them? What if most guests only talked to their nearest neighbors, while a few well-connected hosts relayed news to everyone?

This is the core idea behind sparse attention: don’t compute all $n^2$ interactions. Only compute the ones that matter.

The Standard Attention Baseline

Full self-attention:

$$A = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)$$

Complexity: $O(n^2 d)$ compute, $O(n^2)$ memory for the attention matrix.

Every sparse attention method asks: “Which entries of the $n \times n$ matrix can we skip?”
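As a concrete baseline, the formula above can be written directly in NumPy (a minimal sketch; `full_attention` is a name used here for illustration):

```python
import numpy as np

def full_attention(Q, K, V):
    """Standard softmax attention: O(n^2 * d) compute, O(n^2) memory.

    Q, K, V: (n, d) arrays. Returns an (n, d) array.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                 # (n, n) -- the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (n, d)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
out = full_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

Every method below is a different answer to how much of that $(n, n)$ `scores` matrix can be avoided.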

Longformer: Local + Global Attention

Longformer (Beltagy et al., 2020) combines sliding window attention with global tokens.

Sliding Window (Local) Attention

Each token attends only to a window of $w$ neighbors on each side:

$$A^{\text{local}}_{ij} = \begin{cases} \text{softmax}(q_i \cdot k_j / \sqrt{d}) & \text{if } |i - j| \leq w \\ 0 & \text{otherwise} \end{cases}$$

Complexity: $O(n \cdot w)$ instead of $O(n^2)$.

Global Attention

Selected tokens (e.g., [CLS], task-specific tokens) attend to ALL tokens:

$$A^{\text{global}}_{ij} = \text{softmax}(q_i \cdot k_j / \sqrt{d}) \quad \text{for global token } i$$

If there are $g$ global tokens: $O(n \cdot g)$.

Total Longformer Complexity

$$O(n \cdot (w + g))$$

For $w = 512$ and $g = 8$, at $n = 100{,}000$:

$$\text{Full attention: } 100{,}000^2 = 10^{10}$$

$$\text{Longformer: } 100{,}000 \times (512 + 8) = 5.2 \times 10^{7}$$

~200× reduction!

import numpy as np

def longformer_attention_mask(n: int, window_size: int,
                               global_indices: list) -> np.ndarray:
    """
    Create a Longformer-style attention mask.

    1 = can attend, 0 = cannot attend
    """
    mask = np.zeros((n, n), dtype=np.float32)

    # Sliding window: each token attends to ±window_size
    for i in range(n):
        left = max(0, i - window_size)
        right = min(n, i + window_size + 1)
        mask[i, left:right] = 1.0

    # Global tokens: attend to and from everywhere
    for g in global_indices:
        mask[g, :] = 1.0  # Global token attends to all
        mask[:, g] = 1.0  # All tokens attend to global

    return mask

# Example: 20 tokens, window=3, 2 global tokens
mask = longformer_attention_mask(20, window_size=3, global_indices=[0, 19])
density = mask.sum() / (20 * 20)
print(f"Attention density: {density:.1%} (vs 100% for full attention)")
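To use such a mask in an actual attention computation, disallowed entries are set to $-\infty$ before the softmax so they receive zero weight. A sketch (`masked_attention` is a helper introduced here, not part of the Longformer reference implementation; the mask is rebuilt inline to keep the example self-contained):

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Softmax attention restricted to positions where mask == 1.

    Disallowed scores are set to -inf, so exp() gives them zero weight.
    """
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask > 0, scores, -np.inf)   # block disallowed pairs
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Rebuild the example mask inline (n=20, window=3, global tokens 0 and 19)
n, w = 20, 3
mask = np.zeros((n, n), dtype=np.float32)
for i in range(n):
    mask[i, max(0, i - w):min(n, i + w + 1)] = 1.0
for g in (0, 19):
    mask[g, :] = 1.0
    mask[:, g] = 1.0

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((20, 8)) for _ in range(3))
out = masked_attention(Q, K, V, mask)
print(out.shape)  # (20, 8)
```

Note that this toy version still materializes the full $n \times n$ score matrix; real implementations exploit the band structure so that skipped entries are never computed at all.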

Linformer: Low-Rank Projection

Linformer (Wang et al., 2020) observes that the attention matrix is approximately low-rank — meaning most of its information can be captured by a much smaller matrix.
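This observation is easy to check empirically: build a full attention matrix and look at its singular-value spectrum. Below is a small illustrative experiment with random inputs, not the paper's analysis; the exact energy fraction depends on the data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 64
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))

# Full (n, n) attention matrix
scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)

# Singular values come back in descending order
s = np.linalg.svd(A, compute_uv=False)
k = 32
energy = (s[:k] ** 2).sum() / (s ** 2).sum()
print(f"top {k}/{n} singular values capture {energy:.1%} of spectral energy")
```

If most of the spectral energy sits in the top $k \ll n$ singular values, a rank-$k$ approximation of the attention matrix loses little, which is exactly what Linformer's projections exploit.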

Instead of computing $QK^T \in \mathbb{R}^{n \times n}$, project $K$ and $V$ down to dimension $k \ll n$:

$$\tilde{K} = E_K \cdot K \in \mathbb{R}^{k \times d}, \quad \tilde{V} = E_V \cdot V \in \mathbb{R}^{k \times d}$$

where $E_K, E_V \in \mathbb{R}^{k \times n}$ are learned projection matrices.

Attention becomes:

$$\text{Attention} = \text{softmax}\left(\frac{Q \tilde{K}^T}{\sqrt{d}}\right) \tilde{V}$$

Now $Q\tilde{K}^T \in \mathbb{R}^{n \times k}$ instead of $\mathbb{R}^{n \times n}$.

Complexity: $O(n \cdot k \cdot d)$ — linear in $n$!

def linformer_attention(Q, K, V, projection_dim=256):
    """
    Linformer: project K and V to lower dimension.

    Q: (n, d), K: (n, d), V: (n, d)
    Output: (n, d)
    Complexity: O(n * k * d) instead of O(n² * d)
    """
    n, d = Q.shape
    k = projection_dim

    # Learned projection matrices (initialized randomly here)
    E_K = np.random.randn(k, n) / np.sqrt(k)
    E_V = np.random.randn(k, n) / np.sqrt(k)

    # Project K and V: (k, n) @ (n, d) = (k, d)
    K_proj = E_K @ K  # (k, d)
    V_proj = E_V @ V  # (k, d)

    # Attention: Q @ K_proj^T = (n, d) @ (d, k) = (n, k) — LINEAR!
    scores = Q @ K_proj.T / np.sqrt(d)  # (n, k) instead of (n, n)

    # Softmax over k (much smaller than n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Output: (n, k) @ (k, d) = (n, d)
    output = weights @ V_proj

    return output

Reformer: LSH Attention

Reformer (Kitaev et al., 2020) uses Locality-Sensitive Hashing (LSH) to group similar queries and keys, then only computes attention within groups.

The idea: if $q_i$ and $k_j$ would have a high attention score, their hash values are likely to collide.

Complexity: $O(n \cdot \log n)$ with high probability.

$$h(x) = \text{sign}(Rx) \quad \text{where } R \text{ is a random matrix}$$

Tokens with the same hash are grouped into buckets. Attention is computed only within buckets.
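Reformer's actual algorithm sorts tokens by bucket, chunks them, and hashes multiple times to reduce the chance of missing important pairs. The sketch below (`lsh_bucket_attention` is an illustrative name) keeps only the core idea: hash with $\text{sign}(Rx)$, then attend within each bucket:

```python
import numpy as np

def lsh_bucket_attention(Q, K, V, n_bits=3, seed=0):
    """Toy LSH attention: hash with sign(Rx), attend only within buckets.

    Simplification of Reformer: one hash round, no chunking or multi-round
    hashing, so some query/key pairs are simply dropped.
    """
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((d, n_bits))

    # h(x) = sign(Rx), packed into an integer bucket id per token
    bits = (K @ R > 0).astype(int)
    buckets = bits @ (2 ** np.arange(n_bits))

    out = np.zeros_like(V)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        q, k, v = Q[idx], K[idx], V[idx]
        scores = q @ k.T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[idx] = w @ v
    return out

rng = np.random.default_rng(3)
X = rng.standard_normal((64, 16))
out = lsh_bucket_attention(X, X, X)  # Reformer ties queries and keys
print(out.shape)  # (64, 16)
```

With $2^{\text{n\_bits}}$ roughly equal buckets, each token attends to about $n / 2^{\text{n\_bits}}$ others, which is where the subquadratic cost comes from.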

Linear Attention: The O(n) Solution

Linear attention (Katharopoulos et al., 2020) replaces softmax with a kernel function $\phi$:

Standard: $\text{softmax}(QK^T)V$

Linear: $\phi(Q)(\phi(K)^T V)$

The key trick is changing the order of operations:

Standard: $(QK^T)V$ → compute $QK^T$ first ($n \times n$), then multiply by $V$

Linear: $Q(K^T V)$ → compute $K^T V$ first ($d \times d$), then multiply by $Q$

$$\text{Standard: } \underbrace{(QK^T)}_{n \times n} V \quad O(n^2 d)$$

$$\text{Linear: } Q \underbrace{(K^T V)}_{d \times d} \quad O(n d^2)$$

Since $d \ll n$ for long contexts, this is a massive improvement!

def linear_attention(Q, K, V, eps=1e-6):
    """
    Linear attention using ELU+1 kernel.

    Instead of softmax(QK^T)V, compute Q(K^T V).
    Complexity: O(n * d²) instead of O(n² * d)
    """
    # Kernel function: φ(x) = ELU(x) + 1 (always positive, so weights stay positive)
    def phi(x):
        # Equivalent to ELU(x) + 1; np.minimum avoids exp overflow for large x
        return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0)))

    Q_prime = phi(Q)  # (n, d)
    K_prime = phi(K)  # (n, d)

    # Key insight: compute K^T V first (d × d matrix)
    KV = K_prime.T @ V  # (d, d) — independent of n!

    # Then multiply by Q
    numerator = Q_prime @ KV  # (n, d)

    # Normalization (replaces softmax denominator)
    K_sum = K_prime.sum(axis=0, keepdims=True)  # (1, d)
    denominator = Q_prime @ K_sum.T  # (n, 1)

    output = numerator / (denominator + eps)
    return output

# Compare complexities
n_values = [1000, 10000, 100000, 1000000]
d = 128

print(f"{'n':>10} {'O(n²d)':>15} {'O(nd²)':>15} {'Speedup':>10}")
print("=" * 55)
for n in n_values:
    quad = n * n * d
    linear = n * d * d
    print(f"{n:>10,} {quad:>15,.0f} {linear:>15,.0f} {quad/linear:>10,.0f}×")

Output:

         n          O(n²d)          O(nd²)    Speedup
=======================================================
     1,000     128,000,000      16,384,000          8×
    10,000  12,800,000,000     163,840,000         78×
   100,000 1,280,000,000,000   1,638,400,000        781×
 1,000,000 128,000,000,000,000  16,384,000,000      7,812×

Comparison Table

| Method | Complexity | Memory | Quality vs Full | Used In |
|---|---|---|---|---|
| Full attention | $O(n^2 d)$ | $O(n^2)$ | Baseline | GPT-4, Claude |
| Longformer | $O(n(w+g))$ | $O(n(w+g))$ | ~95-98% | Long document tasks |
| Linformer | $O(nkd)$ | $O(nk)$ | ~93-96% | Research |
| Reformer | $O(n \log n)$ | $O(n \log n)$ | ~92-95% | Research |
| Linear attention | $O(nd^2)$ | $O(nd)$ | ~88-93% | Mamba-hybrid models |
| Flash Attention | $O(n^2 d)$ | $O(n)$ | 100% (exact) | Everywhere |

Note: Flash Attention doesn’t reduce compute — it reduces memory. Sparse methods reduce both but sacrifice quality.
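The memory saving comes from processing keys and values in tiles with a running ("online") softmax, so the $n \times n$ matrix is never materialized. Below is a single-query sketch of that recurrence (`online_softmax_attention` is illustrative, not the actual GPU kernel, which also tiles queries and works in SRAM):

```python
import numpy as np

def online_softmax_attention(q, K, V, tile=4):
    """Exact softmax attention for one query, processing K/V in tiles.

    Keeps only a running max m, running normalizer s, and running
    weighted sum acc -- O(d) memory instead of O(n).
    """
    d = q.shape[0]
    m = -np.inf          # running max of scores seen so far
    s = 0.0              # running sum of exp(score - m)
    acc = np.zeros(d)    # running weighted sum of values
    for start in range(0, K.shape[0], tile):
        k, v = K[start:start+tile], V[start:start+tile]
        scores = k @ q / np.sqrt(d)            # (tile,)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)              # rescale old accumulators
        s = s * scale + np.exp(scores - m_new).sum()
        acc = acc * scale + np.exp(scores - m_new) @ v
        m = m_new
    return acc / s

rng = np.random.default_rng(4)
q = rng.standard_normal(8)
K = rng.standard_normal((32, 8))
V = rng.standard_normal((32, 8))

# Matches the naive computation exactly (up to float error)
scores = K @ q / np.sqrt(8)
w = np.exp(scores - scores.max())
ref = (w / w.sum()) @ V
print(np.allclose(online_softmax_attention(q, K, V), ref))  # True
```

Because the recurrence rescales the old accumulators whenever a new maximum appears, the result is the exact softmax output, which is why Flash Attention loses no quality.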

Why Production Models Still Use Full Attention

Despite these innovations, most production LLMs (GPT-4, Claude, Gemini, Llama) use full attention with Flash Attention rather than sparse alternatives. Why?

  1. Quality gap matters at scale. A 2-5% accuracy drop seems small but compounds across billions of queries.
  2. Flash Attention solved the memory problem. The main motivation for sparse attention was memory, and Flash Attention handles that.
  3. Hardware is improving faster than the quality gap is closing. H100s are 2-3× faster than A100s; the next generation will be faster still.
  4. Hybrid approaches are emerging. Models like Jamba use attention for some layers and linear attention (Mamba) for others — getting the best of both worlds.

The future likely isn’t “sparse attention replaces full attention” but rather “intelligent combination of both.”


ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai