Sparse Attention: Breaking the Quadratic Barrier
The Networking Analogy
At a party of 100 people, having everyone talk to everyone requires conversations. But do you really need all of them? What if:
- Everyone talks to their 5 nearest neighbors → 500 conversations (local attention)
- Plus, 3 designated hosts talk to everyone → 300 more (global attention)
- Total: 800 instead of 10,000 — a 12.5× reduction
This is the core idea behind sparse attention: don’t compute all interactions. Only compute the ones that matter.
The Standard Attention Baseline
Full self-attention:
Complexity: compute, memory for the attention matrix.
Every sparse attention method asks: “Which entries of the matrix can we skip?”
Longformer: Local + Global Attention
Longformer (Beltagy et al., 2020) combines sliding window attention with global tokens.
Sliding Window (Local) Attention
Each token attends only to a window of neighbors on each side:
Complexity: instead of .
Global Attention
Selected tokens (e.g., [CLS], task-specific tokens) attend to ALL tokens:
If there are global tokens: .
Total Longformer Complexity
For and , at :
~200× reduction!
import numpy as np
def longformer_attention_mask(n: int, window_size: int,
global_indices: list) -> np.ndarray:
"""
Create a Longformer-style attention mask.
1 = can attend, 0 = cannot attend
"""
mask = np.zeros((n, n), dtype=np.float32)
# Sliding window: each token attends to ±window_size
for i in range(n):
left = max(0, i - window_size)
right = min(n, i + window_size + 1)
mask[i, left:right] = 1.0
# Global tokens: attend to and from everywhere
for g in global_indices:
mask[g, :] = 1.0 # Global token attends to all
mask[:, g] = 1.0 # All tokens attend to global
return mask
# Example: 20 tokens, window=3, 2 global tokens
mask = longformer_attention_mask(20, window_size=3, global_indices=[0, 19])
density = mask.sum() / (20 * 20)
print(f"Attention density: {density:.1%} (vs 100% for full attention)")Linformer: Low-Rank Projection
Linformer (Wang et al., 2020) observes that the attention matrix is approximately low-rank — meaning most of its information can be captured by a much smaller matrix.
Instead of computing , project K and V down to dimension :
Where are learned projection matrices.
Attention becomes:
Now instead of .
Complexity: — linear in !
def linformer_attention(Q, K, V, projection_dim=256):
"""
Linformer: project K and V to lower dimension.
Q: (n, d), K: (n, d), V: (n, d)
Output: (n, d)
Complexity: O(n * k * d) instead of O(n² * d)
"""
n, d = Q.shape
k = projection_dim
# Learned projection matrices (initialized randomly here)
E_K = np.random.randn(k, n) / np.sqrt(k)
E_V = np.random.randn(k, n) / np.sqrt(k)
# Project K and V: (k, n) @ (n, d) = (k, d)
K_proj = E_K @ K # (k, d)
V_proj = E_V @ V # (k, d)
# Attention: Q @ K_proj^T = (n, d) @ (d, k) = (n, k) — LINEAR!
scores = Q @ K_proj.T / np.sqrt(d) # (n, k) instead of (n, n)
# Softmax over k (much smaller than n)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
# Output: (n, k) @ (k, d) = (n, d)
output = weights @ V_proj
return outputReformer: LSH Attention
Reformer (Kitaev et al., 2020) uses Locality-Sensitive Hashing (LSH) to group similar queries and keys, then only computes attention within groups.
The idea: if and would have a high attention score, their hash values are likely to collide.
Complexity: with high probability.
Tokens with the same hash are grouped into buckets. Attention is computed only within buckets.
Linear Attention: The O(n) Solution
Linear attention (Katharopoulos et al., 2020) replaces softmax with a kernel function :
Standard:
Linear:
The key trick is changing the order of operations:
Standard: → compute first (), then multiply by
Linear: → compute first (), then multiply by
Since for long contexts, this is a massive improvement!
def linear_attention(Q, K, V, eps=1e-6):
"""
Linear attention using ELU+1 kernel.
Instead of softmax(QK^T)V, compute Q(K^T V).
Complexity: O(n * d²) instead of O(n² * d)
"""
# Kernel function: φ(x) = ELU(x) + 1
def phi(x):
return np.maximum(x, 0) + np.minimum(np.exp(x) - 1, 0) + 1
Q_prime = phi(Q) # (n, d)
K_prime = phi(K) # (n, d)
# Key insight: compute K^T V first (d × d matrix)
KV = K_prime.T @ V # (d, d) — independent of n!
# Then multiply by Q
numerator = Q_prime @ KV # (n, d)
# Normalization (replaces softmax denominator)
K_sum = K_prime.sum(axis=0, keepdims=True) # (1, d)
denominator = Q_prime @ K_sum.T # (n, 1)
output = numerator / (denominator + eps)
return output
# Compare complexities
n_values = [1000, 10000, 100000, 1000000]
d = 128
print(f"{'n':>10} {'O(n²d)':>15} {'O(nd²)':>15} {'Speedup':>10}")
print("=" * 55)
for n in n_values:
quad = n * n * d
linear = n * d * d
print(f"{n:>10,} {quad:>15,.0f} {linear:>15,.0f} {quad/linear:>10,.0f}×")Output:
n O(n²d) O(nd²) Speedup
=======================================================
1,000 128,000,000 16,384,000 8×
10,000 12,800,000,000 163,840,000 78×
100,000 1.28e+12 1,638,400,000 781×
1,000,000 1.28e+14 16,384,000,000 7,813×Comparison Table
| Method | Complexity | Memory | Quality vs Full | Used In |
|---|---|---|---|---|
| Full attention | Baseline | GPT-4, Claude | ||
| Longformer | ~95-98% | Long document tasks | ||
| Linformer | ~93-96% | Research | ||
| Reformer | ~92-95% | Research | ||
| Linear attention | ~88-93% | Mamba-hybrid models | ||
| Flash Attention | 100% (exact) | Everywhere |
Note: Flash Attention doesn’t reduce compute — it reduces memory. Sparse methods reduce both but sacrifice quality.
Why Production Models Still Use Full Attention
Despite these innovations, most production LLMs (GPT-4, Claude, Gemini, Llama) use full attention with Flash Attention rather than sparse alternatives. Why?
- Quality gap matters at scale. A 2-5% accuracy drop seems small but compounds across billions of queries.
- Flash Attention solved the memory problem. The main motivation for sparse attention was memory, and Flash Attention handles that.
- Hardware is improving faster than the quality gap is closing. H100s are 2-3× faster than A100s; the next generation will be faster still.
- Hybrid approaches are emerging. Models like Jamba use attention for some layers and linear attention (Mamba) for others — getting the best of both worlds.
The future likely isn’t “sparse attention replaces full attention” but rather “intelligent combination of both.”
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Private Code Context retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai