Multiple research approaches attack the quadratic bottleneck: Longformer, Reformer, Linformer, and linear attention. Here's the math behind each one.
At a party of 100 people, having everyone talk to everyone requires 4,950 conversations. But do you really need all of them? What if each guest only chatted with the people standing nearby, and a few hosts circulated to keep the whole room in sync?
This is the core idea behind sparse attention: don’t compute all interactions. Only compute the ones that matter.
Full self-attention:

Attention(Q, K, V) = softmax(QKᵀ / √d) · V

Complexity: O(n²·d) compute, O(n²) memory for the attention matrix.
Every sparse attention method asks: “Which entries of the n × n attention matrix can we skip?”
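As a baseline, here is what every method below is competing against, as a minimal single-head NumPy sketch (no batching, no masking):

```python
import numpy as np

def full_attention(Q, K, V):
    """Vanilla softmax attention: materializes the full n x n score matrix."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)  # (n, n): the O(n^2) bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V             # (n, d)
```

Each technique below is a different answer to the question of how to avoid ever forming that (n, n) `scores` array.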
Longformer (Beltagy et al., 2020) combines sliding window attention with global tokens.
Each token attends only to a window of w neighbors on each side, i.e. token i attends to positions i − w through i + w.

Complexity: O(n·w) instead of O(n²).
Selected tokens (e.g., [CLS] or other task-specific tokens) attend to ALL tokens, and all tokens attend to them.

If there are g global tokens, total complexity is O(n·w + n·g).
For w = 512 and a handful of global tokens, at n = 100,000 that is roughly 5 × 10⁷ attention entries instead of n² = 10¹⁰:
~200× reduction!
```python
import numpy as np

def longformer_attention_mask(n: int, window_size: int,
                              global_indices: list) -> np.ndarray:
    """
    Create a Longformer-style attention mask.
    1 = can attend, 0 = cannot attend
    """
    mask = np.zeros((n, n), dtype=np.float32)

    # Sliding window: each token attends to ±window_size neighbors
    for i in range(n):
        left = max(0, i - window_size)
        right = min(n, i + window_size + 1)
        mask[i, left:right] = 1.0

    # Global tokens: attend to and from everywhere
    for g in global_indices:
        mask[g, :] = 1.0  # Global token attends to all
        mask[:, g] = 1.0  # All tokens attend to global

    return mask

# Example: 20 tokens, window=3, 2 global tokens
mask = longformer_attention_mask(20, window_size=3, global_indices=[0, 19])
density = mask.sum() / (20 * 20)
print(f"Attention density: {density:.1%} (vs 100% for full attention)")
```

Linformer (Wang et al., 2020) observes that the attention matrix is approximately low-rank: most of its information can be captured by a much smaller matrix.
Instead of computing the full n × n score matrix, project K and V down to a small dimension k ≪ n:

K′ = E·K, V′ = F·V

where E, F ∈ ℝ^(k×n) are learned projection matrices.

Attention becomes:

softmax(Q·K′ᵀ / √d) · V′

Now the score matrix is n × k instead of n × n.

Complexity: O(n·k·d), linear in n!
```python
import numpy as np

def linformer_attention(Q, K, V, projection_dim=256):
    """
    Linformer: project K and V to a lower dimension k.
    Q: (n, d), K: (n, d), V: (n, d)
    Output: (n, d)
    Complexity: O(n * k * d) instead of O(n² * d)
    """
    n, d = Q.shape
    k = projection_dim

    # Learned projection matrices (initialized randomly here)
    E_K = np.random.randn(k, n) / np.sqrt(k)
    E_V = np.random.randn(k, n) / np.sqrt(k)

    # Project K and V: (k, n) @ (n, d) = (k, d)
    K_proj = E_K @ K  # (k, d)
    V_proj = E_V @ V  # (k, d)

    # Scores: Q @ K_proj^T = (n, d) @ (d, k) = (n, k), linear in n!
    scores = Q @ K_proj.T / np.sqrt(d)  # (n, k) instead of (n, n)

    # Softmax over k (much smaller than n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Output: (n, k) @ (k, d) = (n, d)
    output = weights @ V_proj
    return output
```

Reformer (Kitaev et al., 2020) uses Locality-Sensitive Hashing (LSH) to group similar queries and keys, then only computes attention within groups.
The idea: if qᵢ and kⱼ would have a high attention score, they point in similar directions, so their hash values are likely to collide.

Complexity: O(n log n) with high probability.
Tokens with the same hash are grouped into buckets. Attention is computed only within buckets.
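The bucketing step can be sketched with a single round of angular LSH (an illustrative toy, not Reformer's actual multi-round random-rotation scheme; `lsh_buckets` is a name made up for this example):

```python
import numpy as np

def lsh_buckets(X, n_buckets, seed=0):
    """Angular LSH sketch: project rows of X onto random directions and take
    the argmax over [h, -h]. Vectors with similar direction tend to collide."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], n_buckets // 2))
    h = X @ R  # (n, n_buckets // 2)
    return np.argmax(np.concatenate([h, -h], axis=-1), axis=-1)

# Group 16 token vectors into 4 buckets; attention would then be
# restricted to pairs within the same bucket.
X = np.random.default_rng(1).standard_normal((16, 8))
buckets = lsh_buckets(X, n_buckets=4)
for b in np.unique(buckets):
    members = np.flatnonzero(buckets == b)  # attend only within this bucket
```

Because the hash is deterministic given the random directions, identical vectors always land in the same bucket, and near-identical ones almost always do.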
Linear attention (Katharopoulos et al., 2020) replaces the softmax with a kernel feature map φ:

Standard: Attentionᵢ = Σⱼ exp(qᵢ·kⱼ) vⱼ / Σⱼ exp(qᵢ·kⱼ)

Linear: Attentionᵢ = Σⱼ (φ(qᵢ)ᵀφ(kⱼ)) vⱼ / Σⱼ φ(qᵢ)ᵀφ(kⱼ) = φ(qᵢ)ᵀ(Σⱼ φ(kⱼ)vⱼᵀ) / φ(qᵢ)ᵀ(Σⱼ φ(kⱼ))

The key trick is changing the order of operations:

Standard: (QKᵀ)V → compute QKᵀ first (an n × n matrix), then multiply by V

Linear: φ(Q)(φ(K)ᵀV) → compute φ(K)ᵀV first (a d × d matrix), then multiply by φ(Q)

Since d ≪ n for long contexts, this is a massive improvement!
```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """
    Linear attention using the ELU+1 kernel.
    Instead of softmax(QK^T)V, compute Q'(K'^T V).
    Complexity: O(n * d²) instead of O(n² * d)
    """
    # Kernel feature map: φ(x) = ELU(x) + 1 (always positive)
    def phi(x):
        return np.maximum(x, 0) + np.minimum(np.exp(x) - 1, 0) + 1

    Q_prime = phi(Q)  # (n, d)
    K_prime = phi(K)  # (n, d)

    # Key insight: compute K'^T V first (a d × d matrix, independent of n!)
    KV = K_prime.T @ V  # (d, d)

    # Then multiply by Q'
    numerator = Q_prime @ KV  # (n, d)

    # Normalization (replaces the softmax denominator)
    K_sum = K_prime.sum(axis=0, keepdims=True)  # (1, d)
    denominator = Q_prime @ K_sum.T  # (n, 1)

    output = numerator / (denominator + eps)
    return output

# Compare complexities
n_values = [1000, 10000, 100000, 1000000]
d = 128
print(f"{'n':>10} {'O(n²d)':>15} {'O(nd²)':>15} {'Speedup':>10}")
print("=" * 55)
for n in n_values:
    quad = n * n * d
    linear = n * d * d
    print(f"{n:>10,} {quad:>15,.0f} {linear:>15,.0f} {quad/linear:>10,.0f}×")
```

Output:
```
         n          O(n²d)          O(nd²)    Speedup
=======================================================
     1,000     128,000,000      16,384,000          8×
    10,000  12,800,000,000     163,840,000         78×
   100,000 1,280,000,000,000   1,638,400,000        781×
 1,000,000 128,000,000,000,000  16,384,000,000      7,812×
```

| Method | Complexity | Memory | Quality vs Full | Used In |
|---|---|---|---|---|
| Full attention | O(n²·d) | O(n²) | Baseline | GPT-4, Claude |
| Longformer | O(n·w·d) | O(n·w) | ~95-98% | Long document tasks |
| Linformer | O(n·k·d) | O(n·k) | ~93-96% | Research |
| Reformer | O(n·log n·d) | O(n·log n) | ~92-95% | Research |
| Linear attention | O(n·d²) | O(n·d) | ~88-93% | Mamba-hybrid models |
| Flash Attention | O(n²·d) | O(n) | 100% (exact) | Everywhere |
Note: Flash Attention doesn’t reduce compute — it reduces memory. Sparse methods reduce both but sacrifice quality.
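The memory point can be made concrete with a streaming (online) softmax: process the keys one block at a time, keeping running maxima and denominators, so the n × n score matrix never exists. This is a NumPy sketch of the algorithmic idea only; the real Flash Attention is a fused GPU kernel:

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Exact attention computed one block of keys at a time (online softmax).
    Same FLOPs as full attention, but only an (n, block) score tile in memory."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    denom = np.zeros(n)          # running softmax denominator per query
    m = np.full(n, -np.inf)      # running row-wise max (numerical stability)
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j + block].T / np.sqrt(d)  # (n, block) tile only
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)              # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        denom = denom * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ V[j:j + block]
        m = m_new
    return out / denom[:, None]
```

The result matches full attention to floating-point precision, which is exactly the "100% (exact)" row in the table above.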
Despite these innovations, most production LLMs (GPT-4, Claude, Gemini, Llama) use full attention with Flash Attention rather than sparse alternatives. Why? Because sparse methods trade a few points of quality for speed, and Flash Attention already removes the memory bottleneck, exact attention on efficient kernels usually wins in practice.
The future likely isn’t “sparse attention replaces full attention” but rather “intelligent combination of both.”
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai