Transformers have no inherent notion of order. RoPE encodes relative positions via rotation matrices. Here's the full math from sinusoidal to NTK-aware scaling.
Imagine reading a sentence where all the words are on separate cards, shuffled randomly: “dog the lazy over jumps fox brown quick the.” Without knowing the original order, the meaning is lost.
Transformers have the same problem. Unlike humans who read left-to-right, transformers process all tokens simultaneously — they see a set of tokens, not a sequence. Without position information, “the dog bit the man” and “the man bit the dog” would be identical.
Position encodings inject order information into the tokens so the model knows which word came first, second, third, etc.
The original transformer (Vaswani et al., 2017) used sine and cosine waves at different frequencies:
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Where $pos$ is the position and $i$ is the dimension index.
```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """Original sinusoidal position encoding from 'Attention Is All You Need'."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]                 # (max_len, 1)
    div_term = 10000 ** (2 * np.arange(d_model // 2) / d_model)  # (d/2,)
    pe[:, 0::2] = np.sin(position / div_term)  # Even dimensions
    pe[:, 1::2] = np.cos(position / div_term)  # Odd dimensions
    return pe

# Generate for 100 positions, dimension 64
pe = sinusoidal_position_encoding(100, 64)
print(f"Position encoding shape: {pe.shape}")
print(f"Position 0, first 8 dims: {pe[0, :8].round(3)}")
print(f"Position 1, first 8 dims: {pe[1, :8].round(3)}")
```

The problem: sinusoidal encodings are absolute — each position gets a fixed vector. The model must learn that position 5 has a certain relationship to position 10. It cannot generalize to positions it hasn't seen during training.
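The "absolute" part is easy to check: the encoding for a given position is a fixed vector, independent of how long the sequence is or what the tokens are. A minimal sketch (re-deriving the same construction inline so it runs standalone):

```python
import numpy as np

def pe(max_len, d_model):
    # Same construction as sinusoidal_position_encoding above
    pos = np.arange(max_len)[:, None]
    div = 10000 ** (2 * np.arange(d_model // 2) / d_model)
    out = np.zeros((max_len, d_model))
    out[:, 0::2] = np.sin(pos / div)
    out[:, 1::2] = np.cos(pos / div)
    return out

# Position 5 gets the exact same vector whether the sequence
# has 10 tokens or 100:
print(np.allclose(pe(100, 64)[5], pe(10, 64)[5]))  # True
```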
RoPE (Su et al., 2021) is the position encoding used by virtually every modern LLM (Llama, Mistral, Qwen, Gemini). The key idea: encode positions by rotating query and key vectors.
Instead of adding a position vector, RoPE rotates the query and key vectors by an angle proportional to their position. When computing the dot product $q_m^\top k_n$, the rotation angles cancel in a way that only the relative position matters.
For a pair of dimensions $(x_1, x_2)$ at position $m$, apply a rotation:

$$\begin{pmatrix} x_1' \\ x_2' \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

Where $\theta_i = 10000^{-2i/d}$ is the frequency for dimension pair $i$.

For a $d$-dimensional vector, RoPE applies $d/2$ independent rotations to $d/2$ pairs:

$$R_m = \begin{pmatrix} R(m\theta_0) & & \\ & \ddots & \\ & & R(m\theta_{d/2-1}) \end{pmatrix}$$

The rotated query and key at positions $m$ and $n$:

$$q_m' = R_m q_m, \qquad k_n' = R_n k_n$$

The attention score between position $m$ and $n$:

$$(q_m')^\top k_n' = q_m^\top R_m^\top R_n k_n$$

Since rotation matrices satisfy $R_m^\top R_n = R_{n-m}$:

$$(q_m')^\top k_n' = q_m^\top R_{n-m} k_n$$

The score depends only on relative position $n - m$, not absolute positions $m$ and $n$. QED.
```python
import numpy as np

def compute_rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """Compute the rotation frequencies for RoPE."""
    freqs = 1.0 / (base ** (np.arange(0, dim, 2).astype(float) / dim))
    return freqs

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0):
    """
    Apply Rotary Position Embeddings to input tensor.

    Args:
        x: Input tensor of shape (seq_len, dim)
        positions: Position indices of shape (seq_len,)
        base: Base frequency (default 10000)

    Returns:
        Rotated tensor of same shape
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "Dimension must be even"

    # Compute frequencies: θ_i = base^(-2i/d)
    freqs = compute_rope_frequencies(dim, base)

    # Compute angles: m * θ_i for each position m
    angles = np.outer(positions, freqs)  # (seq_len, dim/2)

    # Split input into pairs
    x_even = x[:, 0::2]  # Even dimensions
    x_odd = x[:, 1::2]   # Odd dimensions

    # Apply rotation
    cos_angles = np.cos(angles)
    sin_angles = np.sin(angles)
    x_rotated_even = x_even * cos_angles - x_odd * sin_angles
    x_rotated_odd = x_even * sin_angles + x_odd * cos_angles

    # Interleave back
    result = np.empty_like(x)
    result[:, 0::2] = x_rotated_even
    result[:, 1::2] = x_rotated_odd
    return result

# Demonstrate RoPE
dim = 8
seq_len = 5

# Random query and key vectors
np.random.seed(42)
q = np.random.randn(seq_len, dim)
k = np.random.randn(seq_len, dim)
positions = np.arange(seq_len)

# Apply RoPE
q_rotated = apply_rope(q, positions)
k_rotated = apply_rope(k, positions)

# The dot product depends only on relative position:
# q[3] · k[1] would equal q[4] · k[2] if the underlying q and k
# vectors were the same (both pairs have relative distance 2)
print("Attention scores (raw):")
scores_raw = q @ k.T
print(scores_raw.round(3))
print("\nAttention scores (with RoPE):")
scores_rope = q_rotated @ k_rotated.T
print(scores_rope.round(3))
```

A model using RoPE is trained with a specific maximum context length (e.g., 4K or 8K tokens). At positions beyond the training length, the rotation angles exceed the range seen during training.
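The relative-position property can be verified numerically. Below is a self-contained sketch using the equivalent complex-number view of RoPE, in which each (even, odd) dimension pair is treated as one complex number multiplied by $e^{im\theta_i}$. Shifting both positions by the same offset leaves the attention score unchanged:

```python
import numpy as np

def rope_complex(x, m, base=10000.0):
    """Apply RoPE to one vector x (even dim) at position m,
    viewing each (even, odd) pair as a complex number."""
    d = x.shape[0]
    theta = 1.0 / base ** (np.arange(d // 2) / (d // 2))
    z = x[0::2] + 1j * x[1::2]       # pair -> complex number
    z = z * np.exp(1j * m * theta)   # rotate pair i by m * theta_i
    out = np.empty_like(x)
    out[0::2], out[1::2] = z.real, z.imag
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Positions (2, 0) and (7, 5) have the same relative offset of 2,
# so the scores match even though the absolute positions differ:
s1 = rope_complex(q, 2) @ rope_complex(k, 0)
s2 = rope_complex(q, 7) @ rope_complex(k, 5)
print(np.isclose(s1, s2))  # True
```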
For high-frequency components ($\theta_i$ large), the angles wrap around quickly — this is fine. But for low-frequency components ($\theta_i$ small), the angle at position $m$ is $m\theta_i$. For the lowest frequency, $\theta_{\min} \approx 1/10000$:

At $m = 4096$ (trained): angle $\approx 0.41$ radians. At $m = 128000$ (extrapolating): angle $\approx 12.8$ radians.

The model has never seen angles this large — it doesn't know how to interpret them.
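A quick numeric check, approximating the lowest RoPE frequency by $1/\text{base}$ (its large-$d$ limit) and using a 4096-token training length with a 128000-token target:

```python
base = 10000.0
theta_min = 1.0 / base  # lowest RoPE frequency, approximately, for large d

print(round(4096 * theta_min, 2))    # angle at the trained length: 0.41 rad
print(round(128000 * theta_min, 2))  # angle when extrapolating: 12.8 rad
```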
NTK-aware scaling (bloc97, 2023) modifies the base frequency to spread the same angular range over more positions:

$$\theta_i' = \left(b \cdot \alpha^{d/(d-2)}\right)^{-2i/d}$$

Where $\alpha = L_{\text{target}} / L_{\text{train}}$ is the length-extension factor.

This is equivalent to scaling the base:

$$b' = b \cdot \alpha^{d/(d-2)}$$
```python
def apply_rope_ntk_scaled(x, positions, base=10000.0,
                          train_length=4096, target_length=128000):
    """RoPE with NTK-aware scaling for length extension."""
    alpha = target_length / train_length
    # Scale the base frequency: b' = b * alpha^(d/(d-2))
    scaled_base = base * alpha ** (x.shape[1] / (x.shape[1] - 2))
    return apply_rope(x, positions, base=scaled_base)
```

Attention with Linear Biases (ALiBi) (Press et al., 2022) takes a completely different approach. Instead of modifying the embeddings, it adds a linear penalty to the attention score based on distance:

$$\text{score}_{ij} = q_i \cdot k_j - m_h \cdot |i - j|$$

Where $m_h$ is a head-specific slope. Each head has a different slope, allowing different heads to have different "attention spans."
```python
def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    """Compute ALiBi attention bias matrix."""
    # Slopes: geometric sequence from 2^(-8/n_heads) to 2^(-8)
    slopes = 2 ** (-8 * np.arange(1, n_heads + 1) / n_heads)

    # Distance matrix
    positions = np.arange(seq_len)
    distances = positions[:, None] - positions[None, :]  # (seq_len, seq_len)

    # Bias: -slope * |distance| for each head
    bias = np.zeros((n_heads, seq_len, seq_len))
    for h in range(n_heads):
        bias[h] = -slopes[h] * np.abs(distances)
    return bias

bias = alibi_bias(10, 8)
print(f"ALiBi bias shape: {bias.shape}")
print(f"Head 0, first 5×5:\n{bias[0, :5, :5].round(3)}")
```

ALiBi pros: no learned parameters; naturally extrapolates to any length.

ALiBi cons: the fixed linear decay may not capture all positional relationships.
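To show where the bias actually enters, here is a minimal self-contained sketch of ALiBi inside attention. The slopes and distance penalty match `alibi_bias` above; the bias is simply added to the scaled dot-product scores before the softmax. (The causal mask from the paper is omitted for brevity.)

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, n_heads, d = 6, 4, 16
rng = np.random.default_rng(0)
q = rng.standard_normal((n_heads, seq_len, d))
k = rng.standard_normal((n_heads, seq_len, d))

# Same slopes and distance penalty as alibi_bias above
slopes = 2 ** (-8 * np.arange(1, n_heads + 1) / n_heads)
pos = np.arange(seq_len)
bias = -slopes[:, None, None] * np.abs(pos[None, :, None] - pos[None, None, :])

# ALiBi: add the bias to the raw scores before the softmax
scores = q @ k.transpose(0, 2, 1) / np.sqrt(d) + bias
weights = softmax(scores)
print(weights.shape)  # (4, 6, 6)
```

Because the bias is applied per head, steep-slope heads attend mostly locally while shallow-slope heads keep a longer effective span.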
| Method | Type | Length Extrapolation | Used By |
|---|---|---|---|
| Sinusoidal | Absolute | Poor | Original Transformer |
| Learned | Absolute | None | GPT-2 |
| RoPE | Relative (rotation) | Good (with scaling) | Llama, Mistral, Qwen |
| ALiBi | Relative (bias) | Excellent | MPT, BLOOM |
| NTK-RoPE | Relative (scaled rotation) | Very good | Extended Llama |
| YaRN | Relative (mixed scaling) | Very good | Many fine-tuned models |
RoPE dominates the field today because it offers the best combination of quality, flexibility, and extensibility.