Rotary Position Embeddings: The Full Mathematical Derivation
The Spinning Wheel Analogy
Imagine a set of wheels spinning at different speeds. A slow wheel (like an hour hand) completes one rotation every 12 hours. A fast wheel (like a second hand) completes 60 rotations per hour. If you mark a dot on each wheel, the pattern of dot positions after any time uniquely identifies that time.
RoPE works exactly like this. Each pair of dimensions is a “wheel” spinning at a different frequency. The position of a token is encoded by the pattern of rotations across all these wheels. And the beautiful part: when two rotated vectors are compared (dot product), the result depends only on the difference in rotation — which is the relative position.
Starting From First Principles
The Goal
We want a position encoding function $f$ that, when applied to a query $q$ at position $m$ and a key $k$ at position $n$, produces an attention score that depends only on $q$, $k$, and the relative position $m - n$:
$$\langle f(q, m),\, f(k, n) \rangle = g(q, k, m - n)$$
2D Case: The Building Block
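Let’s start with 2D vectors. We want a function $f(x, m)$ such that:
$$\langle f(q, m),\, f(k, n) \rangle = g(q, k, m - n), \qquad q, k \in \mathbb{R}^2$$
Claim: $f(x, m) = R(m\theta)\,x$, where $R(\phi)$ is a rotation matrix:
$$R(\phi) = \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix}$$
Proof of Relative Position Property
$$\langle R(m\theta)\,q,\, R(n\theta)\,k \rangle = q^\top R(m\theta)^\top R(n\theta)\,k$$
Since rotation matrices satisfy $R(\phi)^\top = R(-\phi)$:
$$= q^\top R(-m\theta)\, R(n\theta)\, k$$
Since rotations compose by addition, $R(\alpha)\,R(\beta) = R(\alpha + \beta)$:
$$= q^\top R\big((n - m)\theta\big)\, k$$
The inner product depends only on $q$, $k$, and the relative position $m - n$. QED.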
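The two rotation identities the proof relies on, and the resulting relative-position property, are easy to check numerically. The sketch below is only an illustration: the helper names `rot2d` and `score`, the frequency `theta = 0.3`, and the random test vectors are assumptions made for the demo, not part of the derivation.

```python
import numpy as np

def rot2d(phi: float) -> np.ndarray:
    """2x2 counter-clockwise rotation matrix R(phi)."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s], [s, c]])

theta = 0.3                      # arbitrary frequency chosen for the demo
rng = np.random.default_rng(0)
q, k = rng.standard_normal(2), rng.standard_normal(2)

# The two identities used in the proof.
transpose_ok = np.allclose(rot2d(0.7).T, rot2d(-0.7))          # R(phi)^T = R(-phi)
compose_ok = np.allclose(rot2d(0.4) @ rot2d(1.1), rot2d(1.5))  # R(a) R(b) = R(a + b)

def score(m: int, n: int) -> float:
    """Inner product of the rotated query (position m) and key (position n)."""
    return float((rot2d(m * theta) @ q) @ (rot2d(n * theta) @ k))

# Same offset m - n = -3 at different absolute positions.
scores = [score(m, n) for m, n in [(0, 3), (5, 8), (100, 103)]]
same_offset_ok = np.allclose(scores, scores[0])

print(transpose_ok, compose_ok, same_offset_ok)
```

All three flags should come out `True`: shifting both positions by the same amount leaves the score unchanged.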
Expanding the Dot Product
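$$\langle f(q, m),\, f(k, n) \rangle = (q_1 k_1 + q_2 k_2)\cos(\Delta\theta) + (q_1 k_2 - q_2 k_1)\sin(\Delta\theta)$$
Where $\Delta = m - n$, $q = (q_1, q_2)$, and $k = (k_1, k_2)$. This is a function of the content ($q$, $k$) modulated by a function of relative position ($\Delta$).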
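As a sanity check on the expansion, we can compare the directly computed dot product of two rotated 2D vectors against the closed form $(q_1 k_1 + q_2 k_2)\cos((m-n)\theta) + (q_1 k_2 - q_2 k_1)\sin((m-n)\theta)$, which follows from writing out the rotations. The function names, the frequency, and the test vectors below are illustrative assumptions.

```python
import numpy as np

def rotate(x: np.ndarray, phi: float) -> np.ndarray:
    """Rotate a 2D vector counter-clockwise by angle phi."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

def expanded_score(q, k, delta, theta):
    """Closed form: content terms modulated by cos/sin of delta * theta."""
    return ((q[0] * k[0] + q[1] * k[1]) * np.cos(delta * theta)
            + (q[0] * k[1] - q[1] * k[0]) * np.sin(delta * theta))

theta = 0.25                     # illustrative frequency
q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

max_err = 0.0
for m, n in [(0, 0), (2, 7), (40, 13), (1000, 998)]:
    direct = rotate(q, m * theta) @ rotate(k, n * theta)   # rotate, then dot
    closed = expanded_score(q, k, m - n, theta)            # formula in m - n only
    max_err = max(max_err, abs(direct - closed))

print(f"max |direct - closed form| = {max_err:.2e}")
```

The cosine term carries the ordinary dot product of the content, while the sine term carries the 2D cross product, so the score mixes both as the relative position changes.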
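Extension to $d$ Dimensions
For a $d$-dimensional vector (where $d$ is even), we partition the dimensions into $d/2$ pairs and apply independent rotations to each pair:
$$f(x, m) = R_{\Theta, m}\, x, \qquad R_{\Theta, m} = \begin{pmatrix} R(m\theta_0) & & & \\ & R(m\theta_1) & & \\ & & \ddots & \\ & & & R(m\theta_{d/2-1}) \end{pmatrix}$$
This is a block-diagonal matrix with $d/2$ rotation blocks, each a $2 \times 2$ rotation $R(m\theta_i)$ acting on dimensions $(2i, 2i+1)$.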
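To make the block-diagonal structure concrete, here is a small sketch that builds the dense matrix explicitly for $d = 8$ and checks that it behaves like a single rotation. The helper `rope_matrix` is a hypothetical name, and the frequencies use the base-10,000 schedule from the Frequency Schedule section; real implementations never materialize this matrix.

```python
import numpy as np

def rope_matrix(pos: int, freqs: np.ndarray) -> np.ndarray:
    """Dense d x d block-diagonal rotation matrix for one position."""
    d = 2 * len(freqs)
    R = np.zeros((d, d))
    for i, theta in enumerate(freqs):
        c, s = np.cos(pos * theta), np.sin(pos * theta)
        # Pair i occupies rows/columns 2i and 2i + 1.
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

d = 8
freqs = 10000.0 ** (-np.arange(0, d, 2) / d)   # one frequency per pair
R = rope_matrix(5, freqs)

# A block-diagonal matrix of rotations is itself orthogonal ...
orthogonal_ok = np.allclose(R.T @ R, np.eye(d))
# ... and R_m^T R_n depends only on n - m, as in the 2D case.
relative_ok = np.allclose(rope_matrix(5, freqs).T @ rope_matrix(9, freqs),
                          rope_matrix(4, freqs))
print(orthogonal_ok, relative_ok)
```

Because each block only touches its own pair of dimensions, the 2D proof carries over pair by pair.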
Frequency Schedule
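Each dimension pair $i$ gets its own frequency:
$$\theta_i = \text{base}^{-2i/d}, \qquad i = 0, 1, \dots, d/2 - 1$$
With the standard base of 10,000:
$$\theta_i = 10000^{-2i/d}$$
The frequencies form a geometric progression from $\theta_0 = 1$ (high frequency) to $\theta_{d/2-1} = 10000^{-(d-2)/d}$, which approaches $10^{-4}$ for large $d$ (low frequency).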
Wavelengths
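Each frequency corresponds to a wavelength, the number of tokens needed for one full rotation:
$$\lambda_i = \frac{2\pi}{\theta_i}$$
| Dimension Pair $i$ | Frequency $\theta_i$ | Wavelength $\lambda_i$ |
|---|---|---|
| $0$ | 1.0 | $2\pi \approx 6.28$ tokens |
| $d/8$ | 0.1 | $\approx 62.8$ tokens |
| $d/4$ | 0.01 | $\approx 628$ tokens |
| $d/2$ (limit) | 0.0001 | $\approx 62{,}832$ tokens |
The final row is the limiting value; the last actual pair, $i = d/2 - 1$, has a frequency slightly above 0.0001 and a correspondingly shorter wavelength.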
Low-frequency dimensions encode long-range position information. High-frequency dimensions encode fine-grained position differences.
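The wavelengths are quick to compute for a concrete head dimension. The sketch below assumes `dim = 128` and the standard base of 10,000; `rope_wavelengths` is a hypothetical helper that applies $2\pi / \theta_i$ to each pair's frequency $\theta_i = \text{base}^{-2i/d}$.

```python
import numpy as np

def rope_wavelengths(dim: int, base: float = 10000.0) -> np.ndarray:
    """Wavelength (in tokens) of each dimension pair: 2*pi / theta_i."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return 2 * np.pi / freqs

dim = 128
wavelengths = rope_wavelengths(dim)
# Pairs at 0, d/8, d/4 and the last pair d/2 - 1.
for i in [0, dim // 8, dim // 4, dim // 2 - 1]:
    print(f"pair {i:>3}: wavelength ≈ {wavelengths[i]:,.1f} tokens")
```

The last pair comes out somewhat below $2\pi \times 10^4$ tokens, because its exponent is $(d-2)/d$ rather than exactly 1.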
Complete NumPy Implementation

    import numpy as np


    def rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
        """Compute RoPE frequency schedule: one frequency per dimension pair."""
        i = np.arange(0, dim, 2, dtype=np.float64)  # even indices 0, 2, ..., dim-2
        freqs = base ** (-i / dim)
        return freqs


    def apply_rope_to_vector(x: np.ndarray, pos: int,
                             freqs: np.ndarray) -> np.ndarray:
        """Apply RoPE rotation to a single vector at a given position."""
        d = x.shape[0]
        assert d == len(freqs) * 2

        angles = pos * freqs  # (d/2,)
        cos_a = np.cos(angles)
        sin_a = np.sin(angles)

        # Each (even, odd) index pair is one 2D plane to rotate.
        x_even = x[0::2]
        x_odd = x[1::2]

        rotated = np.empty_like(x)
        rotated[0::2] = x_even * cos_a - x_odd * sin_a
        rotated[1::2] = x_even * sin_a + x_odd * cos_a
        return rotated


    def verify_relative_position_property():
        """
        Verify that RoPE attention scores depend only on relative position.

        For positions (m, n) and (m', n') where m-n = m'-n',
        the dot product q_m · k_n should equal q_m' · k_n'.
        """
        dim = 64
        freqs = rope_frequencies(dim)

        np.random.seed(42)
        q_content = np.random.randn(dim)
        k_content = np.random.randn(dim)

        print("Verification: q·k depends only on relative position (m-n)")
        print(f"{'(m, n)':>10} {'m-n':>6} {'q_rot · k_rot':>15}")
        print("-" * 35)

        # Test various (m, n) pairs with same relative distance
        for m, n in [(0, 3), (5, 8), (100, 103), (1000, 1003)]:
            q_rot = apply_rope_to_vector(q_content, m, freqs)
            k_rot = apply_rope_to_vector(k_content, n, freqs)
            score = np.dot(q_rot, k_rot)
            print(f"({m:>4}, {n:>4}) {m-n:>6} {score:>15.10f}")

        print()

        # Different relative distances
        for m, n in [(0, 0), (0, 1), (0, 5), (0, 50), (0, 500)]:
            q_rot = apply_rope_to_vector(q_content, m, freqs)
            k_rot = apply_rope_to_vector(k_content, n, freqs)
            score = np.dot(q_rot, k_rot)
            print(f"({m:>4}, {n:>4}) {m-n:>6} {score:>15.10f}")


    verify_relative_position_property()

Expected output:
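
    Verification: q·k depends only on relative position (m-n)
        (m, n)    m-n   q_rot · k_rot
    -----------------------------------
    (   0,    3)     -3    2.4871234567   ← same!
    (   5,    8)     -3    2.4871234567   ← same!
    ( 100,  103)     -3    2.4871234567   ← same!
    (1000, 1003)     -3    2.4871234567   ← same!

    (   0,    0)      0    5.1234567890
    (   0,    1)     -1    4.8765432109
    (   0,    5)     -5    1.2345678901
    (   0,   50)    -50   -0.5678901234
    (   0,  500)   -500    0.0234567890

All pairs with relative distance -3 produce the identical dot product, up to floating-point rounding. The digits shown are illustrative placeholders, and the "← same!" markers are annotations rather than program output; the real values depend on the random draw, but the first four rows will match each other.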
NTK-Aware Scaling
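When extending RoPE beyond its training length, the angles at high positions exceed the training distribution. NTK-aware scaling rescales the base frequency:
$$\text{base}' = \text{base} \cdot \alpha^{d/(d-2)}$$
Where $\alpha = L_{\text{target}} / L_{\text{train}}$ is the length extension factor.
This stretches the wavelengths unevenly: the highest-frequency pair ($\theta_0 = 1$) is unchanged, the lowest-frequency pair is stretched by exactly $\alpha$, and pair $i$ in between is stretched by $\alpha^{2i/(d-2)}$: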

    def ntk_scaled_rope(dim, base=10000.0, train_len=4096, target_len=128000):
        """NTK-aware RoPE scaling for length extension."""
        alpha = target_len / train_len
        scaled_base = base * alpha ** (dim / (dim - 2))

        # rope_frequencies is defined in the implementation above.
        original_freqs = rope_frequencies(dim, base)
        scaled_freqs = rope_frequencies(dim, scaled_base)

        print(f"Scale factor α = {alpha:.1f}")
        print(f"Base: {base:.0f} → {scaled_base:.0f}")
        print(f"\nFrequency comparison (first 5 pairs):")
        print(f"{'Pair':>6} {'Original θ':>12} {'Scaled θ':>12} {'Ratio':>8}")
        print("-" * 42)
        for i in range(5):
            # Ratio > 1 means the pair's wavelength was stretched.
            ratio = original_freqs[i] / scaled_freqs[i]
            print(f"{i:>6} {original_freqs[i]:>12.6f} {scaled_freqs[i]:>12.6f} {ratio:>8.1f}×")

        return scaled_freqs


    ntk_scaled_rope(dim=128, train_len=4096, target_len=128000)

YaRN: Yet Another RoPE Extension
YaRN (Peng et al., 2023) improves on NTK scaling by applying different scaling factors to different frequency bands:
- High-frequency dimensions (short wavelengths): don’t scale, since they already complete many full rotations within the training length
- Low-frequency dimensions (long wavelengths): interpolate fully, dividing the frequency by the length scale factor
- Medium-frequency dimensions: blend smoothly between the two
This preserves the model’s ability to handle short-range relationships while extending its reach for long-range ones.
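The band-wise idea can be sketched in a few lines. This is a simplified illustration, not the reference YaRN implementation: the function name `banded_rope_frequencies`, the linear ramp, and the thresholds `low_rot=1` and `high_rot=32` (measured in full rotations over the training length) are assumptions of this sketch. The low-frequency end is simply divided by the scale factor, and YaRN's additional attention-temperature adjustment is left out.

```python
import numpy as np

def banded_rope_frequencies(dim: int, base: float = 10000.0,
                            train_len: int = 4096, scale: float = 8.0,
                            low_rot: float = 1.0, high_rot: float = 32.0) -> np.ndarray:
    """Per-band frequency scaling in the spirit of YaRN (illustrative only)."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    # Number of full rotations each pair completes over the training length.
    rotations = train_len * freqs / (2 * np.pi)
    # ramp = 0 -> fully interpolated (low frequency), 1 -> untouched (high frequency).
    ramp = np.clip((rotations - low_rot) / (high_rot - low_rot), 0.0, 1.0)
    return (1.0 - ramp) * freqs / scale + ramp * freqs

orig = 10000.0 ** (-np.arange(0, 128, 2) / 128)
scaled = banded_rope_frequencies(128)
ratio = orig / scaled            # 1.0 = unchanged, `scale` = fully stretched
print(f"pair 0 ratio: {ratio[0]:.2f}, last pair ratio: {ratio[-1]:.2f}")
```

With these assumed settings the fastest pair keeps its original frequency, the slowest pair is slowed by the full factor of 8, and the pairs in between move gradually from one regime to the other.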
The Practical Impact
RoPE has become the default position encoding for a reason:
- Relative position emerges naturally from the rotation structure
- Length extension is possible through frequency scaling (NTK, YaRN)
- Efficient implementation — just element-wise multiplication by precomputed sin/cos
- No learned parameters for position encoding (the model learns to use positions through Q/K weights)
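To illustrate the efficiency point, here is a minimal vectorized sketch that precomputes the cos/sin tables once and rotates a whole sequence with element-wise operations. The names `precompute_cos_sin` and `apply_rope`, the shapes, and the interleaved (even, odd) pairing are assumptions of this example; production libraries differ in layout and details.

```python
import numpy as np

def precompute_cos_sin(seq_len: int, dim: int, base: float = 10000.0):
    """Tables of shape (seq_len, dim/2): cos and sin of pos * theta_i."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    angles = np.outer(np.arange(seq_len), freqs)
    return np.cos(angles), np.sin(angles)

def apply_rope(x: np.ndarray, cos: np.ndarray, sin: np.ndarray) -> np.ndarray:
    """Rotate every row of x (seq_len, dim) by its own position."""
    out = np.empty_like(x)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

seq_len, dim = 16, 64
cos, sin = precompute_cos_sin(seq_len, dim)      # computed once, then reused
rng = np.random.default_rng(0)
q_raw = rng.standard_normal((seq_len, dim))
k_raw = rng.standard_normal((seq_len, dim))
q = apply_rope(q_raw, cos, sin)
k = apply_rope(k_raw, cos, sin)
scores = q @ k.T                                 # (seq_len, seq_len) attention logits
print(scores.shape)
```

The tables contain no trainable values, and the rotation preserves each vector's norm, so position information enters only through the angle between queries and keys.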
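Every time you use an open-weight model such as Llama or Mistral, RoPE is encoding positions under the hood. Closed models such as Claude and ChatGPT do not publish their architectures, so the same cannot be confirmed for them.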