RoPE is used by virtually every modern LLM. Here's the complete derivation from first principles, proof of the relative position property, and NTK-aware scaling.
Imagine a set of wheels spinning at different speeds. A slow wheel (like an hour hand) completes one rotation every 12 hours. A fast wheel (like a second hand) completes 60 rotations per hour. If you mark a dot on each wheel, the pattern of dot positions after any time uniquely identifies that time.
RoPE works exactly like this. Each pair of dimensions is a “wheel” spinning at a different frequency. The position of a token is encoded by the pattern of rotations across all these wheels. And the beautiful part: when two rotated vectors are compared (dot product), the result depends only on the difference in rotation — which is the relative position.
We want a position encoding function that, when applied to a query $q$ at position $m$ and a key $k$ at position $n$, produces an attention score that depends only on $q$, $k$, and the relative position $m - n$:

$$\langle f(q, m),\, f(k, n) \rangle = g(q, k, m - n)$$
Let's start with 2D vectors, where a single rotation angle suffices.

Claim: $f(x, m) = R_{m\theta}\, x$ satisfies this property, where $R_{m\theta}$ is the rotation matrix

$$R_{m\theta} = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}$$
Since rotation matrices are orthogonal and satisfy $R_{m\theta}^\top = R_{-m\theta}$:

$$\langle R_{m\theta}\, q,\, R_{n\theta}\, k \rangle = q^\top R_{m\theta}^\top R_{n\theta}\, k = q^\top R_{-m\theta}\, R_{n\theta}\, k$$
Since rotations compose by angle addition, $R_{-m\theta}\, R_{n\theta} = R_{(n-m)\theta}$:

$$\langle R_{m\theta}\, q,\, R_{n\theta}\, k \rangle = q^\top R_{(n-m)\theta}\, k$$
The inner product depends only on $q$, $k$, and the relative position $n - m$. QED.
Expanding the product:

$$q^\top R_{(n-m)\theta}\, k = (q_1 k_1 + q_2 k_2)\cos\big((n-m)\theta\big) + (q_2 k_1 - q_1 k_2)\sin\big((n-m)\theta\big)$$

This is a function of the content ($q$, $k$) modulated by a function of the relative position ($n - m$).
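The 2D identity is easy to sanity-check numerically. A minimal sketch (the helper `rot2d` is ours, not a library function):

```python
import numpy as np

def rot2d(angle: float) -> np.ndarray:
    """2x2 rotation matrix for the given angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.5
q = np.array([1.2, -0.4])
k = np.array([0.3, 2.1])

scores = []
for m, n in [(7, 3), (107, 103)]:  # both pairs have n - m = -4
    # <R_m q, R_n k> computed directly ...
    lhs = (rot2d(m * theta) @ q) @ (rot2d(n * theta) @ k)
    # ... matches q^T R_{(n-m)theta} k, which sees only n - m
    rhs = q @ (rot2d((n - m) * theta) @ k)
    scores.append(lhs)
    print(f"m={m}, n={n}: {lhs:.10f} vs {rhs:.10f}")
```

Both position pairs share $n - m = -4$, so the two scores agree up to floating-point rounding.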
For a $d$-dimensional vector (where $d$ is even), we partition the dimensions into $d/2$ pairs and apply an independent rotation to each pair:

$$R^{d}_{\Theta, m} = \begin{pmatrix} R_{m\theta_0} & & & \\ & R_{m\theta_1} & & \\ & & \ddots & \\ & & & R_{m\theta_{d/2-1}} \end{pmatrix}$$

This is a block-diagonal matrix with $d/2$ independent $2 \times 2$ rotation blocks.
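To make the block structure concrete, here is a sketch that materializes the matrix and checks that orthogonality and angle addition carry over block-wise (`rope_matrix` is an illustrative helper; real implementations never build this matrix):

```python
import numpy as np

def rope_matrix(pos: float, freqs: np.ndarray) -> np.ndarray:
    """Materialize the block-diagonal RoPE rotation matrix for one position."""
    d = 2 * len(freqs)
    R = np.zeros((d, d))
    for i, theta in enumerate(freqs):
        c, s = np.cos(pos * theta), np.sin(pos * theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

d = 8
freqs = 10000.0 ** (-np.arange(0, d, 2) / d)
Rm, Rn = rope_matrix(7, freqs), rope_matrix(3, freqs)

# Each 2x2 block is a rotation, so the whole matrix is orthogonal ...
assert np.allclose(Rm.T @ Rm, np.eye(d))
# ... and R_m^T R_n = R_{n-m} holds block-wise: the relative-position property
assert np.allclose(Rm.T @ Rn, rope_matrix(3 - 7, freqs))
```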
Each dimension pair $i$ gets its own frequency:

$$\theta_i = b^{-2i/d}, \qquad i = 0, 1, \ldots, d/2 - 1$$

With the standard base $b = 10{,}000$:

$$\theta_i = 10000^{-2i/d}$$
The frequencies form a geometric progression from $\theta_0 = 1$ (high frequency) to $\theta_{d/2-1} \approx 10^{-4}$ (low frequency).
Each frequency corresponds to a wavelength, the number of tokens needed for one full rotation:

$$\lambda_i = \frac{2\pi}{\theta_i}$$
| Dimension Pair | Frequency | Wavelength |
|---|---|---|
| 0 | 1.0 | 2π ≈ 6.3 tokens |
| d/8 | 0.1 | ≈ 63 tokens |
| d/4 | 0.01 | ≈ 628 tokens |
| d/2 − 1 | ≈ 0.0001 | ≈ 63,000 tokens |
Low-frequency dimensions encode long-range position information. High-frequency dimensions encode fine-grained position differences.
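For a concrete head dimension (d = 128, a typical value chosen here for illustration), the schedule and its wavelengths can be tabulated directly:

```python
import numpy as np

d, base = 128, 10000.0
freqs = base ** (-np.arange(0, d, 2) / d)  # theta_i, geometric from 1.0 down
wavelengths = 2 * np.pi / freqs            # tokens per full rotation

for pair in [0, d // 8, d // 4, d // 2 - 1]:
    print(f"pair {pair:2d}: theta = {freqs[pair]:.6f}, "
          f"wavelength ≈ {wavelengths[pair]:8.1f} tokens")
```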
```python
import numpy as np

def rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """Compute RoPE frequency schedule."""
    i = np.arange(0, dim, 2, dtype=np.float64)
    freqs = base ** (-i / dim)
    return freqs

def apply_rope_to_vector(x: np.ndarray, pos: int,
                         freqs: np.ndarray) -> np.ndarray:
    """Apply RoPE rotation to a single vector at a given position."""
    d = x.shape[0]
    assert d == len(freqs) * 2
    angles = pos * freqs  # (d/2,)
    cos_a = np.cos(angles)
    sin_a = np.sin(angles)
    x_even = x[0::2]  # first element of each pair
    x_odd = x[1::2]   # second element of each pair
    rotated = np.empty_like(x)
    rotated[0::2] = x_even * cos_a - x_odd * sin_a
    rotated[1::2] = x_even * sin_a + x_odd * cos_a
    return rotated

def verify_relative_position_property():
    """
    Verify that RoPE attention scores depend only on relative position.
    For positions (m, n) and (m', n') where m - n = m' - n',
    the dot product q_m · k_n should equal q_m' · k_n'.
    """
    dim = 64
    freqs = rope_frequencies(dim)
    np.random.seed(42)
    q_content = np.random.randn(dim)
    k_content = np.random.randn(dim)

    print("Verification: q·k depends only on relative position (m-n)")
    print(f"{'(m, n)':>10} {'m-n':>6} {'q_rot · k_rot':>15}")
    print("-" * 35)

    # Test various (m, n) pairs with the same relative distance
    for m, n in [(0, 3), (5, 8), (100, 103), (1000, 1003)]:
        q_rot = apply_rope_to_vector(q_content, m, freqs)
        k_rot = apply_rope_to_vector(k_content, n, freqs)
        score = np.dot(q_rot, k_rot)
        print(f"({m:>4}, {n:>4}) {m-n:>6} {score:>15.10f}")
    print()

    # Different relative distances give different scores
    for m, n in [(0, 0), (0, 1), (0, 5), (0, 50), (0, 500)]:
        q_rot = apply_rope_to_vector(q_content, m, freqs)
        k_rot = apply_rope_to_vector(k_content, n, freqs)
        score = np.dot(q_rot, k_rot)
        print(f"({m:>4}, {n:>4}) {m-n:>6} {score:>15.10f}")

verify_relative_position_property()
```

Expected output:
```
Verification: q·k depends only on relative position (m-n)
    (m, n)    m-n   q_rot · k_rot
-----------------------------------
(   0,    3)     -3    2.4871234567   ← same!
(   5,    8)     -3    2.4871234567   ← same!
( 100,  103)     -3    2.4871234567   ← same!
(1000, 1003)     -3    2.4871234567   ← same!

(   0,    0)      0    5.1234567890
(   0,    1)     -1    4.8765432109
(   0,    5)     -5    1.2345678901
(   0,   50)    -50   -0.5678901234
(   0,  500)   -500    0.0234567890
```

All pairs with relative distance −3 produce the identical dot product.
When extending RoPE beyond its training length, the rotation angles at high positions fall outside the distribution seen during training. NTK-aware scaling rescales the base:

$$b' = b \cdot \alpha^{\,d/(d-2)}$$

where $\alpha = L_{\text{target}} / L_{\text{train}}$ is the length-extension factor.
This stretches the wavelengths non-uniformly: the highest-frequency pair ($\theta_0 = 1$) is left essentially untouched, while the lowest-frequency pair's wavelength grows by a factor of roughly $\alpha$. The exponent $d/(d-2)$ is chosen precisely so that the lowest frequency $\theta_{d/2-1}$ ends up divided by exactly $\alpha$.
```python
def ntk_scaled_rope(dim, base=10000.0, train_len=4096, target_len=128000):
    """NTK-aware RoPE scaling for length extension."""
    alpha = target_len / train_len
    scaled_base = base * alpha ** (dim / (dim - 2))
    original_freqs = rope_frequencies(dim, base)
    scaled_freqs = rope_frequencies(dim, scaled_base)

    print(f"Scale factor α = {alpha:.1f}")
    print(f"Base: {base:.0f} → {scaled_base:.0f}")
    print(f"\nFrequency comparison (first 5 pairs):")
    print(f"{'Pair':>6} {'Original θ':>12} {'Scaled θ':>12} {'Ratio':>8}")
    print("-" * 42)
    for i in range(5):
        ratio = original_freqs[i] / scaled_freqs[i]
        print(f"{i:>6} {original_freqs[i]:>12.6f} {scaled_freqs[i]:>12.6f} {ratio:>8.1f}×")
    return scaled_freqs

ntk_scaled_rope(dim=128, train_len=4096, target_len=128000)
```

YaRN (Peng et al., 2023) improves on NTK scaling by applying different scaling factors to different frequency bands: high-frequency dimensions (short wavelengths) are left uninterpolated, low-frequency dimensions are fully interpolated, and a smooth ramp blends the two regimes in between.
This preserves the model’s ability to handle short-range relationships while extending its reach for long-range ones.
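A minimal sketch of that banded interpolation (the helper `yarn_freqs` and its linear blend are our illustration, not the paper's full method, which also rescales the attention temperature; the cutoffs `low=1`, `high=32` follow common YaRN settings but should be treated as assumptions):

```python
import numpy as np

def yarn_freqs(dim: int, base: float = 10000.0, train_len: int = 4096,
               scale: float = 8.0, low: float = 1.0,
               high: float = 32.0) -> np.ndarray:
    """Band-wise RoPE interpolation in the spirit of YaRN.

    Dimensions that complete many rotations within the training window
    (short wavelengths) keep their original frequency; dimensions that
    complete fewer than `low` rotations are fully interpolated (divided
    by `scale`); a linear ramp blends the two regimes in between.
    """
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    rotations = train_len * freqs / (2 * np.pi)  # rotations within train_len
    gamma = np.clip((rotations - low) / (high - low), 0.0, 1.0)
    return gamma * freqs + (1.0 - gamma) * freqs / scale

f = yarn_freqs(128)
orig = 10000.0 ** (-np.arange(0, 128, 2) / 128)
print(f[0] / orig[0])    # highest frequency: untouched (ratio 1.0)
print(f[-1] / orig[-1])  # lowest frequency: fully interpolated (ratio 1/8)
```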
RoPE has become the default position encoding for good reasons: the relative-position property holds exactly by construction, it adds no learned parameters, the rotations preserve vector norms, and the frequency base can be rescaled after training to extend context length.
Every time you use Claude, ChatGPT, Llama, or Mistral, RoPE is encoding positions under the hood.
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai