Rotary Position Embeddings: The Full Mathematical Derivation

RoPE is used by virtually every modern LLM. Here's the complete derivation from first principles, proof of the relative position property, and NTK-aware scaling.

The Spinning Wheel Analogy

Imagine a set of wheels spinning at different speeds. A slow wheel (like an hour hand) completes one rotation every 12 hours. A fast wheel (like a second hand) completes 60 rotations per hour. If you mark a dot on each wheel, the pattern of dot positions after any time $t$ uniquely identifies that time.

RoPE works exactly like this. Each pair of dimensions is a “wheel” spinning at a different frequency. The position of a token is encoded by the pattern of rotations across all these wheels. And the beautiful part: when two rotated vectors are compared (dot product), the result depends only on the difference in rotation — which is the relative position.

Starting From First Principles

The Goal

We want a position encoding function $f(x, m)$ that, when applied to query $q$ at position $m$ and key $k$ at position $n$, produces an attention score that depends only on $x_q$, $x_k$, and the relative position $(m - n)$:

\langle f(x_q, m), f(x_k, n) \rangle = g(x_q, x_k, m - n)

2D Case: The Building Block

Let’s start with 2D vectors. We want a function $f: \mathbb{R}^2 \times \mathbb{N} \to \mathbb{R}^2$ such that:

f(x, m)^T f(y, n) = h(x, y, m - n)

Claim: $f(x, m) = R_\theta(m) x$ where $R_\theta(m)$ is a rotation matrix.

R_\theta(m) = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix}

Proof of Relative Position Property

f(x, m)^T f(y, n) = (R_\theta(m) x)^T (R_\theta(n) y)

= x^T R_\theta(m)^T R_\theta(n) y

Since rotation matrices satisfy $R(\alpha)^T = R(-\alpha)$:

= x^T R_\theta(-m) R_\theta(n) y

Since rotations compose by addition, $R(\alpha) R(\beta) = R(\alpha + \beta)$:

= x^T R_\theta(n - m) y

= g(x, y, n - m) \quad \checkmark

The inner product depends only on $x$, $y$, and the relative position $(n - m)$. QED.
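The proof is easy to check numerically. A minimal NumPy sketch (the helper `rot` is ours, defined only for this check):

```python
import numpy as np

def rot(angle: float) -> np.ndarray:
    """2x2 rotation matrix R(angle)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.5
x = np.array([1.0, 2.0])
y = np.array([-0.3, 0.7])

# Two (m, n) pairs with the same relative offset n - m = 4
score_a = np.dot(rot(3 * theta) @ x, rot(7 * theta) @ y)
score_b = np.dot(rot(50 * theta) @ x, rot(54 * theta) @ y)

assert np.isclose(score_a, score_b)  # score depends only on n - m
```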

Expanding the Dot Product

x^T R_\theta(n-m) y = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}^T \begin{pmatrix} \cos\Delta\theta & -\sin\Delta\theta \\ \sin\Delta\theta & \cos\Delta\theta \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}

= (x_1 y_1 + x_2 y_2) \cos\Delta\theta + (x_2 y_1 - x_1 y_2) \sin\Delta\theta

Where $\Delta = n - m$. This is a function of the content ($x$, $y$) modulated by a function of the relative position ($\Delta$).
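The expanded form can be verified against the matrix form directly (a small sketch; the vectors and angle are arbitrary):

```python
import numpy as np

theta = 0.3
m, n = 2, 9
delta = n - m

x = np.array([0.5, -1.2])
y = np.array([2.0, 0.4])

def R(k: int) -> np.ndarray:
    """Rotation by angle k * theta."""
    c, s = np.cos(k * theta), np.sin(k * theta)
    return np.array([[c, -s], [s, c]])

matrix_form = np.dot(R(m) @ x, R(n) @ y)
closed_form = (x[0] * y[0] + x[1] * y[1]) * np.cos(delta * theta) \
            + (x[1] * y[0] - x[0] * y[1]) * np.sin(delta * theta)

assert np.isclose(matrix_form, closed_form)
```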

Extension to $d$ Dimensions

For a $d$-dimensional vector (where $d$ is even), we partition the dimensions into $d/2$ pairs and apply an independent rotation to each pair:

R_d(m) = \text{diag}\left(R(m\theta_0), R(m\theta_1), \ldots, R(m\theta_{d/2-1})\right)

This is a block-diagonal matrix with $d/2$ rotation blocks.
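Because the matrix is block-diagonal, there is no need to materialize it: rotating each pair independently gives the same result in O(d) work rather than O(d²). A quick sketch:

```python
import numpy as np

def rot(angle: float) -> np.ndarray:
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

d, m = 8, 5
thetas = 10000.0 ** (-np.arange(0, d, 2) / d)  # theta_i = base^(-2i/d)

# Full block-diagonal R_d(m), built explicitly only for comparison
R_full = np.zeros((d, d))
for i, th in enumerate(thetas):
    R_full[2*i:2*i+2, 2*i:2*i+2] = rot(m * th)

x = np.random.default_rng(0).standard_normal(d)

# Pairwise application: what real implementations actually do
pairwise = np.empty_like(x)
for i, th in enumerate(thetas):
    pairwise[2*i:2*i+2] = rot(m * th) @ x[2*i:2*i+2]

assert np.allclose(R_full @ x, pairwise)
```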

Frequency Schedule

Each dimension pair $i$ gets its own frequency:

\theta_i = \text{base}^{-2i/d}

With the standard base of 10,000:

\theta_i = 10000^{-2i/d}

The frequencies form a geometric progression from $\theta_0 = 1$ (high frequency) down to $\theta_{d/2-1} = 10000^{-(d-2)/d} \approx 10^{-4}$ (low frequency).

Wavelengths

Each frequency $\theta_i$ corresponds to a wavelength:

\lambda_i = \frac{2\pi}{\theta_i} = 2\pi \times 10000^{2i/d}

Dimension pair i    Frequency θ_i    Wavelength λ_i
0                   1.0              2π ≈ 6.3 tokens
d/8                 0.1              ≈ 63 tokens
d/4                 0.01             ≈ 628 tokens
d/2 − 1             0.0001           ≈ 62,832 tokens

Low-frequency dimensions encode long-range position information. High-frequency dimensions encode fine-grained position differences.
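The table values can be reproduced directly from the formula (shown here for $d = 128$; note that for finite $d$ the last pair's wavelength falls slightly short of the limiting $2\pi \times 10000 \approx 62{,}832$ used in the table):

```python
import numpy as np

d = 128
pair = np.arange(d // 2)
theta = 10000.0 ** (-2 * pair / d)
wavelength = 2 * np.pi / theta

for i in [0, d // 8, d // 4, d // 2 - 1]:
    print(f"pair {i:>3}: theta = {theta[i]:.6f}, wavelength ~ {wavelength[i]:,.1f} tokens")
```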

Complete NumPy Implementation

import numpy as np

def rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """Compute RoPE frequency schedule."""
    i = np.arange(0, dim, 2, dtype=np.float64)
    freqs = base ** (-i / dim)
    return freqs

def apply_rope_to_vector(x: np.ndarray, pos: int,
                          freqs: np.ndarray) -> np.ndarray:
    """Apply RoPE rotation to a single vector at a given position."""
    d = x.shape[0]
    assert d == len(freqs) * 2

    angles = pos * freqs  # (d/2,)

    cos_a = np.cos(angles)
    sin_a = np.sin(angles)

    x_even = x[0::2]
    x_odd = x[1::2]

    rotated = np.empty_like(x)
    rotated[0::2] = x_even * cos_a - x_odd * sin_a
    rotated[1::2] = x_even * sin_a + x_odd * cos_a

    return rotated

def verify_relative_position_property():
    """
    Verify that RoPE attention scores depend only on relative position.

    For positions (m, n) and (m', n') where m-n = m'-n',
    the dot product q_m · k_n should equal q_m' · k_n'.
    """
    dim = 64
    freqs = rope_frequencies(dim)

    np.random.seed(42)
    q_content = np.random.randn(dim)
    k_content = np.random.randn(dim)

    print("Verification: q·k depends only on relative position (m-n)")
    print(f"{'(m, n)':>10} {'m-n':>6} {'q_rot · k_rot':>15}")
    print("-" * 35)

    # Test various (m, n) pairs with same relative distance
    for m, n in [(0, 3), (5, 8), (100, 103), (1000, 1003)]:
        q_rot = apply_rope_to_vector(q_content, m, freqs)
        k_rot = apply_rope_to_vector(k_content, n, freqs)
        score = np.dot(q_rot, k_rot)
        print(f"({m:>4}, {n:>4}) {m-n:>6} {score:>15.10f}")

    print()
    # Different relative distances
    for m, n in [(0, 0), (0, 1), (0, 5), (0, 50), (0, 500)]:
        q_rot = apply_rope_to_vector(q_content, m, freqs)
        k_rot = apply_rope_to_vector(k_content, n, freqs)
        score = np.dot(q_rot, k_rot)
        print(f"({m:>4}, {n:>4}) {m-n:>6} {score:>15.10f}")

verify_relative_position_property()

Output (the numeric values below are illustrative rather than exact; the key property is that all four scores in the first block are identical):

Verification: q·k depends only on relative position (m-n)
  (m, n)    m-n   q_rot · k_rot
-----------------------------------
(   0,    3)    -3    2.4871234567  ← same!
(   5,    8)    -3    2.4871234567  ← same!
( 100,  103)    -3    2.4871234567  ← same!
(1000, 1003)    -3    2.4871234567  ← same!

(   0,    0)     0    5.1234567890
(   0,    1)    -1    4.8765432109
(   0,    5)    -5    1.2345678901
(   0,   50)   -50   -0.5678901234
(   0,  500)  -500    0.0234567890

All pairs with relative distance -3 produce the identical dot product.

NTK-Aware Scaling

When extending RoPE beyond its training length, the angles at high positions exceed the training distribution. NTK-aware scaling rescales the base frequency:

\text{base}' = \text{base} \times \alpha^{d/(d-2)}

Where $\alpha = \text{target\_length} / \text{train\_length}$.

This stretches the wavelengths non-uniformly. Substituting the scaled base into $\theta_i = \text{base}^{-2i/d}$ gives

\lambda'_i = \alpha^{2i/(d-2)} \times \lambda_i

so the highest-frequency pair ($i = 0$) is left essentially untouched, preserving local resolution, while the lowest-frequency pair ($i = d/2 - 1$) is stretched by exactly $\alpha$ to cover the extended context.

def ntk_scaled_rope(dim, base=10000.0, train_len=4096, target_len=128000):
    """NTK-aware RoPE scaling for length extension."""
    alpha = target_len / train_len
    scaled_base = base * alpha ** (dim / (dim - 2))

    original_freqs = rope_frequencies(dim, base)
    scaled_freqs = rope_frequencies(dim, scaled_base)

    print(f"Scale factor α = {alpha:.1f}")
    print(f"Base: {base:.0f}{scaled_base:.0f}")
    print(f"\nFrequency comparison (first 5 pairs):")
    print(f"{'Pair':>6} {'Original θ':>12} {'Scaled θ':>12} {'Ratio':>8}")
    print("-" * 42)

    for i in range(5):
        ratio = original_freqs[i] / scaled_freqs[i]
        print(f"{i:>6} {original_freqs[i]:>12.6f} {scaled_freqs[i]:>12.6f} {ratio:>8.1f}×")

    return scaled_freqs

ntk_scaled_rope(dim=128, train_len=4096, target_len=128000)
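The per-pair effect of the base change can be checked directly: substituting the scaled base into $\theta_i = \text{base}^{-2i/d}$ shows each wavelength stretches by $\alpha^{2i/(d-2)}$, so pair 0 is unchanged and the last pair stretches by exactly $\alpha$. A self-contained sketch:

```python
import numpy as np

d = 128
base = 10000.0
alpha = 128000 / 4096  # 31.25

pair = np.arange(d // 2)
theta = base ** (-2 * pair / d)
scaled_base = base * alpha ** (d / (d - 2))
theta_scaled = scaled_base ** (-2 * pair / d)

stretch = theta / theta_scaled  # per-pair wavelength stretch factor

assert np.isclose(stretch[0], 1.0)     # finest pair: unchanged
assert np.isclose(stretch[-1], alpha)  # coarsest pair: stretched by alpha
assert np.allclose(stretch, alpha ** (2 * pair / (d - 2)))
```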

YaRN: Yet Another RoPE Extension

YaRN (Peng et al., 2023) improves on NTK scaling by applying different scaling factors to different frequency bands:

s_i = \begin{cases} 1 & \text{if } \lambda_i < \lambda_{\text{short}} \\ \alpha & \text{if } \lambda_i > \lambda_{\text{long}} \\ \text{interpolate} & \text{otherwise} \end{cases}

This preserves the model’s ability to handle short-range relationships while extending its reach for long-range ones.
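A sketch of the banded-scaling idea (the thresholds `lam_short`/`lam_long` and the linear ramp here are illustrative placeholders, not the paper's parameterization; YaRN defines the ramp via rotation counts per wavelength):

```python
import numpy as np

def yarn_scale_factors(dim: int, base: float = 10000.0, alpha: float = 31.25,
                       lam_short: float = 128.0,
                       lam_long: float = 8192.0) -> np.ndarray:
    """Per-pair scale factors: 1 for short wavelengths, alpha for long
    ones, linear interpolation in between. Thresholds are illustrative."""
    i = np.arange(0, dim, 2, dtype=np.float64)
    wavelengths = 2.0 * np.pi * base ** (i / dim)  # lambda_i = 2*pi*base^(2i/d)
    t = np.clip((wavelengths - lam_short) / (lam_long - lam_short), 0.0, 1.0)
    return 1.0 + t * (alpha - 1.0)

s = yarn_scale_factors(128)
assert s[0] == 1.0               # short-range pairs untouched
assert np.isclose(s[-1], 31.25)  # longest-range pair fully scaled
```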

The Practical Impact

RoPE has become the default position encoding for a reason:

  1. Relative position emerges naturally from the rotation structure
  2. Length extension is possible through frequency scaling (NTK, YaRN)
  3. Efficient implementation — just element-wise multiplication by precomputed sin/cos
  4. No learned parameters for position encoding (the model learns to use positions through Q/K weights)

Open models like Llama and Mistral, and most of their descendants, encode positions with RoPE under the hood.
