Rotary Position Embeddings: The Full Mathematical Derivation

RoPE is used by virtually every modern LLM. Here's the complete derivation from first principles, proof of the relative position property, and NTK-aware scaling.

The Spinning Wheel Analogy

Imagine a set of wheels spinning at different speeds. A slow wheel (like an hour hand) completes one rotation every 12 hours. A fast wheel (like a second hand) completes 60 rotations per hour. If you mark a dot on each wheel, the pattern of dot positions after any time $t$ uniquely identifies that time.

RoPE works exactly like this. Each pair of dimensions is a “wheel” spinning at a different frequency. The position of a token is encoded by the pattern of rotations across all these wheels. And the beautiful part: when two rotated vectors are compared (dot product), the result depends only on the difference in rotation — which is the relative position.

Starting From First Principles

The Goal

We want a position encoding function $f(x, m)$ that, when applied to query $q$ at position $m$ and key $k$ at position $n$, produces an attention score that depends only on $x_q$, $x_k$, and the relative position $(m - n)$:

\langle f(x_q, m), f(x_k, n) \rangle = g(x_q, x_k, m - n)

2D Case: The Building Block

Let’s start with 2D vectors. We want a function $f: \mathbb{R}^2 \times \mathbb{N} \to \mathbb{R}^2$ such that:

f(x, m)^T f(y, n) = h(x, y, m - n)

Claim: $f(x, m) = R_\theta(m) x$ where $R_\theta(m)$ is a rotation matrix.

R_\theta(m) = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix}

Proof of Relative Position Property

f(x, m)^T f(y, n) = (R_\theta(m) x)^T (R_\theta(n) y)

= x^T R_\theta(m)^T R_\theta(n) y

Since rotation matrices satisfy $R(\alpha)^T = R(-\alpha)$:

= x^T R_\theta(-m) R_\theta(n) y

Since rotations compose by addition, $R(\alpha) R(\beta) = R(\alpha + \beta)$:

= x^T R_\theta(n - m) y

= g(x, y, n - m) \quad \checkmark

The inner product depends only on $x$, $y$, and the relative position $(n - m)$. QED.
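The proof is easy to check numerically. A minimal NumPy sketch (the helper `rot` is ours, defined only for this check):

```python
import numpy as np

def rot(angle: float) -> np.ndarray:
    """2x2 rotation matrix R(angle)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.5
x = np.array([1.0, 2.0])
y = np.array([-0.3, 0.7])

# Two (m, n) pairs with the same relative offset n - m = 4
score_a = np.dot(rot(3 * theta) @ x, rot(7 * theta) @ y)
score_b = np.dot(rot(50 * theta) @ x, rot(54 * theta) @ y)

assert np.isclose(score_a, score_b)  # score depends only on n - m
```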

Expanding the Dot Product

x^T R_\theta(n-m) y = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}^T \begin{pmatrix} \cos\Delta\theta & -\sin\Delta\theta \\ \sin\Delta\theta & \cos\Delta\theta \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}

= (x_1 y_1 + x_2 y_2) \cos\Delta\theta + (x_2 y_1 - x_1 y_2) \sin\Delta\theta

Where $\Delta = n - m$. This is a function of the content ($x$, $y$) modulated by a function of the relative position ($\Delta$).
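The expanded form can be verified against the matrix form directly (a small sketch; the vectors and angle are arbitrary):

```python
import numpy as np

theta = 0.3
m, n = 2, 9
delta = n - m

x = np.array([0.5, -1.2])
y = np.array([2.0, 0.4])

def R(k: int) -> np.ndarray:
    """Rotation by angle k * theta."""
    c, s = np.cos(k * theta), np.sin(k * theta)
    return np.array([[c, -s], [s, c]])

matrix_form = np.dot(R(m) @ x, R(n) @ y)
closed_form = (x[0] * y[0] + x[1] * y[1]) * np.cos(delta * theta) \
            + (x[1] * y[0] - x[0] * y[1]) * np.sin(delta * theta)

assert np.isclose(matrix_form, closed_form)
```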

Extension to $d$ Dimensions

For a $d$-dimensional vector (where $d$ is even), we partition the dimensions into $d/2$ pairs and apply an independent rotation to each pair:

R_d(m) = \text{diag}\left(R(m\theta_0), R(m\theta_1), \ldots, R(m\theta_{d/2-1})\right)

This is a block-diagonal matrix with $d/2$ rotation blocks.
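Because the matrix is block-diagonal, there is no need to materialize it: rotating each pair independently gives the same result in O(d) work rather than O(d²). A quick sketch:

```python
import numpy as np

def rot(angle: float) -> np.ndarray:
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

d, m = 8, 5
thetas = 10000.0 ** (-np.arange(0, d, 2) / d)  # theta_i = base^(-2i/d)

# Full block-diagonal R_d(m), built explicitly only for comparison
R_full = np.zeros((d, d))
for i, th in enumerate(thetas):
    R_full[2*i:2*i+2, 2*i:2*i+2] = rot(m * th)

x = np.random.default_rng(0).standard_normal(d)

# Pairwise application: what real implementations actually do
pairwise = np.empty_like(x)
for i, th in enumerate(thetas):
    pairwise[2*i:2*i+2] = rot(m * th) @ x[2*i:2*i+2]

assert np.allclose(R_full @ x, pairwise)
```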

Frequency Schedule

Each dimension pair $i$ gets its own frequency:

\theta_i = \text{base}^{-2i/d}

With the standard base of 10,000:

\theta_i = 10000^{-2i/d}

The frequencies form a geometric progression from $\theta_0 = 1$ (high frequency) down to $\theta_{d/2-1} = 10000^{-(d-2)/d} \approx 10^{-4}$ (low frequency).

Wavelengths

Each frequency $\theta_i$ corresponds to a wavelength:

\lambda_i = \frac{2\pi}{\theta_i} = 2\pi \times 10000^{2i/d}

Dimension pair i    Frequency θ_i    Wavelength λ_i
0                   1.0              2π ≈ 6.3 tokens
d/8                 0.1              ≈ 63 tokens
d/4                 0.01             ≈ 628 tokens
d/2 − 1             0.0001           ≈ 62,832 tokens

Low-frequency dimensions encode long-range position information. High-frequency dimensions encode fine-grained position differences.
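The table values can be reproduced directly from the formula (shown here for $d = 128$; note that for finite $d$ the last pair's wavelength falls slightly short of the limiting $2\pi \times 10000 \approx 62{,}832$ used in the table):

```python
import numpy as np

d = 128
pair = np.arange(d // 2)
theta = 10000.0 ** (-2 * pair / d)
wavelength = 2 * np.pi / theta

for i in [0, d // 8, d // 4, d // 2 - 1]:
    print(f"pair {i:>3}: theta = {theta[i]:.6f}, wavelength ~ {wavelength[i]:,.1f} tokens")
```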

Complete NumPy Implementation

import numpy as np

def rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """Compute RoPE frequency schedule."""
    i = np.arange(0, dim, 2, dtype=np.float64)
    freqs = base ** (-i / dim)
    return freqs

def apply_rope_to_vector(x: np.ndarray, pos: int,
                          freqs: np.ndarray) -> np.ndarray:
    """Apply RoPE rotation to a single vector at a given position."""
    d = x.shape[0]
    assert d == len(freqs) * 2

    angles = pos * freqs  # (d/2,)

    cos_a = np.cos(angles)
    sin_a = np.sin(angles)

    x_even = x[0::2]
    x_odd = x[1::2]

    rotated = np.empty_like(x)
    rotated[0::2] = x_even * cos_a - x_odd * sin_a
    rotated[1::2] = x_even * sin_a + x_odd * cos_a

    return rotated

def verify_relative_position_property():
    """
    Verify that RoPE attention scores depend only on relative position.

    For positions (m, n) and (m', n') where m-n = m'-n',
    the dot product q_m · k_n should equal q_m' · k_n'.
    """
    dim = 64
    freqs = rope_frequencies(dim)

    np.random.seed(42)
    q_content = np.random.randn(dim)
    k_content = np.random.randn(dim)

    print("Verification: q·k depends only on relative position (m-n)")
    print(f"{'(m, n)':>10} {'m-n':>6} {'q_rot · k_rot':>15}")
    print("-" * 35)

    # Test various (m, n) pairs with same relative distance
    for m, n in [(0, 3), (5, 8), (100, 103), (1000, 1003)]:
        q_rot = apply_rope_to_vector(q_content, m, freqs)
        k_rot = apply_rope_to_vector(k_content, n, freqs)
        score = np.dot(q_rot, k_rot)
        print(f"({m:>4}, {n:>4}) {m-n:>6} {score:>15.10f}")

    print()
    # Different relative distances
    for m, n in [(0, 0), (0, 1), (0, 5), (0, 50), (0, 500)]:
        q_rot = apply_rope_to_vector(q_content, m, freqs)
        k_rot = apply_rope_to_vector(k_content, n, freqs)
        score = np.dot(q_rot, k_rot)
        print(f"({m:>4}, {n:>4}) {m-n:>6} {score:>15.10f}")

verify_relative_position_property()

Output (the numeric values below are illustrative rather than exact; the key property is that all four scores in the first block are identical):

Verification: q·k depends only on relative position (m-n)
  (m, n)    m-n   q_rot · k_rot
-----------------------------------
(   0,    3)    -3    2.4871234567  ← same!
(   5,    8)    -3    2.4871234567  ← same!
( 100,  103)    -3    2.4871234567  ← same!
(1000, 1003)    -3    2.4871234567  ← same!

(   0,    0)     0    5.1234567890
(   0,    1)    -1    4.8765432109
(   0,    5)    -5    1.2345678901
(   0,   50)   -50   -0.5678901234
(   0,  500)  -500    0.0234567890

All pairs with relative distance -3 produce the identical dot product.

NTK-Aware Scaling

When extending RoPE beyond its training length, the angles at high positions exceed the training distribution. NTK-aware scaling rescales the base frequency:

\text{base}' = \text{base} \times \alpha^{d/(d-2)}

Where $\alpha = \text{target\_length} / \text{train\_length}$.

This stretches the wavelengths non-uniformly. Substituting the scaled base into $\theta_i = \text{base}^{-2i/d}$ gives

\lambda'_i = \alpha^{2i/(d-2)} \times \lambda_i

so the highest-frequency pair ($i = 0$) is left essentially untouched, preserving local resolution, while the lowest-frequency pair ($i = d/2 - 1$) is stretched by exactly $\alpha$ to cover the extended context.

def ntk_scaled_rope(dim, base=10000.0, train_len=4096, target_len=128000):
    """NTK-aware RoPE scaling for length extension."""
    alpha = target_len / train_len
    scaled_base = base * alpha ** (dim / (dim - 2))

    original_freqs = rope_frequencies(dim, base)
    scaled_freqs = rope_frequencies(dim, scaled_base)

    print(f"Scale factor α = {alpha:.1f}")
    print(f"Base: {base:.0f}{scaled_base:.0f}")
    print(f"\nFrequency comparison (first 5 pairs):")
    print(f"{'Pair':>6} {'Original θ':>12} {'Scaled θ':>12} {'Ratio':>8}")
    print("-" * 42)

    for i in range(5):
        ratio = original_freqs[i] / scaled_freqs[i]
        print(f"{i:>6} {original_freqs[i]:>12.6f} {scaled_freqs[i]:>12.6f} {ratio:>8.1f}×")

    return scaled_freqs

ntk_scaled_rope(dim=128, train_len=4096, target_len=128000)
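The per-pair effect of the base change can be checked directly: substituting the scaled base into $\theta_i = \text{base}^{-2i/d}$ shows each wavelength stretches by $\alpha^{2i/(d-2)}$, so pair 0 is unchanged and the last pair stretches by exactly $\alpha$. A self-contained sketch:

```python
import numpy as np

d = 128
base = 10000.0
alpha = 128000 / 4096  # 31.25

pair = np.arange(d // 2)
theta = base ** (-2 * pair / d)
scaled_base = base * alpha ** (d / (d - 2))
theta_scaled = scaled_base ** (-2 * pair / d)

stretch = theta / theta_scaled  # per-pair wavelength stretch factor

assert np.isclose(stretch[0], 1.0)     # finest pair: unchanged
assert np.isclose(stretch[-1], alpha)  # coarsest pair: stretched by alpha
assert np.allclose(stretch, alpha ** (2 * pair / (d - 2)))
```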

YaRN: Yet Another RoPE Extension

YaRN (Peng et al., 2023) improves on NTK scaling by applying different scaling factors to different frequency bands:

s_i = \begin{cases} 1 & \text{if } \lambda_i < \lambda_{\text{short}} \\ \alpha & \text{if } \lambda_i > \lambda_{\text{long}} \\ \text{interpolate} & \text{otherwise} \end{cases}

This preserves the model’s ability to handle short-range relationships while extending its reach for long-range ones.
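A sketch of the banded-scaling idea (the thresholds `lam_short`/`lam_long` and the linear ramp here are illustrative placeholders, not the paper's parameterization; YaRN defines the ramp via rotation counts per wavelength):

```python
import numpy as np

def yarn_scale_factors(dim: int, base: float = 10000.0, alpha: float = 31.25,
                       lam_short: float = 128.0,
                       lam_long: float = 8192.0) -> np.ndarray:
    """Per-pair scale factors: 1 for short wavelengths, alpha for long
    ones, linear interpolation in between. Thresholds are illustrative."""
    i = np.arange(0, dim, 2, dtype=np.float64)
    wavelengths = 2.0 * np.pi * base ** (i / dim)  # lambda_i = 2*pi*base^(2i/d)
    t = np.clip((wavelengths - lam_short) / (lam_long - lam_short), 0.0, 1.0)
    return 1.0 + t * (alpha - 1.0)

s = yarn_scale_factors(128)
assert s[0] == 1.0               # short-range pairs untouched
assert np.isclose(s[-1], 31.25)  # longest-range pair fully scaled
```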

The Practical Impact

RoPE has become the default position encoding for a reason:

  1. Relative position emerges naturally from the rotation structure
  2. Length extension is possible through frequency scaling (NTK, YaRN)
  3. Efficient implementation — just element-wise multiplication by precomputed sin/cos
  4. No learned parameters for position encoding (the model learns to use positions through Q/K weights)

Open models like Llama and Mistral, and most of their descendants, encode positions with RoPE under the hood.
