Position Encodings: How AI Knows Where Words Are (And Why It Breaks at Long Lengths)

Transformers have no inherent notion of order. RoPE encodes relative positions via rotation matrices. Here's the full math from sinusoidal to NTK-aware scaling.

The Shuffled Deck Analogy

Imagine reading a sentence where all the words are on separate cards, shuffled randomly: “dog the lazy over jumps fox brown quick the.” Without knowing the original order, the meaning is lost.

Transformers have the same problem. Unlike humans who read left-to-right, transformers process all tokens simultaneously — they see a set of tokens, not a sequence. Without position information, “the dog bit the man” and “the man bit the dog” would be identical.

Position encodings inject order information into the tokens so the model knows which word came first, second, third, etc.
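To see that attention by itself is order-blind, here is a minimal sketch (a simplified single-head attention with no learned projections or masking; all names are illustrative): permuting the input tokens merely permutes the outputs the same way, so no order information survives.

```python
import numpy as np

np.random.seed(0)
tokens = np.random.randn(4, 8)  # 4 token embeddings, dimension 8

def self_attention(x):
    """Minimal single-head self-attention (no projections, no mask)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ x

out = self_attention(tokens)

perm = [2, 0, 3, 1]
out_perm = self_attention(tokens[perm])

# Shuffling the tokens just shuffles the outputs identically:
# attention alone carries no notion of position.
print(np.allclose(out[perm], out_perm))  # True
```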

Sinusoidal Position Encodings (Original Transformer)

The original transformer (Vaswani et al., 2017) used sine and cosine waves at different frequencies:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)

PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)

Where pos is the position and i is the dimension index.

import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """Original sinusoidal position encoding from 'Attention Is All You Need'."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]  # (max_len, 1)
    div_term = 10000 ** (2 * np.arange(d_model // 2) / d_model)  # (d/2,)

    pe[:, 0::2] = np.sin(position / div_term)  # Even dimensions
    pe[:, 1::2] = np.cos(position / div_term)  # Odd dimensions

    return pe

# Generate for 100 positions, dimension 64
pe = sinusoidal_position_encoding(100, 64)
print(f"Position encoding shape: {pe.shape}")
print(f"Position 0, first 8 dims: {pe[0, :8].round(3)}")
print(f"Position 1, first 8 dims: {pe[1, :8].round(3)}")

The problem: sinusoidal encodings are absolute. Each position gets a fixed vector, and the model must learn from scratch how position 5 relates to position 10. The encoding itself can be computed at any position, but the attention weights only learn to interpret the positions seen during training, so quality degrades beyond the training length.
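That said, the sinusoidal design does carry relative information in its geometry: the dot product between two encodings depends only on their offset. A quick self-contained check (the function mirrors the one above; the specific sizes are illustrative):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal position encoding, as in 'Attention Is All You Need'."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]
    div_term = 10000 ** (2 * np.arange(d_model // 2) / d_model)
    pe[:, 0::2] = np.sin(position / div_term)
    pe[:, 1::2] = np.cos(position / div_term)
    return pe

pe = sinusoidal_pe(200, 64)

# PE(m) . PE(n) = sum_i cos((m - n) / 10000^{2i/d}), by the identity
# sin(a)sin(b) + cos(a)cos(b) = cos(a - b): it depends only on m - n.
print(np.isclose(pe[10] @ pe[5], pe[105] @ pe[100]))  # True
```

The relative information is there, but the model still has to learn to extract it; RoPE builds the relative dependence directly into the attention score instead.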

RoPE: Rotary Position Embeddings

RoPE (Su et al., 2021) is the position encoding used by virtually every modern LLM (Llama, Mistral, Qwen, Gemini). The key idea: encode positions by rotating query and key vectors.

The Core Insight

Instead of adding a position vector, RoPE rotates the query and key vectors by an angle proportional to their position. When computing the dot product q \cdot k, the rotation angles cancel in a way that leaves only the relative position (m - n).

The 2D Rotation Matrix

For a pair of dimensions (2i, 2i+1) at position m, apply a rotation:

R(m, \theta_i) = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}

Where \theta_i = 10000^{-2i/d} is the frequency for dimension pair i.
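A quick numerical sketch of this 2D block (illustrative values): rotating a query/key pair by angles proportional to position leaves the dot product unchanged whenever the relative offset is the same.

```python
import numpy as np

def rotate_2d(pair, angle):
    """Rotate one 2D dimension pair by the given angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]]) @ pair

theta = 0.5
q2 = np.array([0.3, -1.2])
k2 = np.array([0.8, 0.5])

# Positions (7, 4) and (12, 9) are both 3 apart, so the rotated
# dot products agree even though the absolute positions differ.
a = rotate_2d(q2, 7 * theta) @ rotate_2d(k2, 4 * theta)
b = rotate_2d(q2, 12 * theta) @ rotate_2d(k2, 9 * theta)
print(np.isclose(a, b))  # True
```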

The Full Rotation Matrix

For a d-dimensional vector, RoPE applies independent rotations to d/2 pairs:

R_d(m) = \text{diag}\left(R(m, \theta_0), R(m, \theta_1), \ldots, R(m, \theta_{d/2-1})\right)

The rotated query and key at positions m and n:

q_m = R_d(m) W_q x_m, \quad k_n = R_d(n) W_k x_n

The Relative Position Proof

The attention score between positions m and n:

q_m^T k_n = (R_d(m) W_q x_m)^T (R_d(n) W_k x_n)

= x_m^T W_q^T R_d(m)^T R_d(n) W_k x_n

Since rotation matrices satisfy R(\alpha)^T R(\beta) = R(\beta - \alpha):

R_d(m)^T R_d(n) = R_d(n - m)

Therefore:

q_m^T k_n = x_m^T W_q^T R_d(n - m) W_k x_n

The score depends only on the relative position (n - m), not the absolute positions m and n. QED.
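The key identity can be checked numerically by building the block-diagonal matrix explicitly (a sketch; `full_rotation_matrix` is an illustrative helper, not how production code applies RoPE, which uses the cheaper elementwise form):

```python
import numpy as np

def full_rotation_matrix(m, dim, base=10000.0):
    """Block-diagonal RoPE rotation matrix R_d(m) (illustrative only)."""
    R = np.zeros((dim, dim))
    freqs = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    for i, theta in enumerate(freqs):
        c, s = np.cos(m * theta), np.sin(m * theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

dim = 8
m, n = 11, 4

# Check R_d(m)^T R_d(n) == R_d(n - m).
lhs = full_rotation_matrix(m, dim).T @ full_rotation_matrix(n, dim)
rhs = full_rotation_matrix(n - m, dim)
print(np.allclose(lhs, rhs))  # True
```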

NumPy Implementation

import numpy as np

def compute_rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """Compute the rotation frequencies for RoPE."""
    freqs = 1.0 / (base ** (np.arange(0, dim, 2).astype(float) / dim))
    return freqs

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0):
    """
    Apply Rotary Position Embeddings to input tensor.

    Args:
        x: Input tensor of shape (seq_len, dim)
        positions: Position indices of shape (seq_len,)
        base: Base frequency (default 10000)

    Returns:
        Rotated tensor of same shape
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "Dimension must be even"

    # Compute frequencies: θ_i = base^(-2i/d)
    freqs = compute_rope_frequencies(dim, base)

    # Compute angles: m * θ_i for each position m
    angles = np.outer(positions, freqs)  # (seq_len, dim/2)

    # Split input into pairs
    x_even = x[:, 0::2]  # Even dimensions
    x_odd = x[:, 1::2]   # Odd dimensions

    # Apply rotation
    cos_angles = np.cos(angles)
    sin_angles = np.sin(angles)

    x_rotated_even = x_even * cos_angles - x_odd * sin_angles
    x_rotated_odd = x_even * sin_angles + x_odd * cos_angles

    # Interleave back
    result = np.empty_like(x)
    result[:, 0::2] = x_rotated_even
    result[:, 1::2] = x_rotated_odd

    return result

# Demonstrate RoPE
dim = 8
seq_len = 5

# Random query and key vectors
np.random.seed(42)
q = np.random.randn(seq_len, dim)
k = np.random.randn(seq_len, dim)

positions = np.arange(seq_len)

# Apply RoPE
q_rotated = apply_rope(q, positions)
k_rotated = apply_rope(k, positions)

# Note: with RoPE, scores_rope[i, j] depends on the inputs and the
# offset i - j only. Shifting both positions by the same amount
# (e.g. rotating the same q, k at positions (2, 0) vs (7, 5))
# leaves the score unchanged.
print("Attention scores (raw):")
scores_raw = q @ k.T
print(scores_raw.round(3))

print("\nAttention scores (with RoPE):")
scores_rope = q_rotated @ k_rotated.T
print(scores_rope.round(3))

Why RoPE Breaks at Long Lengths

A RoPE model is trained with a specific maximum context length (e.g., 4K or 8K tokens). At positions beyond the training length, the rotation angles m\theta_i exceed the range seen during training.

For high-frequency components (large \theta_i), the angles wrap around many times even within the training window, so extrapolated angles look familiar. But for the lowest-frequency component (\theta_{\min} \approx 10000^{-1}), the angle at position n is:

\text{angle} = n \times \theta_{\min} \approx n \times 10000^{-1}

At n = 4096 (trained): angle \approx 0.41 radians.
At n = 128000 (extrapolating): angle \approx 12.8 radians.

The model has never seen angles this large — it doesn’t know how to interpret them.
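To put numbers on this, a small sketch (assuming a typical head dimension of 128; with d = 128 the smallest frequency is 10000^{-126/128}, close to the idealized 10000^{-1} above, so the exact values shift slightly):

```python
import numpy as np

def slowest_angle(pos, dim=128, base=10000.0):
    """Rotation angle of the lowest-frequency pair at a given position."""
    theta_min = base ** (-(dim - 2) / dim)  # smallest theta_i
    return pos * theta_min

# Within training, the slowest pair never completes a full turn;
# far beyond it, the angle blows past anything seen in training.
print(f"trained (n=4096):        {slowest_angle(4096):.2f} rad")
print(f"extrapolated (n=128000): {slowest_angle(128_000):.2f} rad")
print(f"2*pi:                    {2 * np.pi:.2f} rad")
```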

NTK-Aware Scaling

NTK-aware scaling (bloc97, 2023) modifies the base frequency to spread the same angular range over more positions:

\theta'_i = (\text{base}')^{-2i/d}

Where \alpha = \text{target\_length} / \text{train\_length}. The base is scaled so that the lowest frequency is slowed down by exactly a factor of \alpha:

\text{base}' = \text{base} \times \alpha^{d/(d-2)}

def apply_rope_ntk_scaled(x, positions, base=10000.0,
                           train_length=4096, target_length=128000):
    """RoPE with NTK-aware scaling for length extension."""
    alpha = target_length / train_length
    # Scale the base frequency
    scaled_base = base * alpha ** (x.shape[1] / (x.shape[1] - 2))
    return apply_rope(x, positions, base=scaled_base)
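A quick check of the arithmetic (self-contained, with an assumed head dimension of 128): after scaling, the lowest-frequency angle at the target length lands exactly where it did at the training length.

```python
import numpy as np

base, dim = 10000.0, 128  # dim = 128 is an assumed head dimension
train_len, target_len = 4096, 128_000
alpha = target_len / train_len

scaled_base = base * alpha ** (dim / (dim - 2))

def slowest_angle(pos, b, d=dim):
    """Lowest-frequency rotation angle at a given position."""
    return pos * b ** (-(d - 2) / d)

# NTK scaling stretches the lowest frequency by exactly alpha, so the
# angle at target_len under the scaled base equals the angle at
# train_len under the original base.
print(np.isclose(slowest_angle(target_len, scaled_base),
                 slowest_angle(train_len, base)))  # True
```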

ALiBi: A Simpler Alternative

Attention with Linear Biases (ALiBi) (Press et al., 2022) takes a completely different approach. Instead of modifying embeddings, it adds a linear penalty based on distance:

\text{score}_{ij} = q_i \cdot k_j - m \cdot |i - j|

Where m is a head-specific slope. Each head has a different slope, allowing different heads to have different “attention spans.”

def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    """Compute ALiBi attention bias matrix."""
    # Slopes: geometric sequence from 2^(-8/n_heads) to 2^(-8)
    slopes = 2 ** (-8 * np.arange(1, n_heads + 1) / n_heads)

    # Distance matrix
    positions = np.arange(seq_len)
    distances = positions[:, None] - positions[None, :]  # (seq_len, seq_len)

    # Bias: -slope * |distance| for each head
    bias = np.zeros((n_heads, seq_len, seq_len))
    for h in range(n_heads):
        bias[h] = -slopes[h] * np.abs(distances)

    return bias

bias = alibi_bias(10, 8)
print(f"ALiBi bias shape: {bias.shape}")
print(f"Head 0, first 5×5:\n{bias[0, :5, :5].round(3)}")

ALiBi pros: no learned parameters; extrapolates naturally to any length.
ALiBi cons: a fixed linear decay may not capture all positional relationships.

Comparison

| Method | Type | Length Extrapolation | Used By |
| --- | --- | --- | --- |
| Sinusoidal | Absolute | Poor | Original Transformer |
| Learned | Absolute | None | GPT-2 |
| RoPE | Relative (rotation) | Good (with scaling) | Llama, Mistral, Qwen |
| ALiBi | Relative (bias) | Excellent | MPT, BLOOM |
| NTK-RoPE | Relative (scaled rotation) | Very good | Extended Llama |
| YaRN | Relative (mixed scaling) | Very good | Many fine-tuned models |

RoPE dominates the field today because it offers the best combination of quality, flexibility, and extensibility.


ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai