How the Transformer Attention Mechanism Actually Works (No Math Required)

The attention mechanism is the beating heart of every LLM. Here's how it decides which parts of your conversation matter most — explained with analogies before equations.

Start Simple: The Library Analogy

Imagine you’re in a library looking for information about “climate change effects on coral reefs.” You have three tools:

  1. Your question (what you’re looking for) → This is the Query (Q)
  2. The catalog cards (what each book is about) → These are the Keys (K)
  3. The actual books (the information itself) → These are the Values (V)

Here’s how you search:

  1. You compare your question against every catalog card to find which books are relevant
  2. The most relevant books get your full attention; irrelevant books get ignored
  3. You read the actual content from the relevant books

That’s attention in a nutshell. The transformer does this for every single token it generates — asking “which parts of the context are most relevant right now?”
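The library analogy can be sketched as a "soft lookup": instead of picking one best match, the query scores every catalog card and blends the books in proportion to relevance. This is a toy illustration with made-up two-dimensional scores, not part of any real model:

```python
import numpy as np

# Hypothetical catalog: each "card" (key) is a small vector describing a book.
keys = {
    "coral reef bleaching": np.array([0.9, 0.1]),
    "arctic ice melt":      np.array([0.2, 0.8]),
    "reef fish habitats":   np.array([0.8, 0.3]),
}
values = {
    "coral reef bleaching": "Warming water expels symbiotic algae.",
    "arctic ice melt":      "Sea ice minimums keep shrinking.",
    "reef fish habitats":   "Reefs shelter a large share of marine fish.",
}

query = np.array([1.0, 0.0])  # "climate change effects on coral reefs"

# Score every card against the question, then normalize with softmax
# so the relevance weights sum to 1.
scores = {k: float(query @ v) for k, v in keys.items()}
exp_scores = {k: np.exp(s) for k, s in scores.items()}
total = sum(exp_scores.values())
weights = {k: e / total for k, e in exp_scores.items()}

for k in sorted(weights, key=weights.get, reverse=True):
    print(f"{weights[k]:.2f}  {values[k]}")
```

The coral-reef books end up with most of the weight, but every book contributes a little — exactly the "weighted blend" behavior the real mechanism has.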

Query, Key, Value: The Three Vectors

Every token in the context gets transformed into three vectors through learned weight matrices:

Q = X \cdot W_Q \quad \text{(Query: "What am I looking for?")}

K = X \cdot W_K \quad \text{(Key: "What do I contain?")}

V = X \cdot W_V \quad \text{(Value: "Here's my actual information")}

Where:

  - X is the matrix of input token embeddings (one row per token)
  - W_Q, W_K, W_V are learned weight matrices
  - Q, K, V are the resulting query, key, and value matrices

The Attention Formula

The complete attention computation in one equation:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Let’s break this down step by step.

Step 1: Compute Relevance Scores (QK^T)

Multiply every query against every key to get a “relevance score”:

S = QK^T \in \mathbb{R}^{n \times n}

This creates an n × n matrix where entry S_ij tells us how relevant token j is when generating token i.

For n = 4 tokens ("The cat sat down"):

S = \begin{bmatrix} s_{11} & s_{12} & s_{13} & s_{14} \\ s_{21} & s_{22} & s_{23} & s_{24} \\ s_{31} & s_{32} & s_{33} & s_{34} \\ s_{41} & s_{42} & s_{43} & s_{44} \end{bmatrix}

Step 2: Scale by √d_k

Without scaling, the dot products can grow very large (their variance grows proportionally with d_k), pushing softmax into regions where gradients are tiny. The scaling keeps things numerically stable:

S_{\text{scaled}} = \frac{S}{\sqrt{d_k}}

Why √d_k? If the entries of a query vector q and a key vector k each have variance 1, their dot product q · k has variance d_k. Dividing by √d_k brings the variance back to 1.
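This variance claim is easy to check empirically. A quick sketch drawing many random unit-variance vectors and measuring the variance of their dot products, before and after scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n_samples = 100_000

# Entries of q and k are drawn with mean 0 and variance 1.
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

dots = np.sum(q * k, axis=1)      # raw dot products
scaled = dots / np.sqrt(d_k)      # after the 1/sqrt(d_k) scaling

print(f"var(q . k)           ~ {dots.var():.1f}  (expect ~{d_k})")
print(f"var(q . k / sqrt(dk)) ~ {scaled.var():.3f}  (expect ~1)")
```

The unscaled variance comes out near 64; after dividing by √64 it returns to roughly 1, which keeps the softmax inputs in a well-behaved range.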

Step 3: Apply Softmax (Normalize)

Softmax converts raw scores into a probability distribution (all positive, summing to 1):

\alpha_{ij} = \frac{\exp(S_{ij} / \sqrt{d_k})}{\sum_{l=1}^{n} \exp(S_{il} / \sqrt{d_k})}

Now each row sums to 1, and α_ij represents the fraction of attention token i pays to token j.

Step 4: Weighted Sum of Values

Finally, use the attention weights to create a weighted combination of Value vectors:

\text{Output}_i = \sum_{j=1}^{n} \alpha_{ij} \cdot V_j

Each output token is a blend of all Value vectors, weighted by their relevance.

See It in Code

Here’s attention implemented from scratch in NumPy:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """
    Compute scaled dot-product attention.

    Args:
        Q: Query matrix, shape (n, d_k)
        K: Key matrix, shape (n, d_k)
        V: Value matrix, shape (n, d_v)

    Returns:
        Output: Weighted values, shape (n, d_v)
        Attention weights: shape (n, n)
    """
    d_k = K.shape[-1]

    # Step 1: Compute relevance scores
    scores = Q @ K.T  # (n, n)

    # Step 2: Scale
    scores = scores / np.sqrt(d_k)

    # Step 3: Softmax (numerically stable version)
    scores_max = scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores - scores_max)
    attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

    # Step 4: Weighted sum of values
    output = attention_weights @ V  # (n, d_v)

    return output, attention_weights


# Example: 4 tokens, dimension 8
np.random.seed(42)
n_tokens = 4
d_k = 8

# Random embeddings for "The cat sat down"
X = np.random.randn(n_tokens, d_k)

# Random weight matrices (in real models, these are learned)
W_Q = np.random.randn(d_k, d_k) * 0.1
W_K = np.random.randn(d_k, d_k) * 0.1
W_V = np.random.randn(d_k, d_k) * 0.1

# Project to Q, K, V
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Compute attention
output, weights = scaled_dot_product_attention(Q, K, V)

print("Attention Weights (which tokens attend to which):")
tokens = ["The", "cat", "sat", "down"]
print(f"{'':>8}", end="")
for t in tokens:
    print(f"{t:>8}", end="")
print()

for i, row in enumerate(weights):
    print(f"{tokens[i]:>8}", end="")
    for w in row:
        print(f"{w:>8.3f}", end="")
    print()

Example output:

Attention Weights (which tokens attend to which):
              The     cat     sat    down
     The   0.312   0.198   0.241   0.249
     cat   0.215   0.354   0.189   0.242
     sat   0.267   0.203   0.301   0.229
    down   0.238   0.271   0.217   0.274

Multi-Head Attention: Looking from Multiple Angles

One attention head captures one “type” of relevance. But language has many types of relationships — syntactic, semantic, positional, etc.

Multi-head attention runs multiple attention operations in parallel, each with its own weight matrices:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \cdot W_O

Where each head has a smaller dimension:

\text{head}_i = \text{Attention}(Q W_Q^i, K W_K^i, V W_V^i)

If the model dimension is d = 768 and we use h = 12 heads, each head operates on d_k = 768 / 12 = 64 dimensions.

def multi_head_attention(X, n_heads, d_model):
    """Multi-head attention with n_heads."""
    d_k = d_model // n_heads
    heads = []

    for i in range(n_heads):
        # Each head has its own projection matrices
        W_Q = np.random.randn(d_model, d_k) * 0.1
        W_K = np.random.randn(d_model, d_k) * 0.1
        W_V = np.random.randn(d_model, d_k) * 0.1

        Q = X @ W_Q
        K = X @ W_K
        V = X @ W_V

        head_output, _ = scaled_dot_product_attention(Q, K, V)
        heads.append(head_output)

    # Concatenate all heads
    concat = np.concatenate(heads, axis=-1)  # (n, d_model)

    # Final projection
    W_O = np.random.randn(d_model, d_model) * 0.1
    output = concat @ W_O

    return output

Each head learns to focus on different things. In trained models, individual heads have been observed tracking syntactic dependencies, coreference (which pronoun refers to which noun), and simple positional patterns such as attending to the previous token.

Self-Attention vs. Cross-Attention

Self-attention: Q, K, and V all come from the same sequence. The text attends to itself. This is what happens inside each transformer layer.

\text{Self-Attention: } Q = XW_Q, \quad K = XW_K, \quad V = XW_V

Cross-attention: Q comes from one sequence, K and V from another. Used in encoder-decoder models (like machine translation) where the decoder attends to the encoder’s output.

\text{Cross-Attention: } Q = X_{\text{decoder}}W_Q, \quad K = X_{\text{encoder}}W_K, \quad V = X_{\text{encoder}}W_V

Most modern LLMs (GPT, Claude, Llama) use only self-attention — they’re decoder-only models.
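The only difference between the two is where Q, K, and V come from; the computation itself is identical. A minimal self-contained sketch of cross-attention, using random stand-in embeddings purely to illustrate the shapes:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (same math as self-attention)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 8
X_decoder = rng.standard_normal((3, d))  # e.g. 3 target-language tokens
X_encoder = rng.standard_normal((5, d))  # e.g. 5 source-language tokens

W_Q, W_K, W_V = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

# Queries come from the decoder; keys and values come from the encoder.
out = attention(X_decoder @ W_Q, X_encoder @ W_K, X_encoder @ W_V)
print(out.shape)  # one output row per decoder token
```

Note that the output has one row per decoder token (the sequence asking the questions), while the 5 encoder tokens only contribute keys and values.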

Why Attention Is the Bottleneck

The QK^T multiplication creates an n × n matrix. For a 200K-token context:

200,000 × 200,000 = 40,000,000,000 (40 billion entries)

That’s 40 billion attention scores per head, per layer, per forward pass. At FP16 (2 bytes each):

40 × 10^9 × 2 bytes = 80 GB

Just storing the attention matrix for one head of one layer would need 80 GB of GPU memory. This is why techniques like Flash Attention (which never materializes the full matrix) are essential.
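The memory arithmetic above can be sketched directly:

```python
def attention_matrix_bytes(n_tokens: int, bytes_per_entry: int = 2) -> int:
    """Memory needed to materialize one n x n attention matrix
    (per head, per layer), at the given precision (FP16 = 2 bytes)."""
    return n_tokens * n_tokens * bytes_per_entry

for n in (4_096, 32_768, 200_000):
    gb = attention_matrix_bytes(n) / 1e9
    print(f"{n:>7} tokens -> {gb:>9,.2f} GB per head, per layer")
```

Because the cost is quadratic, going from a 4K to a 200K context multiplies the attention-matrix memory by roughly 2,400x, not 50x.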

The Computational Complexity

For n tokens and head dimension d_k:

  1. QK^T: multiplying an (n × d_k) matrix by a (d_k × n) matrix costs O(n² d_k)
  2. Softmax: O(n²)
  3. Multiplying by V: (n × n) times (n × d_v) costs O(n² d_v)

Total per head: O(n² d_k)

With h heads, where d_k = d/h:

\text{Total} = h \times O(n^2 \cdot d/h) = O(n^2 d)

This O(n²) scaling is the fundamental reason why context windows are expensive and why the field is actively researching linear attention alternatives.

The Key Insight

Every token competes for attention. Attention is a finite resource that sums to 1. More tokens = less attention per token. This isn’t a bug — it’s a mathematical consequence of how softmax works.

Understanding this explains why AI gets worse with longer context, why important information should go at the start or end, and why smart context management is more important than just making context windows bigger.
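The "finite resource" point follows directly from softmax: every row of attention weights sums to 1, so each extra token shrinks everyone else's share. A toy demonstration, with one "important" token holding a fixed relevance score while increasingly many filler tokens (random scores, hypothetical values) compete with it:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
important_score = 2.0  # fixed relevance of the token we care about

shares = []
for n_filler in (10, 100, 1_000, 10_000):
    # One important token plus n_filler moderately relevant tokens.
    scores = np.concatenate([[important_score],
                             rng.normal(0.0, 1.0, n_filler)])
    shares.append(softmax(scores)[0])
    print(f"{n_filler:>6} filler tokens -> "
          f"important token gets {shares[-1]:.4f} of attention")
```

Even though the important token's score never changes, its slice of the attention budget collapses as the context fills up — the dilution is baked into the normalization.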


ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai