How the Transformer Attention Mechanism Actually Works
Start Simple: The Library Analogy
Imagine you’re in a library looking for information about “climate change effects on coral reefs.” You have three tools:
- Your question (what you’re looking for) → This is the Query (Q)
- The catalog cards (what each book is about) → These are the Keys (K)
- The actual books (the information itself) → These are the Values (V)
Here’s how you search:
- You compare your question against every catalog card to find which books are relevant
- The most relevant books get your full attention; irrelevant books get ignored
- You read the actual content from the relevant books
That’s attention in a nutshell. The transformer does this for every single token it generates — asking “which parts of the context are most relevant right now?”
Query, Key, Value: The Three Vectors
Every token in the context gets transformed into three vectors through learned weight matrices:
$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$
Where:
- $X$ is the input embedding (a vector representing the token)
- $W_Q, W_K, W_V$ are learned weight matrices (the model learns these during training)
- Each is of shape $d_{\text{model}} \times d_k$, where $d_{\text{model}}$ is the model dimension and $d_k$ is the head dimension
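To make the shapes concrete, here is a minimal sketch of the three projections for a toy sequence. The sizes (4 tokens, model dimension 16, head dimension 8) and the random matrices are assumptions chosen for illustration; in a trained model the weight matrices are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes, not taken from any real model.
n, d_model, d_k = 4, 16, 8

X = rng.standard_normal((n, d_model))             # one row per token embedding
W_Q = rng.standard_normal((d_model, d_k)) * 0.1   # random stand-ins for learned weights
W_K = rng.standard_normal((d_model, d_k)) * 0.1
W_V = rng.standard_normal((d_model, d_k)) * 0.1

# Each projection maps (n, d_model) to (n, d_k).
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

print(Q.shape, K.shape, V.shape)
```

Every token ends up with its own query, key, and value row, each of length `d_k`.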
The Attention Formula
The complete attention computation in one equation:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
Let’s break this down step by step.
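Step 1: Compute Relevance Scores ($QK^T$)
Multiply every query against every key to get a “relevance score”:
$$\text{scores} = Q K^T$$
This creates an $n \times n$ matrix where entry $(i, j)$ tells us how relevant token $j$ is when generating token $i$.
For $n = 4$ tokens (“The cat sat down”), numbering the tokens 1 to 4 in order:
$$Q K^T = \begin{bmatrix} q_1 \cdot k_1 & q_1 \cdot k_2 & q_1 \cdot k_3 & q_1 \cdot k_4 \\ q_2 \cdot k_1 & q_2 \cdot k_2 & q_2 \cdot k_3 & q_2 \cdot k_4 \\ q_3 \cdot k_1 & q_3 \cdot k_2 & q_3 \cdot k_3 & q_3 \cdot k_4 \\ q_4 \cdot k_1 & q_4 \cdot k_2 & q_4 \cdot k_3 & q_4 \cdot k_4 \end{bmatrix}$$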
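A small sketch of this step for the four-token example may help. The queries and keys here are random stand-ins (an assumption, since real ones come from learned projections), and `d_k = 8` is an arbitrary toy size. The point is only that entry (i, j) of the score matrix is the dot product of query i with key j.

```python
import numpy as np

rng = np.random.default_rng(1)
tokens = ["The", "cat", "sat", "down"]
n, d_k = len(tokens), 8             # d_k = 8 is an assumed toy size

Q = rng.standard_normal((n, d_k))   # random stand-ins for real queries
K = rng.standard_normal((n, d_k))   # random stand-ins for real keys

scores = Q @ K.T                    # shape (n, n)

# Entry (i, j) is the dot product of query i with key j.
i, j = 1, 2                         # "cat" scoring "sat"
print(scores.shape)
print(np.isclose(scores[i, j], Q[i] @ K[j]))
```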
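Step 2: Scale by $\sqrt{d_k}$
Without scaling, the dot products can grow very large (their variance grows in proportion to $d_k$), pushing softmax into regions where gradients are tiny. The scaling keeps things numerically stable:
$$\text{scaled scores} = \frac{Q K^T}{\sqrt{d_k}}$$
Why $\sqrt{d_k}$? If $Q$ and $K$ have independent entries with mean 0 and variance 1, then each dot product $q \cdot k$ has variance $d_k$. Dividing by $\sqrt{d_k}$ brings the variance back to 1.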
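The variance argument can be checked empirically. The sketch below draws random queries and keys with unit-variance entries (an idealized assumption; real activations are not exactly standard normal) and estimates the variance of their dot products before and after scaling. The sample count and the three values of `d_k` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20_000   # assumed number of random (q, k) pairs per setting

variances = {}
for d_k in (16, 64, 256):
    q = rng.standard_normal((n_samples, d_k))   # entries with mean 0, variance 1
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)                # one dot product per pair
    variances[d_k] = (dots.var(), (dots / np.sqrt(d_k)).var())

for d_k, (raw, scaled) in variances.items():
    print(f"d_k={d_k:>3}  var(q.k)={raw:7.1f}  var(q.k / sqrt(d_k))={scaled:.2f}")
```

The unscaled variance should land close to `d_k` in each row, while the scaled variance should stay near 1 regardless of the head dimension.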
Step 3: Apply Softmax (Normalize)
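Softmax converts raw scores into a probability distribution (all positive, summing to 1):
$$A_{ij} = \frac{\exp(s_{ij})}{\sum_{m=1}^{n} \exp(s_{im})}, \quad \text{where } s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}$$
Now each row sums to 1, and $A_{ij}$ represents the fraction of attention token $i$ pays to token $j$.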
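Here is a minimal row-wise softmax applied to a hand-written score matrix. The scores are made up for illustration, and `softmax_rows` is just a helper name chosen for this sketch. Subtracting the row maximum before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up scaled scores for a 3-token sequence (illustrative values only).
scores = np.array([[2.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [1.0, 3.0, 1.0]])

A = softmax_rows(scores)
print(A.round(3))
print(A.sum(axis=-1))   # each row sums to 1, up to floating-point rounding
```

The middle row has equal scores, so its attention is split evenly across the three tokens.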
Step 4: Weighted Sum of Values
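Finally, use the attention weights to create a weighted combination of Value vectors:
$$\text{Output} = A V, \quad \text{where } A = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)$$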
Each output token is a blend of all Value vectors, weighted by their relevance.
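A two-token example small enough to check by hand shows what "blend" means here. The attention weights and value vectors are hand-picked for illustration, not produced by a model.

```python
import numpy as np

# Hand-picked attention weights for 2 tokens (each row sums to 1).
A = np.array([[1.0, 0.0],    # token 0 attends only to token 0
              [0.5, 0.5]])   # token 1 splits its attention evenly

# Hand-picked value vectors, one row per token.
V = np.array([[1.0, 2.0],
              [3.0, 4.0]])

output = A @ V
print(output)
# Row 0 equals V[0]; row 1 is the average of V[0] and V[1], i.e. [2. 3.]
```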
See It in Code
Here’s attention implemented from scratch in NumPy:
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """
    Compute scaled dot-product attention.

    Args:
        Q: Query matrix, shape (n, d_k)
        K: Key matrix, shape (n, d_k)
        V: Value matrix, shape (n, d_v)

    Returns:
        Output: Weighted values, shape (n, d_v)
        Attention weights: shape (n, n)
    """
    d_k = K.shape[-1]

    # Step 1: Compute relevance scores
    scores = Q @ K.T  # (n, n)

    # Step 2: Scale
    scores = scores / np.sqrt(d_k)

    # Step 3: Softmax (numerically stable version)
    scores_max = scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores - scores_max)
    attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

    # Step 4: Weighted sum of values
    output = attention_weights @ V  # (n, d_v)

    return output, attention_weights
# Example: 4 tokens, dimension 8
np.random.seed(42)
n_tokens = 4
d_k = 8

# Random embeddings for "The cat sat down"
X = np.random.randn(n_tokens, d_k)

# Random weight matrices (in real models, these are learned)
W_Q = np.random.randn(d_k, d_k) * 0.1
W_K = np.random.randn(d_k, d_k) * 0.1
W_V = np.random.randn(d_k, d_k) * 0.1

# Project to Q, K, V
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Compute attention
output, weights = scaled_dot_product_attention(Q, K, V)

print("Attention Weights (which tokens attend to which):")
tokens = ["The", "cat", "sat", "down"]
print(f"{'':>8}", end="")
for t in tokens:
    print(f"{t:>8}", end="")
print()
for i, row in enumerate(weights):
    print(f"{tokens[i]:>8}", end="")
    for w in row:
        print(f"{w:>8.3f}", end="")
    print()
Example output:
Attention Weights (which tokens attend to which):
             The     cat     sat    down
     The   0.312   0.198   0.241   0.249
     cat   0.215   0.354   0.189   0.242
     sat   0.267   0.203   0.301   0.229
    down   0.238   0.271   0.217   0.274
Multi-Head Attention: Looking from Multiple Angles
One attention head captures one “type” of relevance. But language has many types of relationships — syntactic, semantic, positional, etc.
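Multi-head attention runs multiple attention operations in parallel, each with its own weight matrices:
$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O, \quad \text{head}_i = \text{Attention}\left(X W_Q^{(i)},\, X W_K^{(i)},\, X W_V^{(i)}\right)$$
Where each head has a smaller dimension:
$$d_k = \frac{d_{\text{model}}}{h}$$
If the model dimension is $d_{\text{model}}$ and we use $h$ heads, each head operates on $d_k = d_{\text{model}} / h$ dimensions.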
def multi_head_attention(X, n_heads, d_model):
    """Multi-head attention with n_heads."""
    d_k = d_model // n_heads
    heads = []

    for i in range(n_heads):
        # Each head has its own projection matrices
        W_Q = np.random.randn(d_model, d_k) * 0.1
        W_K = np.random.randn(d_model, d_k) * 0.1
        W_V = np.random.randn(d_model, d_k) * 0.1

        Q = X @ W_Q
        K = X @ W_K
        V = X @ W_V

        head_output, _ = scaled_dot_product_attention(Q, K, V)
        heads.append(head_output)

    # Concatenate all heads
    concat = np.concatenate(heads, axis=-1)  # (n, d_model)

    # Final projection
    W_O = np.random.randn(d_model, d_model) * 0.1
    output = concat @ W_O

    return output
Each head learns to focus on different things:
- Head 1 might track subject-verb agreement
- Head 3 might track pronoun references
- Head 7 might track code block boundaries
- Head 12 might track numerical relationships
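Coming back to the multi-head computation itself: many implementations avoid a Python loop over heads by projecting once and then reshaping into heads. The sketch below is one way to do that under stated assumptions. The function names (`multi_head_attention_batched`, `softmax_rows`, `split_heads`), the toy sizes, and the choice to pack all heads into single `d_model x d_model` matrices are illustrative, not taken from any particular library.

```python
import numpy as np

def softmax_rows(x):
    """Softmax over the last axis, shifted by the max for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention_batched(X, W_Q, W_K, W_V, W_O, n_heads):
    """Multi-head self-attention with all heads computed at once.

    X: (n, d_model); W_Q, W_K, W_V, W_O: (d_model, d_model).
    Head i uses columns i*d_k .. (i+1)*d_k of each projection matrix.
    """
    n, d_model = X.shape
    d_k = d_model // n_heads

    def split_heads(M):
        # (n, d_model) -> (n_heads, n, d_k)
        return M.reshape(n, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (n_heads, n, n)
    weights = softmax_rows(scores)
    heads = weights @ V                                 # (n_heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_O, weights

rng = np.random.default_rng(0)
n, d_model, n_heads = 4, 16, 4        # assumed toy sizes
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention_batched(X, W_Q, W_K, W_V, W_O, n_heads)
print(out.shape, weights.shape)       # (4, 16) (4, 4, 4)
```

Each slice `weights[i]` is the attention matrix of head `i`, so the per-head patterns can be inspected separately.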
Self-Attention vs. Cross-Attention
Self-attention: Q, K, and V all come from the same sequence. The text attends to itself. This is what happens inside each transformer layer.
Cross-attention: Q comes from one sequence, K and V from another. Used in encoder-decoder models (like machine translation) where the decoder attends to the encoder’s output.
Most modern LLMs (GPT, Claude, Llama) use only self-attention — they’re decoder-only models.
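In code, the difference between the two variants comes down to which sequence feeds Q and which feeds K and V. The sketch below uses random stand-ins for the hidden states and weights. The names `decoder_states` and `encoder_states`, the sequence lengths, and the dimension are all hypothetical.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, weights)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
d = 8                                           # assumed toy dimension
W_Q, W_K, W_V = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

decoder_states = rng.standard_normal((3, d))    # hypothetical target sequence, 3 tokens
encoder_states = rng.standard_normal((5, d))    # hypothetical source sequence, 5 tokens

# Self-attention: Q, K, and V are all projected from the same sequence.
_, self_w = attention(decoder_states @ W_Q, decoder_states @ W_K, decoder_states @ W_V)

# Cross-attention: Q from the decoder, K and V from the encoder output.
_, cross_w = attention(decoder_states @ W_Q, encoder_states @ W_K, encoder_states @ W_V)

print(self_w.shape)    # (3, 3): each token attends over its own sequence
print(cross_w.shape)   # (3, 5): each decoder token attends over the encoder tokens
```

The attention function is identical in both cases; only the inputs change, which is why the weight matrix is square for self-attention and rectangular for cross-attention.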
Why Attention Is the Bottleneck
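The $QK^T$ multiplication creates an $n \times n$ matrix. For a 200K token context:
$$n^2 = 200{,}000^2 = 4 \times 10^{10}$$
That’s 40 billion attention scores per head, per layer, per forward pass. At FP16 (2 bytes each):
$$4 \times 10^{10} \times 2 \text{ bytes} = 8 \times 10^{10} \text{ bytes} = 80 \text{ GB}$$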
Just storing the attention matrix for one head of one layer would need 80 GB of GPU memory. This is why techniques like Flash Attention (which never materializes the full matrix) are essential.
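The arithmetic behind these figures is short enough to reproduce directly. This sketch uses the context length and FP16 size from the text; the GiB line is an added conversion for readers who think in binary units.

```python
n_tokens = 200_000          # context length from the text
bytes_per_score = 2         # FP16

n_scores = n_tokens ** 2
n_bytes = n_scores * bytes_per_score

print(f"{n_scores:,} scores")          # 40,000,000,000 scores
print(f"{n_bytes / 1e9:.0f} GB")       # 80 GB (decimal gigabytes)
print(f"{n_bytes / 2**30:.1f} GiB")    # 74.5 GiB (binary gibibytes)
```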
The Computational Complexity
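For $n$ tokens and head dimension $d_k$:
- $QK^T$: multiply $(n \times d_k)$ by $(d_k \times n)$ → $O(n^2 d_k)$
- Softmax: $O(n^2)$
- Multiply by $V$: $(n \times n)$ by $(n \times d_k)$ → $O(n^2 d_k)$
Total per head: $O(n^2 d_k)$
With $h$ heads where $h \cdot d_k = d_{\text{model}}$: $O(n^2 d_{\text{model}})$
This $O(n^2)$ scaling is the fundamental reason why context windows are expensive and why the field is actively researching linear attention alternatives.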
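To get a feel for the quadratic growth, the sketch below counts approximate multiply-adds for the two matrix products in attention, summed over all heads. It ignores the Q/K/V projections and the softmax itself, and `d_model = 4096` is an assumed width used only for illustration.

```python
def attention_mult_adds(n, d_model):
    """Approximate multiply-add count for the two matrix products in attention,
    summed over all heads (h * d_k = d_model). Projections are ignored."""
    qk = n * n * d_model      # Q K^T across all heads
    av = n * n * d_model      # attention weights times V across all heads
    return qk + av

d_model = 4096                # assumed model width, for illustration only
for n in (1_000, 2_000, 4_000, 8_000):
    print(f"n={n:>6,}  mult-adds={attention_mult_adds(n, d_model):.2e}")
```

Each doubling of the context length multiplies the count by four, which is the quadratic behavior described above.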
The Key Insight
Every token competes for attention. For each query, the attention weights are a finite budget that sums to 1, so more tokens means less attention per token on average. This isn’t a bug; it’s a mathematical consequence of how softmax works.
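A quick numerical sketch of this dilution effect: for a single query, softmax weights over n context tokens always sum to 1, so the mean weight is 1/n. The scores here are random draws, which is an assumption made for illustration (real attention scores are far from random), and `softmax` is a local helper defined for this example.

```python
import numpy as np

def softmax(scores):
    """Softmax over a 1-D array of scores for a single query."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(0)

for n in (10, 100, 1_000, 10_000):
    scores = rng.standard_normal(n)      # made-up scores, one per context token
    w = softmax(scores)
    # The weights sum to 1, so the average weight per token is 1/n.
    print(f"n={n:>6}  sum={w.sum():.3f}  mean weight={w.mean():.5f}  max weight={w.max():.5f}")
```

A trained model can still concentrate weight on a few tokens, but the total budget never grows with the context.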
Understanding this helps explain why model quality can degrade as context grows, why important information is often best placed at the start or end, and why smart context management matters more than just making context windows bigger.
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Private Code Context retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai