The attention mechanism is the beating heart of every LLM. Here's how it decides which parts of your conversation matter most — explained with analogies before equations.
Imagine you’re in a library looking for information about “climate change effects on coral reefs.” You have three tools:

- Your **Query**: the question you’re trying to answer
- The **Keys**: the index-card labels describing what each book contains
- The **Values**: the actual contents of the books

Here’s how you search:

1. Compare your query against every key to see how well each one matches
2. Give each book a relevance score based on that match
3. Read from the values of the best-matching books, in proportion to their scores
That’s attention in a nutshell. The transformer does this for every single token it generates — asking “which parts of the context are most relevant right now?”
Every token in the context gets transformed into three vectors through learned weight matrices:

$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V$$

Where:

- $X$ is the matrix of input embeddings (one row per token)
- $W^Q$, $W^K$, $W^V$ are learned projection matrices
- $Q$, $K$, $V$ are the resulting Query, Key, and Value matrices
The complete attention computation in one equation:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Let’s break this down step by step.
Multiply every query against every key to get a “relevance score”:

$$S = QK^T$$

This creates an $n \times n$ matrix where entry $S_{ij}$ tells us how relevant token $j$ is when generating token $i$.

For $n = 4$ tokens (“The cat sat down”), $S$ is a $4 \times 4$ matrix of pairwise relevance scores.
Without scaling, the dot products can grow very large in magnitude (proportional to $\sqrt{d_k}$), pushing softmax into regions where gradients are tiny. The scaling keeps things numerically stable:

$$S = \frac{QK^T}{\sqrt{d_k}}$$

Why $\sqrt{d_k}$? If $q$ and $k$ have entries with mean 0 and variance 1, then $q \cdot k$ has variance $d_k$. Dividing by $\sqrt{d_k}$ brings the variance back to 1.
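A quick numerical check of this variance argument, using random Gaussian vectors (the sample counts here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n_samples = 100_000

# Entries of q and k are i.i.d. with mean 0, variance 1
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

dots = (q * k).sum(axis=1)    # raw dot products q . k
scaled = dots / np.sqrt(d_k)  # scaled scores

print(f"variance of q.k:        {dots.var():.1f}")    # close to d_k = 64
print(f"variance after scaling: {scaled.var():.2f}")  # close to 1
```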
Softmax converts raw scores into a probability distribution (all positive, summing to 1):

$$A_{ij} = \frac{e^{S_{ij}}}{\sum_{k=1}^{n} e^{S_{ik}}}$$

Now each row of $A$ sums to 1, and $A_{ij}$ represents the fraction of attention token $i$ pays to token $j$.
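To make this concrete, here is softmax applied to a single row of hypothetical scaled scores:

```python
import numpy as np

scores = np.array([2.0, 0.5, -1.0, 1.0])  # one row of scaled scores (made-up numbers)

# Numerically stable softmax: subtract the row max before exponentiating
weights = np.exp(scores - scores.max())
weights /= weights.sum()

print(weights)        # all positive, largest score gets the largest weight
print(weights.sum())  # sums to 1 (up to float error)
```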
Finally, use the attention weights to create a weighted combination of Value vectors:

$$\text{Output} = AV$$
Each output token is a blend of all Value vectors, weighted by their relevance.
Here’s attention implemented from scratch in NumPy:
```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """
    Compute scaled dot-product attention.

    Args:
        Q: Query matrix, shape (n, d_k)
        K: Key matrix, shape (n, d_k)
        V: Value matrix, shape (n, d_v)

    Returns:
        Output: Weighted values, shape (n, d_v)
        Attention weights: shape (n, n)
    """
    d_k = K.shape[-1]

    # Step 1: Compute relevance scores
    scores = Q @ K.T  # (n, n)

    # Step 2: Scale
    scores = scores / np.sqrt(d_k)

    # Step 3: Softmax (numerically stable version)
    scores_max = scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores - scores_max)
    attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

    # Step 4: Weighted sum of values
    output = attention_weights @ V  # (n, d_v)
    return output, attention_weights

# Example: 4 tokens, dimension 8
np.random.seed(42)
n_tokens = 4
d_k = 8

# Random embeddings for "The cat sat down"
X = np.random.randn(n_tokens, d_k)

# Random weight matrices (in real models, these are learned)
W_Q = np.random.randn(d_k, d_k) * 0.1
W_K = np.random.randn(d_k, d_k) * 0.1
W_V = np.random.randn(d_k, d_k) * 0.1

# Project to Q, K, V
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Compute attention
output, weights = scaled_dot_product_attention(Q, K, V)

print("Attention Weights (which tokens attend to which):")
tokens = ["The", "cat", "sat", "down"]
print(f"{'':>8}", end="")
for t in tokens:
    print(f"{t:>8}", end="")
print()
for i, row in enumerate(weights):
    print(f"{tokens[i]:>8}", end="")
    for w in row:
        print(f"{w:>8.3f}", end="")
    print()
```

Example output:
```
Attention Weights (which tokens attend to which):
             The     cat     sat    down
     The   0.312   0.198   0.241   0.249
     cat   0.215   0.354   0.189   0.242
     sat   0.267   0.203   0.301   0.229
    down   0.238   0.271   0.217   0.274
```

One attention head captures one “type” of relevance. But language has many types of relationships — syntactic, semantic, positional, etc.
Multi-head attention runs multiple attention operations in parallel, each with its own weight matrices:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Where each head has a smaller dimension:

$$d_k = \frac{d_{\text{model}}}{h}$$

If the model dimension is $d_{\text{model}}$ and we use $h$ heads, each head operates on $d_{\text{model}}/h$ dimensions.
```python
def multi_head_attention(X, n_heads, d_model):
    """Multi-head attention with n_heads."""
    d_k = d_model // n_heads
    heads = []
    for i in range(n_heads):
        # Each head has its own projection matrices
        W_Q = np.random.randn(d_model, d_k) * 0.1
        W_K = np.random.randn(d_model, d_k) * 0.1
        W_V = np.random.randn(d_model, d_k) * 0.1
        Q = X @ W_Q
        K = X @ W_K
        V = X @ W_V
        head_output, _ = scaled_dot_product_attention(Q, K, V)
        heads.append(head_output)
    # Concatenate all heads
    concat = np.concatenate(heads, axis=-1)  # (n, d_model)
    # Final projection
    W_O = np.random.randn(d_model, d_model) * 0.1
    output = concat @ W_O
    return output
```

Each head learns to focus on different things: one may track syntactic structure, another semantic similarity, another positional patterns.
Self-attention: Q, K, and V all come from the same sequence. The text attends to itself. This is what happens inside each transformer layer.
Cross-attention: Q comes from one sequence, K and V from another. Used in encoder-decoder models (like machine translation) where the decoder attends to the encoder’s output.
Most modern LLMs (GPT, Claude, Llama) use only self-attention — they’re decoder-only models.
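Cross-attention requires no new math; only the source of the queries differs from that of the keys and values. A minimal sketch (a condensed version of the attention function above, with illustrative names and shapes):

```python
import numpy as np

def attention(Q, K, V):
    # Same scaled dot-product attention as before, condensed
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
d = 8
decoder_states = rng.standard_normal((3, d))  # 3 target-side tokens
encoder_states = rng.standard_normal((5, d))  # 5 source-side tokens

# Cross-attention: queries from the decoder, keys/values from the encoder
out = attention(decoder_states, encoder_states, encoder_states)
print(out.shape)  # (3, 8): one output per decoder token, built from encoder values
```

Note the sequence lengths need not match: each of the 3 decoder tokens attends over all 5 encoder tokens.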
The $QK^T$ multiplication creates an $n \times n$ matrix. For a 200K token context:

$$n^2 = 200{,}000^2 = 4 \times 10^{10}$$

That’s 40 billion attention scores per head, per layer, per forward pass. At FP16 (2 bytes each):

$$4 \times 10^{10} \times 2 \text{ bytes} = 80 \text{ GB}$$
Just storing the attention matrix for one head of one layer would need 80 GB of GPU memory. This is why techniques like Flash Attention (which never materializes the full matrix) are essential.
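The arithmetic behind that 80 GB figure is easy to verify:

```python
n = 200_000          # context length in tokens
bytes_per_score = 2  # FP16

scores = n * n  # attention matrix entries for one head of one layer
memory_gb = scores * bytes_per_score / 1e9

print(f"{scores:.1e} scores")  # 4.0e+10 = 40 billion
print(f"{memory_gb:.0f} GB")   # 80 GB for one head of one layer
```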
For $n$ tokens and head dimension $d_k$:

- Scores $QK^T$: $O(n^2 d_k)$
- Softmax over each row: $O(n^2)$
- Weighted sum $AV$: $O(n^2 d_k)$

Total per head: $O(n^2 d_k)$

With $h$ heads where $h \cdot d_k = d_{\text{model}}$:

$$O(n^2 d_{\text{model}})$$
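The quadratic term dominates in practice: doubling the context quadruples the attention cost. A back-of-the-envelope check (the FLOP formula and $d_{\text{model}}$ value are rough illustrative estimates, not from any specific model):

```python
d_model = 4096  # illustrative model width

def attention_flops(n, d_model):
    # Rough count: QK^T and AV are each about 2 * n^2 * d_model multiply-adds
    return 2 * 2 * n * n * d_model

base = attention_flops(8_000, d_model)
doubled = attention_flops(16_000, d_model)
print(doubled / base)  # 4.0: doubling n quadruples the attention cost
```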
This scaling is the fundamental reason why context windows are expensive and why the field is actively researching linear attention alternatives.
Every token competes for attention. Attention is a finite resource that sums to 1. More tokens = less attention per token. This isn’t a bug — it’s a mathematical consequence of how softmax works.
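A toy illustration of that finite budget: when every token is equally relevant, softmax gives each one exactly $1/n$ of the attention, so each token's share shrinks as the context grows.

```python
import numpy as np

for n in [10, 100, 1000]:
    scores = np.zeros(n)  # every token equally relevant
    weights = np.exp(scores) / np.exp(scores).sum()
    print(n, weights[0])  # each token gets exactly 1/n of the attention
```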
Understanding this explains why AI gets worse with longer context, why important information should go at the start or end, and why smart context management is more important than just making context windows bigger.
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai