How the Transformer Attention Mechanism Actually Works (No Math Required)

The attention mechanism is the beating heart of every LLM. Here's how it decides which parts of your conversation matter most — explained with analogies before equations.

Start Simple: The Library Analogy

Imagine you’re in a library looking for information about “climate change effects on coral reefs.” You have three tools:

  1. Your question (what you’re looking for) → This is the Query (Q)
  2. The catalog cards (what each book is about) → These are the Keys (K)
  3. The actual books (the information itself) → These are the Values (V)

Here’s how you search:

  1. You compare your question against every catalog card to find which books are relevant
  2. The most relevant books get your full attention; irrelevant books get ignored
  3. You read the actual content from the relevant books

That’s attention in a nutshell. The transformer does this for every single token it generates — asking “which parts of the context are most relevant right now?”
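The library analogy can be sketched as a "soft lookup": instead of picking one best match, the query scores every catalog card and blends the books in proportion to relevance. This is a toy illustration with made-up two-dimensional scores, not part of any real model:

```python
import numpy as np

# Hypothetical catalog: each "card" (key) is a small vector describing a book.
keys = {
    "coral reef bleaching": np.array([0.9, 0.1]),
    "arctic ice melt":      np.array([0.2, 0.8]),
    "reef fish habitats":   np.array([0.8, 0.3]),
}
values = {
    "coral reef bleaching": "Warming water expels symbiotic algae.",
    "arctic ice melt":      "Sea ice minimums keep shrinking.",
    "reef fish habitats":   "Reefs shelter a large share of marine fish.",
}

query = np.array([1.0, 0.0])  # "climate change effects on coral reefs"

# Score every card against the question, then normalize with softmax
# so the relevance weights sum to 1.
scores = {k: float(query @ v) for k, v in keys.items()}
exp_scores = {k: np.exp(s) for k, s in scores.items()}
total = sum(exp_scores.values())
weights = {k: e / total for k, e in exp_scores.items()}

for k in sorted(weights, key=weights.get, reverse=True):
    print(f"{weights[k]:.2f}  {values[k]}")
```

The coral-reef books end up with most of the weight, but every book contributes a little — exactly the "weighted blend" behavior the real mechanism has.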

Query, Key, Value: The Three Vectors

Every token in the context gets transformed into three vectors through learned weight matrices:

Q = X \cdot W_Q \quad \text{(Query: "What am I looking for?")}

K = X \cdot W_K \quad \text{(Key: "What do I contain?")}

V = X \cdot W_V \quad \text{(Value: "Here's my actual information")}

Where:

  - X is the matrix of input token embeddings (one row per token)
  - W_Q, W_K, W_V are learned weight matrices
  - Q, K, V are the resulting query, key, and value matrices

The Attention Formula

The complete attention computation in one equation:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Let’s break this down step by step.

Step 1: Compute Relevance Scores (QK^T)

Multiply every query against every key to get a “relevance score”:

S = QK^T \in \mathbb{R}^{n \times n}

This creates an n × n matrix where entry S_ij tells us how relevant token j is when generating token i.

For n = 4 tokens ("The cat sat down"):

S = \begin{bmatrix} s_{11} & s_{12} & s_{13} & s_{14} \\ s_{21} & s_{22} & s_{23} & s_{24} \\ s_{31} & s_{32} & s_{33} & s_{34} \\ s_{41} & s_{42} & s_{43} & s_{44} \end{bmatrix}

Step 2: Scale by √d_k

Without scaling, the dot products can grow very large (their variance grows proportionally with d_k), pushing softmax into regions where gradients are tiny. The scaling keeps things numerically stable:

S_{\text{scaled}} = \frac{S}{\sqrt{d_k}}

Why √d_k? If the entries of a query vector q and a key vector k each have variance 1, their dot product q · k has variance d_k. Dividing by √d_k brings the variance back to 1.
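This variance claim is easy to check empirically. A quick sketch drawing many random unit-variance vectors and measuring the variance of their dot products, before and after scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n_samples = 100_000

# Entries of q and k are drawn with mean 0 and variance 1.
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

dots = np.sum(q * k, axis=1)      # raw dot products
scaled = dots / np.sqrt(d_k)      # after the 1/sqrt(d_k) scaling

print(f"var(q . k)           ~ {dots.var():.1f}  (expect ~{d_k})")
print(f"var(q . k / sqrt(dk)) ~ {scaled.var():.3f}  (expect ~1)")
```

The unscaled variance comes out near 64; after dividing by √64 it returns to roughly 1, which keeps the softmax inputs in a well-behaved range.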

Step 3: Apply Softmax (Normalize)

Softmax converts raw scores into a probability distribution (all positive, summing to 1):

\alpha_{ij} = \frac{\exp(S_{ij} / \sqrt{d_k})}{\sum_{l=1}^{n} \exp(S_{il} / \sqrt{d_k})}

Now each row sums to 1, and α_ij represents the fraction of attention token i pays to token j.

Step 4: Weighted Sum of Values

Finally, use the attention weights to create a weighted combination of Value vectors:

\text{Output}_i = \sum_{j=1}^{n} \alpha_{ij} \cdot V_j

Each output token is a blend of all Value vectors, weighted by their relevance.

See It in Code

Here’s attention implemented from scratch in NumPy:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """
    Compute scaled dot-product attention.

    Args:
        Q: Query matrix, shape (n, d_k)
        K: Key matrix, shape (n, d_k)
        V: Value matrix, shape (n, d_v)

    Returns:
        Output: Weighted values, shape (n, d_v)
        Attention weights: shape (n, n)
    """
    d_k = K.shape[-1]

    # Step 1: Compute relevance scores
    scores = Q @ K.T  # (n, n)

    # Step 2: Scale
    scores = scores / np.sqrt(d_k)

    # Step 3: Softmax (numerically stable version)
    scores_max = scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores - scores_max)
    attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

    # Step 4: Weighted sum of values
    output = attention_weights @ V  # (n, d_v)

    return output, attention_weights


# Example: 4 tokens, dimension 8
np.random.seed(42)
n_tokens = 4
d_k = 8

# Random embeddings for "The cat sat down"
X = np.random.randn(n_tokens, d_k)

# Random weight matrices (in real models, these are learned)
W_Q = np.random.randn(d_k, d_k) * 0.1
W_K = np.random.randn(d_k, d_k) * 0.1
W_V = np.random.randn(d_k, d_k) * 0.1

# Project to Q, K, V
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Compute attention
output, weights = scaled_dot_product_attention(Q, K, V)

print("Attention Weights (which tokens attend to which):")
tokens = ["The", "cat", "sat", "down"]
print(f"{'':>8}", end="")
for t in tokens:
    print(f"{t:>8}", end="")
print()

for i, row in enumerate(weights):
    print(f"{tokens[i]:>8}", end="")
    for w in row:
        print(f"{w:>8.3f}", end="")
    print()

Example output:

Attention Weights (which tokens attend to which):
              The     cat     sat    down
     The   0.312   0.198   0.241   0.249
     cat   0.215   0.354   0.189   0.242
     sat   0.267   0.203   0.301   0.229
    down   0.238   0.271   0.217   0.274

Multi-Head Attention: Looking from Multiple Angles

One attention head captures one “type” of relevance. But language has many types of relationships — syntactic, semantic, positional, etc.

Multi-head attention runs multiple attention operations in parallel, each with its own weight matrices:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \cdot W_O

Where each head has a smaller dimension:

\text{head}_i = \text{Attention}(Q W_Q^i, K W_K^i, V W_V^i)

If the model dimension is d = 768 and we use h = 12 heads, each head operates on d_k = 768 / 12 = 64 dimensions.

def multi_head_attention(X, n_heads, d_model):
    """Multi-head attention with n_heads."""
    d_k = d_model // n_heads
    heads = []

    for i in range(n_heads):
        # Each head has its own projection matrices
        W_Q = np.random.randn(d_model, d_k) * 0.1
        W_K = np.random.randn(d_model, d_k) * 0.1
        W_V = np.random.randn(d_model, d_k) * 0.1

        Q = X @ W_Q
        K = X @ W_K
        V = X @ W_V

        head_output, _ = scaled_dot_product_attention(Q, K, V)
        heads.append(head_output)

    # Concatenate all heads
    concat = np.concatenate(heads, axis=-1)  # (n, d_model)

    # Final projection
    W_O = np.random.randn(d_model, d_model) * 0.1
    output = concat @ W_O

    return output

Each head learns to focus on different things. In trained models, individual heads have been observed tracking syntactic dependencies, coreference (which pronoun refers to which noun), and simple positional patterns such as attending to the previous token.

Self-Attention vs. Cross-Attention

Self-attention: Q, K, and V all come from the same sequence. The text attends to itself. This is what happens inside each transformer layer.

\text{Self-Attention: } Q = XW_Q, \quad K = XW_K, \quad V = XW_V

Cross-attention: Q comes from one sequence, K and V from another. Used in encoder-decoder models (like machine translation) where the decoder attends to the encoder’s output.

\text{Cross-Attention: } Q = X_{\text{decoder}}W_Q, \quad K = X_{\text{encoder}}W_K, \quad V = X_{\text{encoder}}W_V

Most modern LLMs (GPT, Claude, Llama) use only self-attention — they’re decoder-only models.
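The only difference between the two is where Q, K, and V come from; the computation itself is identical. A minimal self-contained sketch of cross-attention, using random stand-in embeddings purely to illustrate the shapes:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (same math as self-attention)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 8
X_decoder = rng.standard_normal((3, d))  # e.g. 3 target-language tokens
X_encoder = rng.standard_normal((5, d))  # e.g. 5 source-language tokens

W_Q, W_K, W_V = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

# Queries come from the decoder; keys and values come from the encoder.
out = attention(X_decoder @ W_Q, X_encoder @ W_K, X_encoder @ W_V)
print(out.shape)  # one output row per decoder token
```

Note that the output has one row per decoder token (the sequence asking the questions), while the 5 encoder tokens only contribute keys and values.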

Why Attention Is the Bottleneck

The QK^T multiplication creates an n × n matrix. For a 200K-token context:

200,000 × 200,000 = 40,000,000,000 (40 billion entries)

That’s 40 billion attention scores per head, per layer, per forward pass. At FP16 (2 bytes each):

40 × 10^9 × 2 bytes = 80 GB

Just storing the attention matrix for one head of one layer would need 80 GB of GPU memory. This is why techniques like Flash Attention (which never materializes the full matrix) are essential.
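The memory arithmetic above can be sketched directly:

```python
def attention_matrix_bytes(n_tokens: int, bytes_per_entry: int = 2) -> int:
    """Memory needed to materialize one n x n attention matrix
    (per head, per layer), at the given precision (FP16 = 2 bytes)."""
    return n_tokens * n_tokens * bytes_per_entry

for n in (4_096, 32_768, 200_000):
    gb = attention_matrix_bytes(n) / 1e9
    print(f"{n:>7} tokens -> {gb:>9,.2f} GB per head, per layer")
```

Because the cost is quadratic, going from a 4K to a 200K context multiplies the attention-matrix memory by roughly 2,400x, not 50x.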

The Computational Complexity

For n tokens and head dimension d_k:

  1. QK^T: multiplying an (n × d_k) matrix by a (d_k × n) matrix costs O(n² d_k)
  2. Softmax: O(n²)
  3. Multiplying by V: (n × n) times (n × d_v) costs O(n² d_v)

Total per head: O(n² d_k)

With h heads, where d_k = d/h:

\text{Total} = h \times O(n^2 \cdot d/h) = O(n^2 d)

This O(n²) scaling is the fundamental reason why context windows are expensive and why the field is actively researching linear attention alternatives.

The Key Insight

Every token competes for attention. Attention is a finite resource that sums to 1. More tokens = less attention per token. This isn’t a bug — it’s a mathematical consequence of how softmax works.

Understanding this explains why AI gets worse with longer context, why important information should go at the start or end, and why smart context management is more important than just making context windows bigger.
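The "finite resource" point follows directly from softmax: every row of attention weights sums to 1, so each extra token shrinks everyone else's share. A toy demonstration, with one "important" token holding a fixed relevance score while increasingly many filler tokens (random scores, hypothetical values) compete with it:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
important_score = 2.0  # fixed relevance of the token we care about

shares = []
for n_filler in (10, 100, 1_000, 10_000):
    # One important token plus n_filler moderately relevant tokens.
    scores = np.concatenate([[important_score],
                             rng.normal(0.0, 1.0, n_filler)])
    shares.append(softmax(scores)[0])
    print(f"{n_filler:>6} filler tokens -> "
          f"important token gets {shares[-1]:.4f} of attention")
```

Even though the important token's score never changes, its slice of the attention budget collapses as the context fills up — the dilution is baked into the normalization.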


ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai