Before the AI generates its first word, it must process EVERY token in the context. Here's why time-to-first-token increases with context length.
Imagine you’re taking a test. The teacher hands you a 10-page reading passage, then asks a question. Before you can answer, you have to read all 10 pages. The longer the passage, the longer you wait before writing your first word.
Now imagine the passage is 1,000 pages. You’d sit there reading for hours before you could even start answering.
This is exactly what happens with AI. Before the model generates its first token of output, it must process every single token in the context. This is called the prefill phase, and it’s why long-context queries feel slow.
Every LLM inference has two distinct phases:
**Phase 1: Prefill.** The model processes all input tokens in parallel. This is when the model reads the entire prompt, computes self-attention across all input tokens, and builds the KV cache that the decode phase will reuse.
Characteristic: Compute-bound. The GPU is doing massive matrix multiplications.
Time complexity: O(n²) for self-attention, where n is the input length.
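To make the quadratic growth concrete, here's a small sketch (the context lengths are arbitrary examples) counting the (query, key) score pairs that causal self-attention must compute:

```python
def attention_pairs(n: int) -> int:
    """Causal self-attention computes one score per (query, key) pair
    with key index <= query index: n * (n + 1) / 2 pairs."""
    return n * (n + 1) // 2

# Doubling the context roughly quadruples the attention work.
print(attention_pairs(4_096))   # → 8390656
print(attention_pairs(8_192))   # → 33558528
```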
**Phase 2: Decode.** The model generates tokens one at a time. Each new token attends to every previously processed token through the KV cache, so generation is inherently sequential.
Characteristic: Memory-bound. The bottleneck is reading the KV cache from GPU memory.
Time per token: Roughly constant, regardless of context length.
Time to first token (TTFT) is the metric users care about most: how long until the AI starts responding?
Since per-token decode time is roughly constant (~20–50 ms), TTFT is dominated by prefill time:

TTFT ≈ T_prefill
For the attention computation during prefill:

T_prefill ≈ (n² · d) / F

Where:
- n is the input length in tokens
- d is the model's hidden dimension
- F is the GPU's effective FLOP/s throughput
For a model like Llama 3.1 70B on an H100:
| Context Length | Approximate TTFT |
|---|---|
| 1K tokens | ~0.1 seconds |
| 10K tokens | ~0.5 seconds |
| 100K tokens | ~5 seconds |
| 500K tokens | ~25 seconds |
| 1M tokens | ~50+ seconds |
That means with 1M tokens of context, the user waits almost a minute before seeing the first word of output.
```python
def simulate_inference_timing(
    context_lengths: list,
    output_length: int = 500,
    prefill_rate: float = 50_000,  # tokens/sec during prefill
    decode_rate: float = 50,       # tokens/sec during decode
):
    """
    Simulate inference timing for different context lengths.

    prefill_rate: how fast the model processes input tokens
    decode_rate: how fast the model generates output tokens
    """
    print(f"{'Context':>10} {'Prefill':>10} {'Decode':>10} {'Total':>10} {'TTFT':>10}")
    print("=" * 55)
    for ctx_len in context_lengths:
        # Prefill time scales with context length
        # (simplified - actual scaling is superlinear due to attention)
        prefill_time = ctx_len / prefill_rate
        # Decode time is roughly constant for fixed output length
        decode_time = output_length / decode_rate
        total_time = prefill_time + decode_time
        ttft = prefill_time  # Time to first token ≈ prefill time
        print(f"{ctx_len:>10,} {prefill_time:>10.2f}s {decode_time:>10.2f}s "
              f"{total_time:>10.2f}s {ttft:>10.2f}s")

simulate_inference_timing(
    context_lengths=[1_000, 10_000, 50_000, 128_000, 500_000, 1_000_000],
    output_length=500,
)
```

Output:
```
   Context    Prefill     Decode      Total       TTFT
=======================================================
     1,000      0.02s     10.00s     10.02s      0.02s
    10,000      0.20s     10.00s     10.20s      0.20s
    50,000      1.00s     10.00s     11.00s      1.00s
   128,000      2.56s     10.00s     12.56s      2.56s
   500,000     10.00s     10.00s     20.00s     10.00s
 1,000,000     20.00s     10.00s     30.00s     20.00s
```

Key insight: decode time stays the same (fixed output length), but prefill time dominates at long contexts.
During prefill, all tokens are processed simultaneously. The main operation is matrix multiplication:

FLOPs_prefill ≈ n · d² (Q, K, V projections) + n² · d (attention)

The first term is for the Q, K, V projections; the second is for attention. With large n, the GPU's compute units are fully utilized. The bottleneck is raw floating-point operations.
Arithmetic intensity (FLOPS per byte transferred) is high → compute-bound.
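To see roughly how high, the sketch below estimates FLOPs per byte for the n×d by d×n score matmul. Using the full hidden dimension d = 8192 and FP16 storage are illustrative simplifications (real attention splits d across heads, and weight reads are ignored):

```python
def arithmetic_intensity(n: int, d: int = 8192, bytes_per_el: int = 2) -> float:
    """FLOPs per byte for an (n x d) @ (d x n) score matmul:
    2*n*n*d FLOPs vs. moving roughly (2*n*d + n*n) elements."""
    flops = 2 * n * n * d
    bytes_moved = (2 * n * d + n * n) * bytes_per_el
    return flops / bytes_moved

# Intensity grows with n: longer contexts keep the GPU busier per byte.
for n in (1_000, 10_000, 100_000):
    print(n, round(arithmetic_intensity(n)))
```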
During decode, the model generates one token at a time. Each step performs comparatively little compute: roughly 2P FLOPs for a model with P parameters.
But the model still needs to read all of the model's weights, plus the entire KV cache accumulated so far, from GPU memory on every step.
The FLOPS per token are small relative to the data that must be read from GPU memory. The bottleneck is memory bandwidth, not compute.
For a 70B-parameter model in FP16, each generated token performs only ~2 FLOPs per parameter while streaming all ~140 GB of weights from memory, an arithmetic intensity of roughly 1 FLOP/byte. With ~3 TB/s of H100 memory bandwidth, that caps single-stream decode at roughly 20–50 tokens/second, depending on precision.
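A back-of-envelope sketch of that ceiling, assuming every generated token must stream all model weights from memory once and ignoring KV-cache reads:

```python
def max_decode_rate(params_b: float, bytes_per_param: int, bandwidth_tb_s: float) -> float:
    """Upper bound on tokens/sec when each generated token
    requires reading every model weight from GPU memory."""
    bytes_per_token = params_b * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token

print(round(max_decode_rate(70, 2, 3.0)))  # FP16 weights → ~21 tokens/sec
print(round(max_decode_rate(70, 1, 3.0)))  # INT8 weights → ~43 tokens/sec
```

Lower-precision weights raise the ceiling because fewer bytes move per token, which is why quantization helps decode throughput even when compute is unchanged.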
Instead of processing all tokens at once (which requires O(n²) memory for the attention scores), chunked prefill breaks the input into chunks:
```python
def chunked_prefill(tokens, chunk_size=4096):
    """
    Process input tokens in chunks to reduce peak memory.

    Instead of computing attention over all n tokens at once
    (needing O(n^2) memory for the score matrix), process in
    chunks of size c (needing at most O(c * n) memory per chunk).
    """
    n = len(tokens)
    kv_cache = []
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        chunk = tokens[start:end]
        # Compute Q, K, V for this chunk
        q, k, v = compute_qkv(chunk)
        # Attend to all previous KV cache + current chunk
        # This is O(chunk_size * total_processed) instead of O(n^2)
        output = attend(q, kv_cache + [(k, v)])
        # Add this chunk's KV to the cache
        kv_cache.append((k, v))
    return kv_cache
```

This doesn't reduce total compute (the full O(n²) attention work is still required), but it reduces peak memory from O(n²) to O(c · n), where c is the chunk size.
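To put numbers on the savings, a quick sketch comparing peak attention-score memory with and without chunking. FP16 scores, a single attention matrix, and no FlashAttention-style streaming are all simplifying assumptions:

```python
def score_memory_gb(rows: int, cols: int, bytes_per_el: int = 2) -> float:
    """Memory for an attention-score matrix of shape (rows, cols)."""
    return rows * cols * bytes_per_el / 1e9

n, c = 128_000, 4_096
full = score_memory_gb(n, n)      # all queries vs. all keys at once
chunked = score_memory_gb(c, n)   # worst chunk: c queries vs. n keys
print(f"full: {full:.1f} GB, chunked: {chunked:.2f} GB")
```

At 128K context the full score matrix would need tens of gigabytes, while the worst chunk stays around a gigabyte.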
Since decode is memory-bound (the GPU is waiting for data), we can use a trick: generate multiple tokens at once using a smaller, faster “draft” model, then verify them with the large model.
On average, each large-model pass yields

E[tokens per pass] = (1 − a^(k+1)) / (1 − a)

where k is the number of speculated tokens and a is the per-token acceptance rate.
If the draft model proposes 5 tokens and 4 are accepted, the large model emits 5 tokens (the 4 accepted drafts plus its own correction) in a single forward pass, roughly a 5× reduction in sequential large-model steps.
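Under the standard simplifying assumption that each draft token is accepted independently with probability a, the expected yield per verification pass can be sketched as:

```python
def expected_tokens_per_pass(k: int, a: float) -> float:
    """Expected tokens produced per large-model forward pass when the
    draft proposes k tokens, each accepted independently with prob a.
    Accepted prefix plus one correction/bonus token:
    E = 1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a)."""
    return (1 - a ** (k + 1)) / (1 - a)

print(round(expected_tokens_per_pass(5, 0.8), 2))  # → 3.69
```

So with k = 5 drafts and an 80% acceptance rate, each expensive large-model pass produces ~3.7 tokens instead of 1.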
```python
def speculative_decode(large_model, small_model, context, k=5):
    """
    Speculative decoding: use the small model to draft,
    the large model to verify.
    """
    output = []
    while not is_done(output):
        # Step 1: Small model generates k draft tokens (fast)
        drafts = []
        draft_context = context + output
        for _ in range(k):
            token = small_model.generate_one(draft_context)
            drafts.append(token)
            draft_context = draft_context + [token]
        # Step 2: Large model verifies all k tokens in parallel
        # (parallel = one forward pass = like prefill, compute-bound)
        verified = large_model.verify(context + output, drafts)
        # Step 3: Accept tokens until the first rejection
        for draft, is_correct in zip(drafts, verified):
            if is_correct:
                output.append(draft)
            else:
                # Replace the rejected draft with the large model's token
                output.append(large_model.sample(context + output))
                break
    return output
```

Understanding prefill vs. decode has direct implications:
Long context = long wait. TTFT scales linearly (at minimum) with context length. Don’t stuff context unnecessarily.
Streaming helps UX but not TTFT. Even with streaming enabled, the user waits through the entire prefill before seeing anything.
Batch wisely. Prefill is compute-bound; decode is memory-bound. Mixing them on the same GPU is suboptimal — some systems separate prefill and decode to different GPU pools.
Cache system prompts. If every request shares a long system prompt, caching its KV avoids re-computing the prefill for those tokens.
If your system prompt is 20K tokens and the user’s message is 1K tokens, caching gives a ~20× TTFT improvement.
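The arithmetic behind that claim, assuming TTFT scales linearly with the number of uncached prefill tokens:

```python
system_prompt = 20_000  # tokens, shared across requests (KV cached)
user_message = 1_000    # tokens, unique per request

cold = system_prompt + user_message  # cache miss: prefill everything
warm = user_message                  # cache hit: prefill only the suffix
print(f"TTFT speedup: ~{cold / warm:.0f}x")
```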
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai