Before the AI generates its first word, it must process EVERY token in the context. Here's why time-to-first-token increases with context length.
Imagine you’re taking a test. The teacher hands you a 10-page reading passage, then asks a question. Before you can answer, you have to read all 10 pages. The longer the passage, the longer you wait before writing your first word.
Now imagine the passage is 1,000 pages. You’d sit there reading for hours before you could even start answering.
This is exactly what happens with AI. Before the model generates its first token of output, it must process every single token in the context. This is called the prefill phase, and it’s why long-context queries feel slow.
Every LLM inference has two distinct phases:
**Phase 1: Prefill.** The model processes all input tokens in parallel. This is when the model reads the entire prompt, computes self-attention across all input tokens, and builds the KV cache that the decode phase will reuse.
Characteristic: Compute-bound. The GPU is doing massive matrix multiplications.
Time complexity: O(n²) for self-attention, where n is the input length.
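To make the quadratic growth concrete, here's a small sketch (the context lengths are arbitrary examples) counting the (query, key) score pairs that causal self-attention must compute:

```python
def attention_pairs(n: int) -> int:
    """Causal self-attention computes one score per (query, key) pair
    with key index <= query index: n * (n + 1) / 2 pairs."""
    return n * (n + 1) // 2

# Doubling the context roughly quadruples the attention work.
print(attention_pairs(4_096))   # → 8390656
print(attention_pairs(8_192))   # → 33558528
```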
**Phase 2: Decode.** The model generates tokens one at a time. Each new token attends to every previously processed token through the KV cache, so generation is inherently sequential.
Characteristic: Memory-bound. The bottleneck is reading the KV cache from GPU memory.
Time per token: Roughly constant, regardless of context length.
Time to first token (TTFT) is the metric users care about most: how long until the AI starts responding?
Since per-token decode time is roughly constant (~20–50 ms), TTFT is dominated by prefill time:

TTFT ≈ T_prefill
For the attention computation during prefill:

T_prefill ≈ (n² · d) / F

Where:
- n is the input length in tokens
- d is the model's hidden dimension
- F is the GPU's effective FLOP/s throughput
For a model like Llama 3.1 70B on an H100:
| Context Length | Approximate TTFT |
|---|---|
| 1K tokens | ~0.1 seconds |
| 10K tokens | ~0.5 seconds |
| 100K tokens | ~5 seconds |
| 500K tokens | ~25 seconds |
| 1M tokens | ~50+ seconds |
That means with 1M tokens of context, the user waits almost a minute before seeing the first word of output.
```python
def simulate_inference_timing(
    context_lengths: list,
    output_length: int = 500,
    prefill_rate: float = 50_000,  # tokens/sec during prefill
    decode_rate: float = 50,       # tokens/sec during decode
):
    """
    Simulate inference timing for different context lengths.

    prefill_rate: how fast the model processes input tokens
    decode_rate: how fast the model generates output tokens
    """
    print(f"{'Context':>10} {'Prefill':>10} {'Decode':>10} {'Total':>10} {'TTFT':>10}")
    print("=" * 55)
    for ctx_len in context_lengths:
        # Prefill time scales with context length
        # (simplified - actual scaling is superlinear due to attention)
        prefill_time = ctx_len / prefill_rate
        # Decode time is roughly constant for fixed output length
        decode_time = output_length / decode_rate
        total_time = prefill_time + decode_time
        ttft = prefill_time  # Time to first token ≈ prefill time
        print(f"{ctx_len:>10,} {prefill_time:>10.2f}s {decode_time:>10.2f}s "
              f"{total_time:>10.2f}s {ttft:>10.2f}s")

simulate_inference_timing(
    context_lengths=[1_000, 10_000, 50_000, 128_000, 500_000, 1_000_000],
    output_length=500,
)
```

Output:
```
   Context    Prefill     Decode      Total       TTFT
=======================================================
     1,000      0.02s     10.00s     10.02s      0.02s
    10,000      0.20s     10.00s     10.20s      0.20s
    50,000      1.00s     10.00s     11.00s      1.00s
   128,000      2.56s     10.00s     12.56s      2.56s
   500,000     10.00s     10.00s     20.00s     10.00s
 1,000,000     20.00s     10.00s     30.00s     20.00s
```

Key insight: decode time stays the same (fixed output length), but prefill time dominates at long contexts.
During prefill, all tokens are processed simultaneously. The main operation is matrix multiplication:

FLOPs_prefill ≈ n · d² (Q, K, V projections) + n² · d (attention)

The first term is for the Q, K, V projections; the second is for attention. With large n, the GPU's compute units are fully utilized. The bottleneck is raw floating-point operations.
Arithmetic intensity (FLOPS per byte transferred) is high → compute-bound.
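To see roughly how high, the sketch below estimates FLOPs per byte for the n×d by d×n score matmul. Using the full hidden dimension d = 8192 and FP16 storage are illustrative simplifications (real attention splits d across heads, and weight reads are ignored):

```python
def arithmetic_intensity(n: int, d: int = 8192, bytes_per_el: int = 2) -> float:
    """FLOPs per byte for an (n x d) @ (d x n) score matmul:
    2*n*n*d FLOPs vs. moving roughly (2*n*d + n*n) elements."""
    flops = 2 * n * n * d
    bytes_moved = (2 * n * d + n * n) * bytes_per_el
    return flops / bytes_moved

# Intensity grows with n: longer contexts keep the GPU busier per byte.
for n in (1_000, 10_000, 100_000):
    print(n, round(arithmetic_intensity(n)))
```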
During decode, the model generates one token at a time. Each step performs comparatively little compute: roughly 2P FLOPs for a model with P parameters.
But the model still needs to read all of the model's weights, plus the entire KV cache accumulated so far, from GPU memory on every step.
The FLOPS per token are small relative to the data that must be read from GPU memory. The bottleneck is memory bandwidth, not compute.
For a 70B-parameter model in FP16, each generated token performs only ~2 FLOPs per parameter while streaming all ~140 GB of weights from memory, an arithmetic intensity of roughly 1 FLOP/byte. With ~3 TB/s of H100 memory bandwidth, that caps single-stream decode at roughly 20–50 tokens/second, depending on precision.
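A back-of-envelope sketch of that ceiling, assuming every generated token must stream all model weights from memory once and ignoring KV-cache reads:

```python
def max_decode_rate(params_b: float, bytes_per_param: int, bandwidth_tb_s: float) -> float:
    """Upper bound on tokens/sec when each generated token
    requires reading every model weight from GPU memory."""
    bytes_per_token = params_b * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token

print(round(max_decode_rate(70, 2, 3.0)))  # FP16 weights → ~21 tokens/sec
print(round(max_decode_rate(70, 1, 3.0)))  # INT8 weights → ~43 tokens/sec
```

Lower-precision weights raise the ceiling because fewer bytes move per token, which is why quantization helps decode throughput even when compute is unchanged.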
Instead of processing all tokens at once (which requires O(n²) memory for the attention scores), chunked prefill breaks the input into chunks:
```python
def chunked_prefill(tokens, chunk_size=4096):
    """
    Process input tokens in chunks to reduce peak memory.

    Instead of computing attention over all n tokens at once
    (needing O(n^2) memory for the score matrix), process in
    chunks of size c (needing at most O(c * n) memory per chunk).
    """
    n = len(tokens)
    kv_cache = []
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        chunk = tokens[start:end]
        # Compute Q, K, V for this chunk
        q, k, v = compute_qkv(chunk)
        # Attend to all previous KV cache + current chunk
        # This is O(chunk_size * total_processed) instead of O(n^2)
        output = attend(q, kv_cache + [(k, v)])
        # Add this chunk's KV to the cache
        kv_cache.append((k, v))
    return kv_cache
```

This doesn't reduce total compute (the full O(n²) attention work is still required), but it reduces peak memory from O(n²) to O(c · n), where c is the chunk size.
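To put numbers on the savings, a quick sketch comparing peak attention-score memory with and without chunking. FP16 scores, a single attention matrix, and no FlashAttention-style streaming are all simplifying assumptions:

```python
def score_memory_gb(rows: int, cols: int, bytes_per_el: int = 2) -> float:
    """Memory for an attention-score matrix of shape (rows, cols)."""
    return rows * cols * bytes_per_el / 1e9

n, c = 128_000, 4_096
full = score_memory_gb(n, n)      # all queries vs. all keys at once
chunked = score_memory_gb(c, n)   # worst chunk: c queries vs. n keys
print(f"full: {full:.1f} GB, chunked: {chunked:.2f} GB")
```

At 128K context the full score matrix would need tens of gigabytes, while the worst chunk stays around a gigabyte.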
Since decode is memory-bound (the GPU is waiting for data), we can use a trick: generate multiple tokens at once using a smaller, faster “draft” model, then verify them with the large model.
On average, each large-model pass yields

E[tokens per pass] = (1 − a^(k+1)) / (1 − a)

where k is the number of speculated tokens and a is the per-token acceptance rate.
If the draft model proposes 5 tokens and 4 are accepted, the large model emits 5 tokens (the 4 accepted drafts plus its own correction) in a single forward pass, roughly a 5× reduction in sequential large-model steps.
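Under the standard simplifying assumption that each draft token is accepted independently with probability a, the expected yield per verification pass can be sketched as:

```python
def expected_tokens_per_pass(k: int, a: float) -> float:
    """Expected tokens produced per large-model forward pass when the
    draft proposes k tokens, each accepted independently with prob a.
    Accepted prefix plus one correction/bonus token:
    E = 1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a)."""
    return (1 - a ** (k + 1)) / (1 - a)

print(round(expected_tokens_per_pass(5, 0.8), 2))  # → 3.69
```

So with k = 5 drafts and an 80% acceptance rate, each expensive large-model pass produces ~3.7 tokens instead of 1.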
```python
def speculative_decode(large_model, small_model, context, k=5):
    """
    Speculative decoding: use the small model to draft,
    the large model to verify.
    """
    output = []
    while not is_done(output):
        # Step 1: Small model generates k draft tokens (fast)
        drafts = []
        draft_context = context + output
        for _ in range(k):
            token = small_model.generate_one(draft_context)
            drafts.append(token)
            draft_context = draft_context + [token]
        # Step 2: Large model verifies all k tokens in parallel
        # (parallel = one forward pass = like prefill, compute-bound)
        verified = large_model.verify(context + output, drafts)
        # Step 3: Accept tokens until the first rejection
        for draft, is_correct in zip(drafts, verified):
            if is_correct:
                output.append(draft)
            else:
                # Replace the rejected draft with the large model's token
                output.append(large_model.sample(context + output))
                break
    return output
```

Understanding prefill vs. decode has direct implications:
Long context = long wait. TTFT scales linearly (at minimum) with context length. Don’t stuff context unnecessarily.
Streaming helps UX but not TTFT. Even with streaming enabled, the user waits through the entire prefill before seeing anything.
Batch wisely. Prefill is compute-bound; decode is memory-bound. Mixing them on the same GPU is suboptimal — some systems separate prefill and decode to different GPU pools.
Cache system prompts. If every request shares a long system prompt, caching its KV avoids re-computing the prefill for those tokens.
If your system prompt is 20K tokens and the user’s message is 1K tokens, caching gives a ~20× TTFT improvement.
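The arithmetic behind that claim, assuming TTFT scales linearly with the number of uncached prefill tokens:

```python
system_prompt = 20_000  # tokens, shared across requests (KV cached)
user_message = 1_000    # tokens, unique per request

cold = system_prompt + user_message  # cache miss: prefill everything
warm = user_message                  # cache hit: prefill only the suffix
print(f"TTFT speedup: ~{cold / warm:.0f}x")
```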
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai