Total GPU memory = model weights + KV cache + activations + workspace. Here's the exact formula to compute maximum context length for any GPU configuration.
Imagine moving to a new apartment. You have a truck (GPU) with a fixed capacity. The furniture (model weights) takes up a constant amount of space. But then you need to load boxes (KV cache) — and the number of boxes grows with every item you own. At some point, the truck is full and you can’t fit another box no matter how hard you try.
That’s the memory wall. For every model and GPU combination, there’s a maximum context length beyond which nothing fits. This blog derives the exact formula.
Total GPU memory usage during inference:
| Component | Scales With | Typical Size |
|---|---|---|
| Model weights | Fixed | 16–810 GB |
| KV cache | Linear in sequence length $s$ | grows with context |
| Activations | Linear in sequence length $s$ | small with checkpointing |
| Workspace | Constant | 2–4 GB |

Model weight memory is $M_{\text{weights}} = P \times b$, where $P$ = number of parameters and $b$ = bytes per parameter (2 for FP16, 1 for INT8, 0.5 for INT4).
| Model | Parameters | FP16 (GB) | INT8 (GB) | INT4 (GB) |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | 16 | 8 | 4 |
| Llama 3.1 70B | 70B | 140 | 70 | 35 |
| Llama 3.1 405B | 405B | 810 | 405 | 203 |
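The weight figures above are just parameters × bytes per parameter. A minimal sketch reproducing the table (decimal gigabytes, matching its rounding):

```python
def weight_gb(params_b: float, bytes_per_param: float) -> float:
    """Model weight memory in decimal GB: parameters × bytes per parameter.
    (P × 1e9 × b) / 1e9 simplifies to P × b."""
    return params_b * bytes_per_param

for name, p in [("Llama 3.1 8B", 8), ("Llama 3.1 70B", 70), ("Llama 3.1 405B", 405)]:
    print(f"{name:<16} FP16={weight_gb(p, 2):>6.1f} GB  "
          f"INT8={weight_gb(p, 1):>6.1f} GB  INT4={weight_gb(p, 0.5):>6.1f} GB")
```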
The KV cache stores one key and one value vector per token, per layer, per KV head:

$$M_{KV} = 2 \cdot L \cdot H_{kv} \cdot d_{head} \cdot s \cdot B \cdot b_{kv}$$

where $L$ = layers, $H_{kv}$ = KV heads, $d_{head}$ = head dimension, $s$ = sequence length, $B$ = batch size, and $b_{kv}$ = bytes per KV element.

During inference, activations for the current layer need to be stored. With activation checkpointing (recomputing instead of storing), only one layer's activations are held at a time:

$$M_{act} \approx s \cdot d_{model} \cdot B \cdot b$$

Workspace covers operating system overhead, CUDA context, and temporary buffers:

$$M_{ws} \approx 2\text{–}4\ \text{GB}$$

Given GPU memory $M_{GPU}$, solve $M_{\text{weights}} + M_{KV} + M_{act} + M_{ws} \le M_{GPU}$ for the maximum $s$:

$$s_{max} = \frac{M_{GPU} - M_{\text{weights}} - M_{ws}}{2 L H_{kv} d_{head} B\, b_{kv} + d_{model} B\, b}$$

The denominator is the bytes per token, a fixed cost for each additional token in context.
```python
# Model architectures
MODELS = {
    "Llama-3.1-8B": {
        "params_b": 8, "layers": 32, "kv_heads": 8,
        "head_dim": 128, "model_dim": 4096,
    },
    "Llama-3.1-70B": {
        "params_b": 70, "layers": 80, "kv_heads": 8,
        "head_dim": 128, "model_dim": 8192,
    },
    "Llama-3.1-405B": {
        "params_b": 405, "layers": 126, "kv_heads": 8,
        "head_dim": 128, "model_dim": 16384,
    },
}

# GPU configurations
GPUS = {
    "A100-40GB": {"memory_gb": 40, "count": 1},
    "A100-80GB": {"memory_gb": 80, "count": 1},
    "H100-80GB": {"memory_gb": 80, "count": 1},
    "2×H100": {"memory_gb": 160, "count": 2},
    "4×H100": {"memory_gb": 320, "count": 4},
    "8×H100": {"memory_gb": 640, "count": 8},
    "H200-141GB": {"memory_gb": 141, "count": 1},
    "8×H200": {"memory_gb": 1128, "count": 8},
}

def max_context(model: dict, gpu_memory_gb: float,
                weight_precision: int = 2, kv_precision: int = 2,
                batch_size: int = 1, workspace_gb: float = 3.0) -> int:
    """Calculate maximum context length for a model-GPU combination."""
    # Weight memory
    weight_bytes = model["params_b"] * 1e9 * weight_precision
    weight_gb = weight_bytes / (1024**3)

    # Available memory for KV cache + activations
    available_gb = gpu_memory_gb - weight_gb - workspace_gb
    if available_gb <= 0:
        return 0
    available_bytes = available_gb * (1024**3)

    # Bytes per token: K and V, per layer, per KV head
    kv_bytes_per_token = (
        2 * model["layers"] * model["kv_heads"] * model["head_dim"]
        * batch_size * kv_precision
    )
    # Plus one layer's activations per token
    act_bytes_per_token = model["model_dim"] * batch_size * weight_precision
    bytes_per_token = kv_bytes_per_token + act_bytes_per_token

    return int(available_bytes / bytes_per_token)

# Generate comprehensive table
print(f"{'Model':<18} {'GPU':<14} {'Wt Prec':<8} {'Max Context':>12} {'~Pages':>8}")
print("=" * 65)
for model_name, model in MODELS.items():
    for gpu_name, gpu in GPUS.items():
        for w_prec, prec_name in [(2, "FP16"), (1, "INT8")]:
            s = max_context(model, gpu["memory_gb"],
                            weight_precision=w_prec, kv_precision=2)
            pages = int(s * 0.75 / 250) if s > 0 else 0
            s_str = f"{s:,}" if s > 0 else "N/A"
            print(f"{model_name:<18} {gpu_name:<14} {prec_name:<8} {s_str:>12} {pages:>8,}")
    print("-" * 65)
```

The memory crossover is where the KV cache equals model weight memory:
```python
def crossover_point(model: dict, weight_prec: int = 2, kv_prec: int = 2) -> int:
    """Context length where KV cache = model weight memory."""
    weight_bytes = model["params_b"] * 1e9 * weight_prec
    kv_bytes_per_token = (
        2 * model["layers"] * model["kv_heads"] * model["head_dim"] * kv_prec
    )
    return int(weight_bytes / kv_bytes_per_token)

print("\nMemory Crossover Points (KV cache = model weights):")
print(f"{'Model':<20} {'Crossover (tokens)':>20} {'Crossover (pages)':>18}")
print("=" * 60)
for name, model in MODELS.items():
    s = crossover_point(model)
    pages = int(s * 0.75 / 250)
    print(f"{name:<20} {s:>20,} {pages:>18,}")
```

The cost per output token increases with context, because the prefill cost scales with $s$:

$$C(s, n_{out}) = \frac{s \cdot p_{in} + n_{out} \cdot p_{out}}{10^6}$$
Where:

- $p_{in}$ = price per million input tokens
- $p_{out}$ = price per million output tokens
- $n_{out}$ = number of output tokens generated per query

This means the effective cost of one output token, amortizing the prefill over $n_{out}$ outputs, is:

$$\frac{C}{n_{out}} = \frac{p_{out}}{10^6} + \frac{s \cdot p_{in}}{10^6 \cdot n_{out}}$$

And the cost of the context itself (prefill) grows linearly with $s$:

$$C_{\text{prefill}}(s) = \frac{s \cdot p_{in}}{10^6}$$
```python
def cost_analysis(s_values, price_per_mtok_input=3.0, price_per_mtok_output=15.0):
    """Analyze API cost as a function of context length."""
    print(f"{'Context':>10} {'Input Cost':>12} {'Output Cost':>12} {'Total/query':>12}")
    print(f"{'(tokens)':>10} {'($)':>12} {'(500 tok, $)':>12} {'($)':>12}")
    print("=" * 50)
    for s in s_values:
        input_cost = s / 1e6 * price_per_mtok_input
        output_cost = 500 / 1e6 * price_per_mtok_output
        total = input_cost + output_cost
        print(f"{s:>10,} {input_cost:>12.4f} {output_cost:>12.4f} {total:>12.4f}")

cost_analysis([1_000, 10_000, 50_000, 128_000, 500_000, 1_000_000])
```

With tensor parallelism across $N$ GPUs, model weights are split:

$$M_{\text{weights}}^{\text{per-GPU}} = \frac{P \cdot b}{N}$$
But the KV cache is replicated on each GPU (each GPU needs the full KV cache for its attention computation):

$$M_{KV}^{\text{per-GPU}} = M_{KV}$$

Therefore, the maximum context with tensor parallelism is:

$$s_{max} = \frac{M_{GPU} - \frac{P \cdot b}{N} - M_{ws}}{2 L H_{kv} d_{head} B\, b_{kv} + \frac{d_{model} B\, b}{N}}$$

Note: the KV cache term doesn't decrease with $N$, because each GPU computes full attention. Only the model weight and activation terms shrink.
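Under this model (weights sharded $N$ ways, KV cache fully replicated), the per-GPU budget can be sketched as follows. `max_context_tp` is an illustrative helper, using the Llama 3.1 70B dimensions from earlier; activation sharding is ignored for simplicity:

```python
def max_context_tp(params_b: float, gpu_mem_gb: float, n_gpus: int,
                   bytes_per_token: float, weight_prec: int = 2,
                   workspace_gb: float = 3.0) -> int:
    """Max context under tensor parallelism: weights split N ways,
    KV cache fully replicated on every GPU."""
    weight_gb_per_gpu = params_b * 1e9 * weight_prec / (1024**3) / n_gpus
    available_gb = gpu_mem_gb - weight_gb_per_gpu - workspace_gb
    if available_gb <= 0:
        return 0
    return int(available_gb * (1024**3) / bytes_per_token)

# Llama 3.1 70B FP16 on H100-80GB: KV + activation bytes per token
bpt = 2 * 80 * 8 * 128 * 2 + 8192 * 2
for n in (1, 2, 4, 8):
    print(f"{n}×H100: {max_context_tp(70, 80, n, bpt):,} tokens")
```

Note the diminishing returns: each extra GPU only frees its share of the weights, while the replicated KV cache cost per token stays fixed.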
For sequence parallelism (ring attention), the KV cache IS split:

$$M_{KV}^{\text{per-GPU}} = \frac{M_{KV}}{N}$$

This scales linearly with $N$: adding more GPUs directly increases the maximum context.
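A sketch of the sequence-parallel case, under our simplifying assumptions that weights are fully replicated on each GPU and only the KV cache is sharded along the sequence axis (`max_context_ring` is a hypothetical helper, not from any library):

```python
def max_context_ring(params_b: float, gpu_mem_gb: float, n_gpus: int,
                     bytes_per_token: float, weight_prec: int = 2,
                     workspace_gb: float = 3.0) -> int:
    """Max context under sequence parallelism (ring attention).
    Assumes weights replicated per GPU; KV cache split N ways,
    so total KV capacity grows linearly with GPU count."""
    weight_gb = params_b * 1e9 * weight_prec / (1024**3)
    available_gb = gpu_mem_gb - weight_gb - workspace_gb
    if available_gb <= 0:
        return 0
    return int(n_gpus * available_gb * (1024**3) / bytes_per_token)

# Llama 3.1 8B FP16 on H100-80GB: capacity scales ~linearly with N
bpt = 2 * 32 * 8 * 128 * 2 + 4096 * 2
for n in (1, 2, 4, 8):
    print(f"{n}×H100: {max_context_ring(8, 80, n, bpt):,} tokens")
```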
Every model-GPU combination has a hard maximum context length. Knowing the formula lets you size GPU configurations before you buy, pick a quantization level that fits your target context, and budget how much of the window each query actually spends.
The memory wall is physics. You can’t argue with it — but you can plan around it.
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai