Total GPU memory = model weights + KV cache + activations + workspace. Here's the exact formula to compute maximum context length for any GPU configuration.
Imagine moving to a new apartment. You have a truck (GPU) with a fixed capacity. The furniture (model weights) takes up a constant amount of space. But then you need to load boxes (KV cache) — and the number of boxes grows with every item you own. At some point, the truck is full and you can’t fit another box no matter how hard you try.
That’s the memory wall. For every model and GPU combination, there’s a maximum context length beyond which nothing fits. This blog derives the exact formula.
Total GPU memory usage during inference:
| Component | Scales With | Typical Size |
|---|---|---|
| Model weights | Fixed | 16–810 GB |
| KV cache | Linear in sequence length $s$ | grows with context |
| Activations | Linear in sequence length $s$ | small with checkpointing |
| Workspace | Constant | 2–4 GB |

Model weight memory is $M_{\text{weights}} = P \times b$, where $P$ = number of parameters and $b$ = bytes per parameter (2 for FP16, 1 for INT8, 0.5 for INT4).
| Model | Parameters | FP16 (GB) | INT8 (GB) | INT4 (GB) |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | 16 | 8 | 4 |
| Llama 3.1 70B | 70B | 140 | 70 | 35 |
| Llama 3.1 405B | 405B | 810 | 405 | 203 |
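The weight figures above are just parameters × bytes per parameter. A minimal sketch reproducing the table (decimal gigabytes, matching its rounding):

```python
def weight_gb(params_b: float, bytes_per_param: float) -> float:
    """Model weight memory in decimal GB: parameters × bytes per parameter.
    (P × 1e9 × b) / 1e9 simplifies to P × b."""
    return params_b * bytes_per_param

for name, p in [("Llama 3.1 8B", 8), ("Llama 3.1 70B", 70), ("Llama 3.1 405B", 405)]:
    print(f"{name:<16} FP16={weight_gb(p, 2):>6.1f} GB  "
          f"INT8={weight_gb(p, 1):>6.1f} GB  INT4={weight_gb(p, 0.5):>6.1f} GB")
```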
The KV cache stores one key and one value vector per token, per layer, per KV head:

$$M_{KV} = 2 \cdot L \cdot H_{kv} \cdot d_{head} \cdot s \cdot B \cdot b_{kv}$$

where $L$ = layers, $H_{kv}$ = KV heads, $d_{head}$ = head dimension, $s$ = sequence length, $B$ = batch size, and $b_{kv}$ = bytes per KV element.

During inference, activations for the current layer need to be stored. With activation checkpointing (recomputing instead of storing), only one layer's activations are held at a time:

$$M_{act} \approx s \cdot d_{model} \cdot B \cdot b$$

Workspace covers operating system overhead, CUDA context, and temporary buffers:

$$M_{ws} \approx 2\text{–}4\ \text{GB}$$

Given GPU memory $M_{GPU}$, solve $M_{\text{weights}} + M_{KV} + M_{act} + M_{ws} \le M_{GPU}$ for the maximum $s$:

$$s_{max} = \frac{M_{GPU} - M_{\text{weights}} - M_{ws}}{2 L H_{kv} d_{head} B\, b_{kv} + d_{model} B\, b}$$

The denominator is the bytes per token, a fixed cost for each additional token in context.
```python
# Model architectures
MODELS = {
    "Llama-3.1-8B": {
        "params_b": 8, "layers": 32, "kv_heads": 8,
        "head_dim": 128, "model_dim": 4096,
    },
    "Llama-3.1-70B": {
        "params_b": 70, "layers": 80, "kv_heads": 8,
        "head_dim": 128, "model_dim": 8192,
    },
    "Llama-3.1-405B": {
        "params_b": 405, "layers": 126, "kv_heads": 8,
        "head_dim": 128, "model_dim": 16384,
    },
}

# GPU configurations
GPUS = {
    "A100-40GB": {"memory_gb": 40, "count": 1},
    "A100-80GB": {"memory_gb": 80, "count": 1},
    "H100-80GB": {"memory_gb": 80, "count": 1},
    "2×H100": {"memory_gb": 160, "count": 2},
    "4×H100": {"memory_gb": 320, "count": 4},
    "8×H100": {"memory_gb": 640, "count": 8},
    "H200-141GB": {"memory_gb": 141, "count": 1},
    "8×H200": {"memory_gb": 1128, "count": 8},
}

def max_context(model: dict, gpu_memory_gb: float,
                weight_precision: int = 2, kv_precision: int = 2,
                batch_size: int = 1, workspace_gb: float = 3.0) -> int:
    """Calculate maximum context length for a model-GPU combination."""
    # Weight memory
    weight_bytes = model["params_b"] * 1e9 * weight_precision
    weight_gb = weight_bytes / (1024**3)

    # Available memory for KV cache + activations
    available_gb = gpu_memory_gb - weight_gb - workspace_gb
    if available_gb <= 0:
        return 0
    available_bytes = available_gb * (1024**3)

    # Bytes per token: K and V, per layer, per KV head
    kv_bytes_per_token = (
        2 * model["layers"] * model["kv_heads"] * model["head_dim"]
        * batch_size * kv_precision
    )
    # Plus one layer's activations per token
    act_bytes_per_token = model["model_dim"] * batch_size * weight_precision
    bytes_per_token = kv_bytes_per_token + act_bytes_per_token

    return int(available_bytes / bytes_per_token)

# Generate comprehensive table
print(f"{'Model':<18} {'GPU':<14} {'Wt Prec':<8} {'Max Context':>12} {'~Pages':>8}")
print("=" * 65)
for model_name, model in MODELS.items():
    for gpu_name, gpu in GPUS.items():
        for w_prec, prec_name in [(2, "FP16"), (1, "INT8")]:
            s = max_context(model, gpu["memory_gb"],
                            weight_precision=w_prec, kv_precision=2)
            pages = int(s * 0.75 / 250) if s > 0 else 0
            s_str = f"{s:,}" if s > 0 else "N/A"
            print(f"{model_name:<18} {gpu_name:<14} {prec_name:<8} {s_str:>12} {pages:>8,}")
    print("-" * 65)
```

The memory crossover is where the KV cache equals model weight memory:
```python
def crossover_point(model: dict, weight_prec: int = 2, kv_prec: int = 2) -> int:
    """Context length where KV cache = model weight memory."""
    weight_bytes = model["params_b"] * 1e9 * weight_prec
    kv_bytes_per_token = (
        2 * model["layers"] * model["kv_heads"] * model["head_dim"] * kv_prec
    )
    return int(weight_bytes / kv_bytes_per_token)

print("\nMemory Crossover Points (KV cache = model weights):")
print(f"{'Model':<20} {'Crossover (tokens)':>20} {'Crossover (pages)':>18}")
print("=" * 60)
for name, model in MODELS.items():
    s = crossover_point(model)
    pages = int(s * 0.75 / 250)
    print(f"{name:<20} {s:>20,} {pages:>18,}")
```

The cost per output token increases with context, because the prefill cost scales with $s$:

$$C(s, n_{out}) = \frac{s \cdot p_{in} + n_{out} \cdot p_{out}}{10^6}$$
Where:

- $p_{in}$ = price per million input tokens
- $p_{out}$ = price per million output tokens
- $n_{out}$ = number of output tokens generated per query

This means the effective cost of one output token, amortizing the prefill over $n_{out}$ outputs, is:

$$\frac{C}{n_{out}} = \frac{p_{out}}{10^6} + \frac{s \cdot p_{in}}{10^6 \cdot n_{out}}$$

And the cost of the context itself (prefill) grows linearly with $s$:

$$C_{\text{prefill}}(s) = \frac{s \cdot p_{in}}{10^6}$$
```python
def cost_analysis(s_values, price_per_mtok_input=3.0, price_per_mtok_output=15.0):
    """Analyze API cost as a function of context length."""
    print(f"{'Context':>10} {'Input Cost':>12} {'Output Cost':>12} {'Total/query':>12}")
    print(f"{'(tokens)':>10} {'($)':>12} {'(500 tok, $)':>12} {'($)':>12}")
    print("=" * 50)
    for s in s_values:
        input_cost = s / 1e6 * price_per_mtok_input
        output_cost = 500 / 1e6 * price_per_mtok_output
        total = input_cost + output_cost
        print(f"{s:>10,} {input_cost:>12.4f} {output_cost:>12.4f} {total:>12.4f}")

cost_analysis([1_000, 10_000, 50_000, 128_000, 500_000, 1_000_000])
```

With tensor parallelism across $N$ GPUs, model weights are split:

$$M_{\text{weights}}^{\text{per-GPU}} = \frac{P \cdot b}{N}$$
But the KV cache is replicated on each GPU (each GPU needs the full KV cache for its attention computation):

$$M_{KV}^{\text{per-GPU}} = M_{KV}$$

Therefore, the maximum context with tensor parallelism is:

$$s_{max} = \frac{M_{GPU} - \frac{P \cdot b}{N} - M_{ws}}{2 L H_{kv} d_{head} B\, b_{kv} + \frac{d_{model} B\, b}{N}}$$

Note: the KV cache term doesn't decrease with $N$, because each GPU computes full attention. Only the model weight and activation terms shrink.
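Under this model (weights sharded $N$ ways, KV cache fully replicated), the per-GPU budget can be sketched as follows. `max_context_tp` is an illustrative helper, using the Llama 3.1 70B dimensions from earlier; activation sharding is ignored for simplicity:

```python
def max_context_tp(params_b: float, gpu_mem_gb: float, n_gpus: int,
                   bytes_per_token: float, weight_prec: int = 2,
                   workspace_gb: float = 3.0) -> int:
    """Max context under tensor parallelism: weights split N ways,
    KV cache fully replicated on every GPU."""
    weight_gb_per_gpu = params_b * 1e9 * weight_prec / (1024**3) / n_gpus
    available_gb = gpu_mem_gb - weight_gb_per_gpu - workspace_gb
    if available_gb <= 0:
        return 0
    return int(available_gb * (1024**3) / bytes_per_token)

# Llama 3.1 70B FP16 on H100-80GB: KV + activation bytes per token
bpt = 2 * 80 * 8 * 128 * 2 + 8192 * 2
for n in (1, 2, 4, 8):
    print(f"{n}×H100: {max_context_tp(70, 80, n, bpt):,} tokens")
```

Note the diminishing returns: each extra GPU only frees its share of the weights, while the replicated KV cache cost per token stays fixed.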
For sequence parallelism (ring attention), the KV cache IS split:

$$M_{KV}^{\text{per-GPU}} = \frac{M_{KV}}{N}$$

This scales linearly with $N$: adding more GPUs directly increases the maximum context.
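A sketch of the sequence-parallel case, under our simplifying assumptions that weights are fully replicated on each GPU and only the KV cache is sharded along the sequence axis (`max_context_ring` is a hypothetical helper, not from any library):

```python
def max_context_ring(params_b: float, gpu_mem_gb: float, n_gpus: int,
                     bytes_per_token: float, weight_prec: int = 2,
                     workspace_gb: float = 3.0) -> int:
    """Max context under sequence parallelism (ring attention).
    Assumes weights replicated per GPU; KV cache split N ways,
    so total KV capacity grows linearly with GPU count."""
    weight_gb = params_b * 1e9 * weight_prec / (1024**3)
    available_gb = gpu_mem_gb - weight_gb - workspace_gb
    if available_gb <= 0:
        return 0
    return int(n_gpus * available_gb * (1024**3) / bytes_per_token)

# Llama 3.1 8B FP16 on H100-80GB: capacity scales ~linearly with N
bpt = 2 * 32 * 8 * 128 * 2 + 4096 * 2
for n in (1, 2, 4, 8):
    print(f"{n}×H100: {max_context_ring(8, 80, n, bpt):,} tokens")
```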
Every model-GPU combination has a hard maximum context length. Knowing the formula lets you size GPU configurations before you buy, pick a quantization level that fits your target context, and budget how much of the window each query actually spends.
The memory wall is physics. You can’t argue with it — but you can plan around it.
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai