The Quadratic Memory Wall: A Precise Analysis of GPU Memory Requirements per Context Length

Total GPU memory = model weights + KV cache + activations + workspace. Here's the exact formula to compute maximum context length for any GPU configuration.


The Quadratic Memory Wall

The Moving Truck Analogy

Imagine moving to a new apartment. You have a truck (GPU) with a fixed capacity. The furniture (model weights) takes up a constant amount of space. But then you need to load boxes (KV cache) — and the number of boxes grows with every item you own. At some point, the truck is full and you can’t fit another box no matter how hard you try.

That’s the memory wall. For every model and GPU combination, there’s a maximum context length beyond which nothing fits. This blog derives the exact formula.

Memory Decomposition

Total GPU memory usage during inference:

M_{\text{total}} = M_{\text{weights}} + M_{\text{KV}}(s) + M_{\text{activations}}(s) + M_{\text{workspace}}

| Component | Scales with | Typical size |
|---|---|---|
| M_{\text{weights}} | Model parameters | Fixed (16–810 GB) |
| M_{\text{KV}}(s) | Sequence length s | Linear in s |
| M_{\text{activations}}(s) | Sequence length s | Linear in s |
| M_{\text{workspace}} | Constant | 2–4 GB |

Model Weights

M_{\text{weights}} = P \times p

where P is the number of parameters and p is the number of bytes per parameter.

| Model | Parameters | FP16 (GB) | INT8 (GB) | INT4 (GB) |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | 16 | 8 | 4 |
| Llama 3.1 70B | 70B | 140 | 70 | 35 |
| Llama 3.1 405B | 405B | 810 | 405 | 203 |
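As a sanity check on the table, the weight formula collapses to parameters-in-billions times bytes-per-parameter when memory is reported in decimal GB (an assumption made here; GB and GiB are often mixed in hardware specs):

```python
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Model weight memory in decimal GB: (P × 1e9 params × p bytes) / 1e9."""
    return params_billions * bytes_per_param

# Reproduce the FP16 column of the table above
for params, expected in [(8, 16), (70, 140), (405, 810)]:
    assert weight_gb(params, 2) == expected
```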

KV Cache (Derived Previously)

M_{\text{KV}}(s) = 2 \times L \times h_{kv} \times d_h \times s \times b \times p_{kv}

where L is the number of layers, h_{kv} the number of KV heads, d_h the head dimension, b the batch size, and p_{kv} the bytes per cached element.
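Plugging in Llama 3.1 70B's GQA configuration (L = 80, h_kv = 8, d_h = 128, FP16 KV cache) gives a concrete per-token cost, a sketch assuming batch size 1:

```python
# KV cache bytes per token: 2 (K and V) × layers × kv_heads × head_dim × bytes
L, h_kv, d_h, p_kv = 80, 8, 128, 2  # Llama 3.1 70B, FP16 KV cache
kv_bytes_per_token = 2 * L * h_kv * d_h * p_kv
print(kv_bytes_per_token)                  # 327680 bytes, ~0.33 MB per token
print(kv_bytes_per_token * 128_000 / 1e9)  # ~41.9 GB at a 128K-token context
```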

Activations

During inference, activations for the current layer need to be stored:

M_{\text{activations}}(s) \approx L_{\text{active}} \times s \times d \times b \times p

With activation checkpointing (recomputing instead of storing), L_{\text{active}} = 1 (only one layer's activations at a time):

M_{\text{activations}}(s) \approx s \times d \times b \times p
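With checkpointing, the activation term is small next to the KV cache. A sketch for Llama 3.1 70B (d = 8192, FP16), assuming batch size 1:

```python
d, b, p = 8192, 1, 2             # model dim, batch size, bytes per value (FP16)
act_bytes_per_token = d * b * p  # 16384 bytes per token

# KV cache cost per token for the same model, from the KV formula
kv_bytes_per_token = 2 * 80 * 8 * 128 * 2
print(kv_bytes_per_token / act_bytes_per_token)  # 20.0: KV cache dominates
```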

Workspace

Operating system overhead, CUDA context, temporary buffers:

M_{\text{workspace}} \approx 2\text{–}4 \text{ GB}

Deriving Maximum Context Length

Given GPU memory M_{\text{GPU}}, solve for the maximum s:

M_{\text{GPU}} \geq M_{\text{weights}} + M_{\text{KV}}(s) + M_{\text{activations}}(s) + M_{\text{workspace}}

M_{\text{GPU}} - M_{\text{weights}} - M_{\text{workspace}} \geq s \times (2 L h_{kv} d_h b p_{kv} + d \cdot b \cdot p)

s_{\max} = \frac{M_{\text{GPU}} - M_{\text{weights}} - M_{\text{workspace}}}{2 L h_{kv} d_h b p_{kv} + d \cdot b \cdot p}

The denominator is the bytes per token — a fixed cost for each additional token in context.


# Model architectures
MODELS = {
    "Llama-3.1-8B": {
        "params_b": 8, "layers": 32, "kv_heads": 8,
        "head_dim": 128, "model_dim": 4096,
    },
    "Llama-3.1-70B": {
        "params_b": 70, "layers": 80, "kv_heads": 8,
        "head_dim": 128, "model_dim": 8192,
    },
    "Llama-3.1-405B": {
        "params_b": 405, "layers": 126, "kv_heads": 8,
        "head_dim": 128, "model_dim": 16384,
    },
}

# GPU configurations
GPUS = {
    "A100-40GB": {"memory_gb": 40, "count": 1},
    "A100-80GB": {"memory_gb": 80, "count": 1},
    "H100-80GB": {"memory_gb": 80, "count": 1},
    "2×H100": {"memory_gb": 160, "count": 2},
    "4×H100": {"memory_gb": 320, "count": 4},
    "8×H100": {"memory_gb": 640, "count": 8},
    "H200-141GB": {"memory_gb": 141, "count": 1},
    "8×H200": {"memory_gb": 1128, "count": 8},
}


def max_context(model: dict, gpu_memory_gb: float,
                weight_precision: int = 2, kv_precision: int = 2,
                batch_size: int = 1, workspace_gb: float = 3.0) -> int:
    """Calculate maximum context length for a model-GPU combination."""
    # Weight memory (decimal GB throughout, consistent with the tables above)
    weight_bytes = model["params_b"] * 1e9 * weight_precision
    weight_gb = weight_bytes / 1e9

    # Available memory for KV cache + activations
    available_gb = gpu_memory_gb - weight_gb - workspace_gb
    if available_gb <= 0:
        return 0

    available_bytes = available_gb * 1e9

    # Bytes per token
    kv_bytes_per_token = (
        2 * model["layers"] * model["kv_heads"] * model["head_dim"]
        * batch_size * kv_precision
    )
    act_bytes_per_token = model["model_dim"] * batch_size * weight_precision

    bytes_per_token = kv_bytes_per_token + act_bytes_per_token

    return int(available_bytes / bytes_per_token)


# Generate comprehensive table
print(f"{'Model':<18} {'GPU':<14} {'Wt Prec':<8} {'Max Context':>12} {'~Pages':>8}")
print("=" * 65)

for model_name, model in MODELS.items():
    for gpu_name, gpu in GPUS.items():
        for w_prec, prec_name in [(2, "FP16"), (1, "INT8")]:
            s = max_context(model, gpu["memory_gb"],
                           weight_precision=w_prec, kv_precision=2)
            pages = int(s * 0.75 / 250) if s > 0 else 0
            s_str = f"{s:,}" if s > 0 else "N/A"
            print(f"{model_name:<18} {gpu_name:<14} {prec_name:<8} {s_str:>12} {pages:>8,}")
    print("-" * 65)

The Crossover Point

The memory crossover is where KV cache equals model weight memory:

M_{\text{KV}}(s_{\text{cross}}) = M_{\text{weights}}

s_{\text{cross}} = \frac{P \times p_w}{2 \times L \times h_{kv} \times d_h \times p_{kv}}

def crossover_point(model: dict, weight_prec: int = 2, kv_prec: int = 2) -> int:
    """Context length where KV cache = model weight memory."""
    weight_bytes = model["params_b"] * 1e9 * weight_prec
    kv_bytes_per_token = 2 * model["layers"] * model["kv_heads"] * model["head_dim"] * kv_prec
    return int(weight_bytes / kv_bytes_per_token)

print("\nMemory Crossover Points (KV cache = model weights):")
print(f"{'Model':<20} {'Crossover (tokens)':>20} {'Crossover (pages)':>18}")
print("=" * 60)
for name, model in MODELS.items():
    s = crossover_point(model)
    pages = int(s * 0.75 / 250)
    print(f"{name:<20} {s:>20,} {pages:>18,}")

Cost Per Token as a Function of Context

The cost per output token increases with context because prefill and KV cache access both scale with s:

\text{Cost per output token} = C_{\text{base}} + C_{\text{KV\_read}} \times s

where C_{\text{base}} is the per-token cost independent of context (weight reads and MLP compute) and C_{\text{KV\_read}} is the cost of reading one context token's KV cache entries during attention.

This means the marginal cost of one additional output token is:

\frac{\partial \text{Cost}}{\partial t_{\text{output}}} = C_{\text{base}} + C_{\text{KV\_read}} \times s

And the cost of context itself (prefill):

\text{Prefill cost} \propto s^2 \times d
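The quadratic term is easy to see in code. A rough sketch of attention-score FLOPs during prefill, counting only the s × s score and value products and ignoring the linear-in-s projections (the constant 4 and the helper name are illustrative):

```python
def prefill_attention_flops(layers: int, model_dim: int, s: int) -> float:
    # QK^T and scores×V are each roughly 2·s²·d multiply-adds per layer
    return 4 * layers * s**2 * model_dim

f_64k = prefill_attention_flops(80, 8192, 64_000)
f_128k = prefill_attention_flops(80, 8192, 128_000)
print(f_128k / f_64k)  # 4.0: doubling the context quadruples attention compute
```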

def cost_analysis(s_values, price_per_mtok_input=3.0, price_per_mtok_output=15.0):
    """Analyze API cost as a function of context length."""
    print(f"{'Context':>10} {'Input Cost':>12} {'Output Cost':>12} {'Total/query':>12}")
    print(f"{'(tokens)':>10} {'($)':>12} {'(500 tok, $)':>12} {'($)':>12}")
    print("=" * 50)

    for s in s_values:
        input_cost = s / 1e6 * price_per_mtok_input
        output_cost = 500 / 1e6 * price_per_mtok_output
        total = input_cost + output_cost
        print(f"{s:>10,} {input_cost:>12.4f} {output_cost:>12.4f} {total:>12.4f}")

cost_analysis([1_000, 10_000, 50_000, 128_000, 500_000, 1_000_000])

Multi-GPU Scaling

With tensor parallelism across P GPUs, model weights are split:

M_{\text{weights per GPU}} = \frac{M_{\text{weights}}}{P}

But KV cache is replicated on each GPU (each GPU needs the full KV cache for its attention computation):

M_{\text{KV per GPU}} = M_{\text{KV}}(s) \quad \text{(full copy)}

Therefore, the maximum context with tensor parallelism:

s_{\max}^{\text{TP}} = \frac{M_{\text{GPU}} - M_{\text{weights}}/P - M_{\text{workspace}}}{2 L h_{kv} d_h p_{kv} + d \cdot p / P}

Note: KV cache doesn't decrease with P because each GPU computes full attention. Only model weight memory and activation memory decrease.

For sequence parallelism (ring attention), KV cache IS split:

s_{\max}^{\text{SP}} = P \times \frac{M_{\text{GPU}} - M_{\text{weights}}/P - M_{\text{workspace}}}{2 L h_{kv} d_h p_{kv} + d \cdot p}

This scales linearly with P: adding more GPUs directly increases the maximum context.
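The two formulas can be compared directly. A sketch for Llama 3.1 70B on 8×H100, assuming FP16 weights and KV cache, batch size 1, 3 GB workspace per GPU, and decimal GB throughout (the helper names are illustrative, not a framework API):

```python
llama70b = dict(params_b=70, layers=80, kv_heads=8, head_dim=128, model_dim=8192)

def free_bytes_per_gpu(m, gpu_gb, n, p=2, workspace_gb=3.0):
    # Per-GPU memory left after sharded weights and workspace (decimal GB)
    return (gpu_gb - m["params_b"] * p / n - workspace_gb) * 1e9

def max_context_tp(m, gpu_gb, n, p=2):
    # Tensor parallelism: KV cache fully replicated on every GPU
    kv = 2 * m["layers"] * m["kv_heads"] * m["head_dim"] * p
    return int(free_bytes_per_gpu(m, gpu_gb, n, p) / (kv + m["model_dim"] * p / n))

def max_context_sp(m, gpu_gb, n, p=2):
    # Sequence parallelism (ring attention): KV cache sharded across GPUs
    kv = 2 * m["layers"] * m["kv_heads"] * m["head_dim"] * p
    return int(n * free_bytes_per_gpu(m, gpu_gb, n, p) / (kv + m["model_dim"] * p))

print(max_context_tp(llama70b, 80, 8))  # KV replication caps the context
print(max_context_sp(llama70b, 80, 8))  # roughly 8× longer with the cache sharded
```

The printout makes the trade-off concrete: on the same hardware, sharding the KV cache buys close to a P-fold increase in maximum context.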

The Bottom Line

Every model-GPU combination has a hard maximum context length. Knowing the formula lets you:

  1. Budget hardware for your target context length
  2. Choose the right quantization to extend context within memory limits
  3. Decide between tensor and sequence parallelism based on whether you need longer context or more throughput
  4. Calculate the ROI of adding GPUs vs. reducing context via retrieval

The memory wall is physics. You can’t argue with it — but you can plan around it.


ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai