Ring Attention and Sequence Parallelism: Distributing Context Across GPUs

When a single GPU can't hold the KV cache, you distribute the sequence across multiple GPUs. Here's how ring attention enables million-token contexts.

The Assembly Line Analogy

Imagine you’re building cars. One worker can’t build the entire car alone — it’s too complex and takes too long. So you create an assembly line: each worker handles one part, passes it to the next.

But what if you have a car that’s too long to fit in one workstation? You split the car into sections, with each workstation working on its section simultaneously, occasionally passing parts between stations.

That’s sequence parallelism. When the context (sequence) is too long for one GPU, you split it across multiple GPUs and coordinate the work.

Types of Parallelism in LLMs

There are three main ways to distribute LLM work across GPUs:

1. Tensor Parallelism (TP)

Split each layer’s weights across GPUs. Every GPU processes the full sequence but only part of each matrix multiplication.

W = [W_1 \mid W_2 \mid \ldots \mid W_P] \quad \text{split columns across } P \text{ GPUs}

Good for: Large models that don’t fit on one GPU
Limitation: Doesn’t help with long sequences (each GPU still sees the full context)
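
A minimal NumPy sketch of the column split (toy shapes, no real GPUs — the list of shards stands in for per-device weights):

```python
import numpy as np

# Toy tensor parallelism: split W's columns across P "GPUs",
# let each compute its partial matmul, then concatenate the results.
rng = np.random.default_rng(0)
P = 4
X = rng.standard_normal((8, 16))       # (sequence, d_in) - full sequence on every GPU
W = rng.standard_normal((16, 32))      # (d_in, d_out)

shards = np.split(W, P, axis=1)        # column split: each shard is (16, 8)
partial = [X @ W_p for W_p in shards]  # each GPU's local matmul
Y = np.concatenate(partial, axis=1)    # gather along the output dimension

assert np.allclose(Y, X @ W)           # identical to the unsharded matmul
```

Note that every "GPU" still needs the full `X` — exactly the limitation stated above.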

2. Pipeline Parallelism (PP)

Assign different layers to different GPUs. GPU 1 runs layers 1–20, GPU 2 runs layers 21–40, etc.

Good for: Very deep models
Limitation: Sequential dependency between stages creates idle "bubbles"

3. Sequence Parallelism (SP)

Split the sequence across GPUs. Each GPU holds a chunk of the context.

[t_1, \ldots, t_n] = \underbrace{[t_1, \ldots, t_{n/P}]}_{\text{GPU 1}} \mid \underbrace{[t_{n/P+1}, \ldots, t_{2n/P}]}_{\text{GPU 2}} \mid \ldots

Good for: Long contexts that exceed single-GPU memory

This is where Ring Attention comes in.

The Problem: Attention Needs Global Context

Self-attention requires every token to attend to every other token. If you split the sequence across GPUs, each GPU only has a portion of the keys and values. How does GPU 1 compute attention over keys that live on GPU 4?

\alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d})}{\sum_{l=1}^{n} \exp(q_i \cdot k_l / \sqrt{d})}

Token i on GPU 1 needs to compute dot products with token j on GPU 4. The keys must somehow travel between GPUs.
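
A small NumPy illustration of the problem (toy sizes, not the document's code): the softmax denominator sums over all n keys, so a softmax over only GPU 1's local keys gives different weights than the true attention distribution.

```python
import numpy as np

# The softmax denominator needs ALL n keys, including remote ones.
rng = np.random.default_rng(0)
n, d, P = 16, 8, 4
q = rng.standard_normal(d)             # one query living on GPU 1
K = rng.standard_normal((n, d))        # keys spread across P GPUs

scores = K @ q / np.sqrt(d)
alpha = np.exp(scores) / np.exp(scores).sum()   # true weights: needs every k_j

# Softmax over only GPU 1's local n/P keys - wrong normalization:
local = scores[: n // P]
alpha_local = np.exp(local) / np.exp(local).sum()

assert not np.allclose(alpha[: n // P], alpha_local)
```

This is why the K,V blocks (or the queries) must move between GPUs — or why the partial softmax statistics must be merged, which is what ring attention does below.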

Ring Attention: The Solution

Ring Attention (Liu et al., 2023) arranges GPUs in a ring topology. Each GPU:

  1. Holds its local chunk of Q, K, V
  2. Computes attention between its Q and the currently available K, V
  3. Passes its K, V to the next GPU in the ring
  4. Receives K, V from the previous GPU
  5. Repeats until it has seen all K, V blocks

```python
def ring_attention(
    local_Q,      # This GPU's queries (n/P × d)
    local_K,      # This GPU's keys (n/P × d)
    local_V,      # This GPU's values (n/P × d)
    n_gpus,       # Number of GPUs in the ring
    gpu_rank,     # This GPU's rank (0 to P-1)
):
    """
    Ring Attention (pseudocode): each GPU computes attention over
    all K,V blocks by passing them around a ring.

    send_async / recv_async / wait_communication are placeholders
    for the communication library (e.g. NCCL point-to-point ops).

    Total communication: each GPU sends/receives K,V
    (n_gpus - 1) times.
    """
    n_local, d = local_Q.shape

    # Initialize output accumulator and online-softmax statistics
    O = zeros(n_local, d)              # Output
    m = full(n_local, -float('inf'))   # Running row max
    l = zeros(n_local)                 # Running row sum

    # K, V block currently held by this GPU
    current_K = local_K
    current_V = local_V

    for step in range(n_gpus):
        # ========================================
        # OVERLAP: Communication + Computation
        # ========================================

        # Start async send/receive (NON-BLOCKING)
        if step < n_gpus - 1:
            # Send current K, V to next GPU
            send_async(current_K, current_V, dest=(gpu_rank + 1) % n_gpus)
            # Receive K, V from previous GPU
            next_K, next_V = recv_async(src=(gpu_rank - 1) % n_gpus)

        # COMPUTE: attention between local Q and current K, V
        # This runs WHILE communication happens
        S = local_Q @ current_K.T / sqrt(d)   # (n_local × n/P)

        # Online softmax update (same as Flash Attention)
        m_new = maximum(m, S.max(dim=-1))
        correction = exp(m - m_new)           # rescales the old accumulator
        P_blk = exp(S - m_new.unsqueeze(-1))  # unnormalized probabilities
        l_new = correction * l + P_blk.sum(dim=-1)
        O = (correction.unsqueeze(-1) * l.unsqueeze(-1) * O
             + P_blk @ current_V) / l_new.unsqueeze(-1)
        m = m_new
        l = l_new

        # Wait for communication to complete before reusing the buffers
        if step < n_gpus - 1:
            wait_communication()
            current_K = next_K
            current_V = next_V

    return O
```
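
The pseudocode above can be checked end-to-end in a single-process NumPy simulation: the K,V blocks "rotate" around the ring by indexing instead of network sends, and the result should match ordinary full attention (no causal mask, toy sizes):

```python
import numpy as np

# Single-process simulation of the ring schedule: each rank folds
# successive K,V blocks into its output with the online-softmax update.
rng = np.random.default_rng(0)
n, d, P = 16, 8, 4
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

Qb, Kb, Vb = (np.split(M, P) for M in (Q, K, V))
outputs = []
for rank in range(P):
    O = np.zeros((n // P, d))
    m = np.full(n // P, -np.inf)
    l = np.zeros(n // P)
    cur_K, cur_V = Kb[rank], Vb[rank]
    for step in range(P):
        S = Qb[rank] @ cur_K.T / np.sqrt(d)
        m_new = np.maximum(m, S.max(axis=-1))
        corr = np.exp(m - m_new)
        Pb = np.exp(S - m_new[:, None])
        l_new = corr * l + Pb.sum(axis=-1)
        O = (corr[:, None] * l[:, None] * O + Pb @ cur_V) / l_new[:, None]
        m, l = m_new, l_new
        # "receive from previous rank": rotate one position around the ring
        src = (rank - step - 1) % P
        cur_K, cur_V = Kb[src], Vb[src]
    outputs.append(O)
ring_O = np.concatenate(outputs)

# Reference: ordinary softmax attention over the full sequence
S = Q @ K.T / np.sqrt(d)
A = np.exp(S - S.max(axis=-1, keepdims=True))
ref_O = (A / A.sum(axis=-1, keepdims=True)) @ V

assert np.allclose(ring_O, ref_O)   # ring result == full attention
```

The online-softmax update is order-independent, so each rank can visit the blocks in whatever order the ring delivers them.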

The Key Innovation: Overlapping Communication and Computation

The magic of ring attention is that while GPU i is computing attention over the current K,V block, it is simultaneously sending that block to GPU i+1 and receiving a new block from GPU i-1.

T_{\text{step}} = \max(T_{\text{compute}}, T_{\text{communicate}})

If computation takes longer than communication (which it usually does for large blocks), the communication is completely hidden:

T_{\text{total}} = P \times T_{\text{compute}} \quad \text{(communication is free!)}

Communication Cost Analysis

Each GPU sends its local K and V blocks (each n/P × d) around the ring:

\text{Data per send} = 2 \times \frac{n}{P} \times d \times p \text{ bytes}

Total sends per GPU: P - 1 (it receives every other GPU’s data exactly once).

\text{Total communication} = (P - 1) \times 2 \times \frac{n}{P} \times d \times p

For P GPUs with NVLink bandwidth B:

T_{\text{comm}} = \frac{(P-1) \times 2 \times n \times d \times p}{P \times B}

For 8 GPUs, n = 1M, d = 128, FP16 (p = 2), NVLink at 900 GB/s:

T_{\text{comm}} = \frac{7 \times 2 \times 10^6 \times 128 \times 2}{8 \times 900 \times 10^9} \approx 0.5 \text{ ms}
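
The same arithmetic as a quick check, with the values from the example:

```python
# 8 GPUs, n = 1M tokens, d = 128, FP16 (p = 2), 900 GB/s links
P, n, d, p, B = 8, 1_000_000, 128, 2, 900e9

t_comm = (P - 1) * 2 * n * d * p / (P * B)   # seconds
print(f"{t_comm * 1e3:.2f} ms")              # ~0.5 ms per full ring pass
```

At these sizes the per-step attention compute dominates, so the communication hides entirely behind it.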

Memory Per GPU

With ring attention, each GPU stores:

  1. Local Q, K, V: 3 × (n/P) × d × p bytes
  2. KV buffer (for receiving): 2 × (n/P) × d × p bytes
  3. Output accumulator: (n/P) × d × p bytes
  4. Softmax stats: 2 × (n/P) × p bytes

Total per GPU:

M_{\text{per GPU}} \approx 6 \times \frac{n}{P} \times d \times p

For 8 GPUs, n = 1M, d = 8192 (model dim), FP16:

M_{\text{per GPU}} \approx 6 \times \frac{10^6}{8} \times 8192 \times 2 \approx 12.3 \text{ GB}

Compare to single-GPU: 6 × 10^6 × 8192 × 2 ≈ 98 GB.

Ring attention reduces per-GPU memory by a factor of P.
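
The two memory estimates above, as a quick calculation:

```python
# Per-GPU activation memory for ring attention vs. one GPU,
# using the ~6 * (n/P) * d * p estimate derived above.
n, d, p, P = 1_000_000, 8192, 2, 8

per_gpu = 6 * (n / P) * d * p   # bytes with the sequence split P ways
single = 6 * n * d * p          # bytes if one GPU held everything

print(f"{per_gpu / 1e9:.1f} GB per GPU vs {single / 1e9:.1f} GB on one GPU")
```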

DeepSpeed Ulysses: An Alternative

DeepSpeed Ulysses takes a different approach to sequence parallelism:

Instead of passing K,V around a ring, it uses all-to-all communication to redistribute the sequence:

  1. Each GPU has a chunk of the sequence with full Q, K, V
  2. All-to-all redistributes so each GPU gets all positions but only some heads
  3. Each GPU computes attention for its assigned heads
  4. All-to-all redistributes back

\text{Before all-to-all: GPU}_i \text{ has } [t_{i \cdot n/P} : t_{(i+1) \cdot n/P}] \text{ for all heads}

\text{After all-to-all: GPU}_i \text{ has all tokens for heads } [h_{i \cdot H/P} : h_{(i+1) \cdot H/P}]
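
A toy NumPy model of the redistribution (no real communication — a transpose over the GPU and head-group axes stands in for the all-to-all, and the shapes are illustrative):

```python
import numpy as np

# Before: each of P "GPUs" holds n/P tokens with all H heads.
# After:  each holds all n tokens for H/P heads.
rng = np.random.default_rng(0)
P, n, H, d = 4, 16, 8, 8
x = rng.standard_normal((P, n // P, H, d))     # (gpu, local tokens, heads, d)

blocks = x.reshape(P, n // P, P, H // P, d)    # split heads into P groups
after = blocks.transpose(2, 0, 1, 3, 4).reshape(P, n, H // P, d)

# GPU 0 now holds EVERY token for head 0 (its first assigned head)
assert np.allclose(after[0, :, 0], x.reshape(n, H, d)[:, 0])
```

Each GPU can then run ordinary (or Flash) attention over the full sequence for its slice of heads, after which a second all-to-all restores the sequence-sharded layout.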

Ring Attention vs. DeepSpeed Ulysses:

| Property | Ring Attention | DeepSpeed Ulysses |
|---|---|---|
| Communication pattern | Point-to-point (ring) | All-to-all |
| Communication volume | O(n · d) | O(n · d) |
| Overlap with compute | Yes (natural) | Harder |
| Implementation complexity | Moderate | Lower |
| Best for | Very long sequences | Moderate sequences with many heads |

Scaling Limits

The maximum context length is now bounded by total GPU memory across all nodes:

n_{\max} = \frac{P \times (M_{\text{GPU}} - M_{\text{weights}}/P' - M_{\text{overhead}})}{6 \times d \times p}

Where P' is the tensor-parallelism degree (used to shard the model weights).

With 64 H100 GPUs (80 GB each), Llama 70B:

n_{\max} \approx \frac{64 \times (80 - 140/8 - 3) \times 10^9}{6 \times 8192 \times 2} \approx 40\text{M tokens}

Theoretically, 40 million tokens of context — though attention dilution would make most of that useless.
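
Plugging in the example's figures (140 GB is the approximate FP16 weight footprint of a 70B model, 3 GB a per-GPU overhead allowance, both from the estimate above):

```python
# Max context bound: 64 H100s, TP degree 8 for the weights
P, M_gpu, M_weights, TP, M_overhead = 64, 80e9, 140e9, 8, 3e9
d, p = 8192, 2

n_max = P * (M_gpu - M_weights / TP - M_overhead) / (6 * d * p)
print(f"{n_max / 1e6:.0f}M tokens")   # ~39M with these exact figures
```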

The Practical State of Affairs

Today’s production systems typically combine:

  1. Tensor parallelism within a node (to shard the model weights)
  2. Pipeline parallelism across nodes (for model depth)
  3. Sequence parallelism / ring attention (for the long context)
  4. Flash-Attention-style fused kernels on each GPU

This stack enables the 1M+ context windows offered by frontier models — but the engineering complexity is enormous, which is why only a handful of companies can operate at this scale.


ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai