Ring Attention and Sequence Parallelism: Distributing Context Across GPUs

When a single GPU can't hold the KV cache, you distribute the sequence across multiple GPUs. Here's how ring attention enables million-token contexts.

The Assembly Line Analogy

Imagine you’re building cars. One worker can’t build the entire car alone — it’s too complex and takes too long. So you create an assembly line: each worker handles one part, passes it to the next.

But what if you have a car that’s too long to fit in one workstation? You split the car into sections, with each workstation working on its section simultaneously, occasionally passing parts between stations.

That’s sequence parallelism. When the context (sequence) is too long for one GPU, you split it across multiple GPUs and coordinate the work.

Types of Parallelism in LLMs

There are three main ways to distribute LLM work across GPUs:

1. Tensor Parallelism (TP)

Split each layer’s weights across GPUs. Every GPU processes the full sequence but only part of each matrix multiplication.

W = [W_1 \mid W_2 \mid \ldots \mid W_P] \quad \text{split columns across } P \text{ GPUs}

Good for: Large models that don’t fit on one GPU
Limitation: Doesn’t help with long sequences (each GPU still sees the full context)
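
A minimal NumPy sketch of the column split (toy shapes, no real GPUs — the list of shards stands in for per-device weights):

```python
import numpy as np

# Toy tensor parallelism: split W's columns across P "GPUs",
# let each compute its partial matmul, then concatenate the results.
rng = np.random.default_rng(0)
P = 4
X = rng.standard_normal((8, 16))       # (sequence, d_in) - full sequence on every GPU
W = rng.standard_normal((16, 32))      # (d_in, d_out)

shards = np.split(W, P, axis=1)        # column split: each shard is (16, 8)
partial = [X @ W_p for W_p in shards]  # each GPU's local matmul
Y = np.concatenate(partial, axis=1)    # gather along the output dimension

assert np.allclose(Y, X @ W)           # identical to the unsharded matmul
```

Note that every "GPU" still needs the full `X` — exactly the limitation stated above.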

2. Pipeline Parallelism (PP)

Assign different layers to different GPUs. GPU 1 runs layers 1–20, GPU 2 runs layers 21–40, etc.

Good for: Very deep models
Limitation: Sequential dependency between stages creates idle "bubbles"

3. Sequence Parallelism (SP)

Split the sequence across GPUs. Each GPU holds a chunk of the context.

[t_1, \ldots, t_n] = \underbrace{[t_1, \ldots, t_{n/P}]}_{\text{GPU 1}} \mid \underbrace{[t_{n/P+1}, \ldots, t_{2n/P}]}_{\text{GPU 2}} \mid \ldots

Good for: Long contexts that exceed single-GPU memory

This is where Ring Attention comes in.

The Problem: Attention Needs Global Context

Self-attention requires every token to attend to every other token. If you split the sequence across GPUs, each GPU only has a portion of the keys and values. How does GPU 1 compute attention over keys that live on GPU 4?

\alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d})}{\sum_{l=1}^{n} \exp(q_i \cdot k_l / \sqrt{d})}

Token i on GPU 1 needs to compute dot products with token j on GPU 4. The keys must somehow travel between GPUs.
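
A small NumPy illustration of the problem (toy sizes, not the document's code): the softmax denominator sums over all n keys, so a softmax over only GPU 1's local keys gives different weights than the true attention distribution.

```python
import numpy as np

# The softmax denominator needs ALL n keys, including remote ones.
rng = np.random.default_rng(0)
n, d, P = 16, 8, 4
q = rng.standard_normal(d)             # one query living on GPU 1
K = rng.standard_normal((n, d))        # keys spread across P GPUs

scores = K @ q / np.sqrt(d)
alpha = np.exp(scores) / np.exp(scores).sum()   # true weights: needs every k_j

# Softmax over only GPU 1's local n/P keys - wrong normalization:
local = scores[: n // P]
alpha_local = np.exp(local) / np.exp(local).sum()

assert not np.allclose(alpha[: n // P], alpha_local)
```

This is why the K,V blocks (or the queries) must move between GPUs — or why the partial softmax statistics must be merged, which is what ring attention does below.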

Ring Attention: The Solution

Ring Attention (Liu et al., 2023) arranges GPUs in a ring topology. Each GPU:

  1. Holds its local chunk of Q, K, V
  2. Computes attention between its Q and the currently available K, V
  3. Passes its K, V to the next GPU in the ring
  4. Receives K, V from the previous GPU
  5. Repeats until it has seen all K, V blocks

```python
def ring_attention(
    local_Q,      # This GPU's queries (n/P × d)
    local_K,      # This GPU's keys (n/P × d)
    local_V,      # This GPU's values (n/P × d)
    n_gpus,       # Number of GPUs in the ring
    gpu_rank,     # This GPU's rank (0 to P-1)
):
    """
    Ring Attention (pseudocode): each GPU computes attention over
    all K,V blocks by passing them around a ring.

    send_async / recv_async / wait_communication are placeholders
    for the communication library (e.g. NCCL point-to-point ops).

    Total communication: each GPU sends/receives K,V
    (n_gpus - 1) times.
    """
    n_local, d = local_Q.shape

    # Initialize output accumulator and online-softmax statistics
    O = zeros(n_local, d)              # Output
    m = full(n_local, -float('inf'))   # Running row max
    l = zeros(n_local)                 # Running row sum

    # K, V block currently held by this GPU
    current_K = local_K
    current_V = local_V

    for step in range(n_gpus):
        # ========================================
        # OVERLAP: Communication + Computation
        # ========================================

        # Start async send/receive (NON-BLOCKING)
        if step < n_gpus - 1:
            # Send current K, V to next GPU
            send_async(current_K, current_V, dest=(gpu_rank + 1) % n_gpus)
            # Receive K, V from previous GPU
            next_K, next_V = recv_async(src=(gpu_rank - 1) % n_gpus)

        # COMPUTE: attention between local Q and current K, V
        # This runs WHILE communication happens
        S = local_Q @ current_K.T / sqrt(d)   # (n_local × n/P)

        # Online softmax update (same as Flash Attention)
        m_new = maximum(m, S.max(dim=-1))
        correction = exp(m - m_new)           # rescales the old accumulator
        P_blk = exp(S - m_new.unsqueeze(-1))  # unnormalized probabilities
        l_new = correction * l + P_blk.sum(dim=-1)
        O = (correction.unsqueeze(-1) * l.unsqueeze(-1) * O
             + P_blk @ current_V) / l_new.unsqueeze(-1)
        m = m_new
        l = l_new

        # Wait for communication to complete before reusing the buffers
        if step < n_gpus - 1:
            wait_communication()
            current_K = next_K
            current_V = next_V

    return O
```
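
The pseudocode above can be checked end-to-end in a single-process NumPy simulation: the K,V blocks "rotate" around the ring by indexing instead of network sends, and the result should match ordinary full attention (no causal mask, toy sizes):

```python
import numpy as np

# Single-process simulation of the ring schedule: each rank folds
# successive K,V blocks into its output with the online-softmax update.
rng = np.random.default_rng(0)
n, d, P = 16, 8, 4
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

Qb, Kb, Vb = (np.split(M, P) for M in (Q, K, V))
outputs = []
for rank in range(P):
    O = np.zeros((n // P, d))
    m = np.full(n // P, -np.inf)
    l = np.zeros(n // P)
    cur_K, cur_V = Kb[rank], Vb[rank]
    for step in range(P):
        S = Qb[rank] @ cur_K.T / np.sqrt(d)
        m_new = np.maximum(m, S.max(axis=-1))
        corr = np.exp(m - m_new)
        Pb = np.exp(S - m_new[:, None])
        l_new = corr * l + Pb.sum(axis=-1)
        O = (corr[:, None] * l[:, None] * O + Pb @ cur_V) / l_new[:, None]
        m, l = m_new, l_new
        # "receive from previous rank": rotate one position around the ring
        src = (rank - step - 1) % P
        cur_K, cur_V = Kb[src], Vb[src]
    outputs.append(O)
ring_O = np.concatenate(outputs)

# Reference: ordinary softmax attention over the full sequence
S = Q @ K.T / np.sqrt(d)
A = np.exp(S - S.max(axis=-1, keepdims=True))
ref_O = (A / A.sum(axis=-1, keepdims=True)) @ V

assert np.allclose(ring_O, ref_O)   # ring result == full attention
```

The online-softmax update is order-independent, so each rank can visit the blocks in whatever order the ring delivers them.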

The Key Innovation: Overlapping Communication and Computation

The magic of ring attention is that while GPU i is computing attention over the current K,V block, it is simultaneously sending that block to GPU i+1 and receiving a new block from GPU i-1.

T_{\text{step}} = \max(T_{\text{compute}}, T_{\text{communicate}})

If computation takes longer than communication (which it usually does for large blocks), the communication is completely hidden:

T_{\text{total}} = P \times T_{\text{compute}} \quad \text{(communication is free!)}

Communication Cost Analysis

Each GPU sends its local K and V blocks (each n/P × d) around the ring:

\text{Data per send} = 2 \times \frac{n}{P} \times d \times p \text{ bytes}

Total sends per GPU: P - 1 (it receives every other GPU’s data exactly once).

\text{Total communication} = (P - 1) \times 2 \times \frac{n}{P} \times d \times p

For P GPUs with NVLink bandwidth B:

T_{\text{comm}} = \frac{(P-1) \times 2 \times n \times d \times p}{P \times B}

For 8 GPUs, n = 1M, d = 128, FP16 (p = 2), NVLink at 900 GB/s:

T_{\text{comm}} = \frac{7 \times 2 \times 10^6 \times 128 \times 2}{8 \times 900 \times 10^9} \approx 0.5 \text{ ms}
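
The same arithmetic as a quick check, with the values from the example:

```python
# 8 GPUs, n = 1M tokens, d = 128, FP16 (p = 2), 900 GB/s links
P, n, d, p, B = 8, 1_000_000, 128, 2, 900e9

t_comm = (P - 1) * 2 * n * d * p / (P * B)   # seconds
print(f"{t_comm * 1e3:.2f} ms")              # ~0.5 ms per full ring pass
```

At these sizes the per-step attention compute dominates, so the communication hides entirely behind it.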

Memory Per GPU

With ring attention, each GPU stores:

  1. Local Q, K, V: 3 × (n/P) × d × p bytes
  2. KV buffer (for receiving): 2 × (n/P) × d × p bytes
  3. Output accumulator: (n/P) × d × p bytes
  4. Softmax stats: 2 × (n/P) × p bytes

Total per GPU:

M_{\text{per GPU}} \approx 6 \times \frac{n}{P} \times d \times p

For 8 GPUs, n = 1M, d = 8192 (model dim), FP16:

M_{\text{per GPU}} \approx 6 \times \frac{10^6}{8} \times 8192 \times 2 \approx 12.3 \text{ GB}

Compare to single-GPU: 6 × 10^6 × 8192 × 2 ≈ 98 GB.

Ring attention reduces per-GPU memory by a factor of P.
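
The two memory estimates above, as a quick calculation:

```python
# Per-GPU activation memory for ring attention vs. one GPU,
# using the ~6 * (n/P) * d * p estimate derived above.
n, d, p, P = 1_000_000, 8192, 2, 8

per_gpu = 6 * (n / P) * d * p   # bytes with the sequence split P ways
single = 6 * n * d * p          # bytes if one GPU held everything

print(f"{per_gpu / 1e9:.1f} GB per GPU vs {single / 1e9:.1f} GB on one GPU")
```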

DeepSpeed Ulysses: An Alternative

DeepSpeed Ulysses takes a different approach to sequence parallelism:

Instead of passing K,V around a ring, it uses all-to-all communication to redistribute the sequence:

  1. Each GPU has a chunk of the sequence with full Q, K, V
  2. All-to-all redistributes so each GPU gets all positions but only some heads
  3. Each GPU computes attention for its assigned heads
  4. All-to-all redistributes back

\text{Before all-to-all: GPU}_i \text{ has } [t_{i \cdot n/P} : t_{(i+1) \cdot n/P}] \text{ for all heads}

\text{After all-to-all: GPU}_i \text{ has all tokens for heads } [h_{i \cdot H/P} : h_{(i+1) \cdot H/P}]
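
A toy NumPy model of the redistribution (no real communication — a transpose over the GPU and head-group axes stands in for the all-to-all, and the shapes are illustrative):

```python
import numpy as np

# Before: each of P "GPUs" holds n/P tokens with all H heads.
# After:  each holds all n tokens for H/P heads.
rng = np.random.default_rng(0)
P, n, H, d = 4, 16, 8, 8
x = rng.standard_normal((P, n // P, H, d))     # (gpu, local tokens, heads, d)

blocks = x.reshape(P, n // P, P, H // P, d)    # split heads into P groups
after = blocks.transpose(2, 0, 1, 3, 4).reshape(P, n, H // P, d)

# GPU 0 now holds EVERY token for head 0 (its first assigned head)
assert np.allclose(after[0, :, 0], x.reshape(n, H, d)[:, 0])
```

Each GPU can then run ordinary (or Flash) attention over the full sequence for its slice of heads, after which a second all-to-all restores the sequence-sharded layout.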

Ring Attention vs. DeepSpeed Ulysses:

| Property | Ring Attention | DeepSpeed Ulysses |
|---|---|---|
| Communication pattern | Point-to-point (ring) | All-to-all |
| Communication volume | O(n · d) | O(n · d) |
| Overlap with compute | Yes (natural) | Harder |
| Implementation complexity | Moderate | Lower |
| Best for | Very long sequences | Moderate sequences with many heads |

Scaling Limits

The maximum context length is now bounded by total GPU memory across all nodes:

n_{\max} = \frac{P \times (M_{\text{GPU}} - M_{\text{weights}}/P' - M_{\text{overhead}})}{6 \times d \times p}

Where P' is the tensor-parallelism degree (used to shard the model weights).

With 64 H100 GPUs (80 GB each), Llama 70B:

n_{\max} \approx \frac{64 \times (80 - 140/8 - 3) \times 10^9}{6 \times 8192 \times 2} \approx 40\text{M tokens}

Theoretically, 40 million tokens of context — though attention dilution would make most of that useless.
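
Plugging in the example's figures (140 GB is the approximate FP16 weight footprint of a 70B model, 3 GB a per-GPU overhead allowance, both from the estimate above):

```python
# Max context bound: 64 H100s, TP degree 8 for the weights
P, M_gpu, M_weights, TP, M_overhead = 64, 80e9, 140e9, 8, 3e9
d, p = 8192, 2

n_max = P * (M_gpu - M_weights / TP - M_overhead) / (6 * d * p)
print(f"{n_max / 1e6:.0f}M tokens")   # ~39M with these exact figures
```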

The Practical State of Affairs

Today’s production systems typically combine:

  1. Tensor parallelism within a node (to shard the model weights)
  2. Pipeline parallelism across nodes (for model depth)
  3. Sequence parallelism / ring attention (for the long context)
  4. Flash-Attention-style fused kernels on each GPU

This stack enables the 1M+ context windows offered by frontier models — but the engineering complexity is enormous, which is why only a handful of companies can operate at this scale.


ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai