Research shows that LLMs attend strongly to the beginning and end of their context but poorly to the middle. This is a mathematical property of attention, not a bug.
Remember writing book reports in school? You’d read the first chapter carefully (it’s the beginning!), skim through the middle, and then read the last chapter carefully (you need to know the ending!). The middle chapters? A blur.
AI models do the same thing — and researchers proved it with hard data.
In 2023, Liu et al. published a landmark paper showing that every major LLM exhibits the same pattern: strong recall at the beginning and end of the context, poor recall in the middle. They called it the “Lost in the Middle” problem, and it’s not a bug that will be patched. It’s a mathematical property of how attention works.
The researchers designed a simple test: give the model a set of 20 documents, exactly one of which contains the answer to a question, vary the position of that answer document within the context, and ask the question.
The results formed a clear U-shaped curve:
| Answer Position | Accuracy |
|---|---|
| Position 1 (start) | ~85% |
| Position 5 | ~65% |
| Position 10 (middle) | ~45% |
| Position 15 | ~60% |
| Position 20 (end) | ~80% |
The model found the answer 85% of the time when it was first, but only 45% of the time when it was in the middle. That’s almost a coin flip.
The root cause is the softmax function that normalizes attention weights:

$$\alpha_i = \frac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}$$

where $\alpha_i$ is the attention weight and $s_i$ is the raw attention score for the token at position $i$.
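To see the dilution this normalization causes, here is a minimal NumPy sketch (a toy model, not a measurement of any real LLM): one "relevant" token keeps the same raw score while the number of distractor tokens grows, and its normalized weight shrinks anyway.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# One "relevant" token (raw score 3.0) among n - 1 distractors (score 0.0).
# Its normalized weight shrinks as n grows, even though its raw score is fixed.
for n in (10, 100, 1000):
    scores = np.zeros(n)
    scores[0] = 3.0
    w = softmax(scores)
    print(f"n={n:4d}  weight of relevant token = {w[0]:.3f}")
```

The weights always sum to 1, so every extra token in the denominator takes probability mass away from the token that actually matters.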
In practice, attention scores aren’t purely content-based. Position encodings (like RoPE) add a positional signal:

$$s_i = s_i^{\text{content}} + b(i)$$

where $b(i)$ is the positional bias term.
The positional bias typically decays with distance, but in most architectures, it has a recency bias — recent tokens get a boost. Combined with the initial tokens having been through all attention layers (and thus being highly refined), this creates the U-shaped pattern.
We can measure how “spread out” or “focused” the attention distribution is using entropy:

$$H = -\sum_{i=1}^{n} \alpha_i \log_2 \alpha_i$$

Maximum entropy (uniform attention, $\alpha_i = 1/n$ for all $i$):

$$H_{\max} = \log_2 n$$
As context grows, entropy tends toward maximum, meaning attention becomes more uniform — and uniformly diluted.
We can reproduce the U-shaped curve with a toy simulation of this positional bias:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def entropy(probs):
    """Shannon entropy of a probability distribution."""
    probs = probs[probs > 0]  # Avoid log(0)
    return -np.sum(probs * np.log2(probs))

def simulate_lost_in_middle(n_documents=20, n_trials=1000):
    """
    Simulate the 'lost in the middle' effect.

    Model attention with positional bias: tokens at the start
    and end receive higher base scores.
    """
    positions = np.arange(n_documents)
    results = np.zeros(n_documents)

    for trial in range(n_trials):
        for answer_pos in range(n_documents):
            # Generate random relevance scores
            scores = np.random.randn(n_documents) * 0.5
            # The answer document has higher relevance
            scores[answer_pos] += 3.0

            # Add U-shaped positional bias:
            # higher at start (primacy) and end (recency)
            normalized_pos = positions / (n_documents - 1)  # 0 to 1
            positional_bias = 1.5 * (2 * (normalized_pos - 0.5) ** 2)
            scores += positional_bias

            # Compute attention weights
            attn = softmax(scores)

            # "Found" if the answer position has the highest attention
            if np.argmax(attn) == answer_pos:
                results[answer_pos] += 1

    accuracy = results / n_trials * 100
    return accuracy

accuracy = simulate_lost_in_middle()
print("Position  Accuracy")
print("-" * 25)
for i, acc in enumerate(accuracy):
    bar = "█" * int(acc / 2)
    print(f"  {i+1:>2}     {acc:5.1f}%  {bar}")
```

The lost-in-the-middle effect gets worse with longer contexts:
- At 4K context (20 documents × 200 tokens each)
- At 128K context (hundreds of documents)
- At 1M context (thousands of documents)
The mathematical reason is that as the context length $n$ grows, the softmax denominator $\sum_{j=1}^{n} e^{s_j}$ grows with it, so each individual attention weight shrinks and the positional bias signal becomes relatively stronger compared to the content relevance signal.
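This dilution can be checked numerically by reusing the `entropy` helper from the simulation above. In this toy sketch (not a measurement of any real model), one document carries a fixed content signal while the number of distractors grows, and the attention entropy climbs toward the uniform maximum $\log_2 n$:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    probs = probs[probs > 0]  # Avoid log(0)
    return -np.sum(probs * np.log2(probs))

rng = np.random.default_rng(0)
entropies = {}
for n in (20, 200, 2000):
    # Fixed content signal (+3.0 on one document) against n - 1 distractors.
    scores = rng.normal(0.0, 0.5, n)
    scores[0] += 3.0
    entropies[n] = entropy(softmax(scores))
    print(f"n={n:5d}  entropy = {entropies[n]:5.2f} bits  "
          f"(uniform max = {np.log2(n):5.2f})")
```

The same +3.0 relevance boost that dominates attention at 20 documents is nearly invisible at 2,000: the distribution drifts toward uniform, which is exactly what “uniformly diluted” means.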
System prompts, instructions, and key constraints should go at the very beginning of your context:
✅ GOOD:

```
[Critical instructions]
[Context document 1]
[Context document 2]
[User question]
```

❌ BAD:

```
[Context document 1]
[Context document 2]
[Critical instructions]   ← Lost in the middle!
[More context]
[User question]
```

For long contexts, repeat your most important instructions near the end:
```
[System prompt with instructions]
[Long context: documents, code, conversation history]
[Reminder: "Remember to follow the formatting rules from the system prompt"]
[User question]
```

Instead of putting 50 documents into context and hoping the model finds the right one, use retrieval to select the 3–5 most relevant documents:
```python
def smart_context_construction(
    question: str,
    retriever,
    max_docs: int = 5,
) -> str:
    """
    Use retrieval to select relevant documents instead of
    stuffing everything into context.

    The retriever is assumed to be pre-built over the document corpus.
    """
    # Retrieve only the most relevant documents
    relevant_docs = retriever.search(question, top_k=max_docs)

    # Place most relevant at start and end (U-shaped optimization)
    if len(relevant_docs) >= 3:
        # Most relevant first, second-most relevant last
        reordered = [relevant_docs[0]]
        reordered.extend(relevant_docs[2:])
        reordered.append(relevant_docs[1])
    else:
        reordered = relevant_docs

    context = "\n\n".join(
        f"Document {i+1}:\n{doc}" for i, doc in enumerate(reordered)
    )
    return context
```

Clear headers and section markers help the model’s attention “anchor” to structure:
```markdown
## Section 1: API Authentication
[content about auth]

## Section 2: Rate Limits
[content about rate limits]

## Section 3: Error Handling
[content about errors]
```

Even when a section sits in the middle of the context, its header helps the model locate it.

The lost-in-the-middle effect is universal but varies in severity:
| Model | Start Accuracy | Middle Accuracy | End Accuracy |
|---|---|---|---|
| GPT-4 | 87% | 52% | 83% |
| Claude 3 | 89% | 58% | 85% |
| Llama 3 70B | 82% | 43% | 78% |
| Gemini 1.5 | 85% | 55% | 82% |
No model is immune. The effect is a structural property of softmax attention, not a training limitation. Better training can reduce the severity but cannot eliminate it.
The lost-in-the-middle problem reveals a deep truth about AI context management: having information available is not the same as being able to use it effectively.
A model with 1M tokens of context capacity and your answer buried in the middle might perform worse than a model with 4K tokens of context that contains only the relevant information.
This is why the future of AI isn’t just bigger context windows — it’s smarter context selection.
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai