Research shows that LLMs attend strongly to the beginning and end of their context but poorly to the middle. This is a mathematical property of attention, not a bug.
Remember writing book reports in school? You’d read the first chapter carefully (it’s the beginning!), skim through the middle, and then read the last chapter carefully (you need to know the ending!). The middle chapters? A blur.
AI models do the same thing — and researchers proved it with hard data.
In 2023, Liu et al. published a landmark paper showing that every major LLM exhibits the same pattern: strong recall at the beginning and end of the context, poor recall in the middle. They called it the “Lost in the Middle” problem, and it’s not a bug that will be patched. It’s a mathematical property of how attention works.
The researchers designed a simple test: give the model a set of 20 documents, exactly one of which contains the answer to a question, vary the position of that answer document within the context, and ask the question.
The results formed a clear U-shaped curve:
| Answer Position | Accuracy |
|---|---|
| Position 1 (start) | ~85% |
| Position 5 | ~65% |
| Position 10 (middle) | ~45% |
| Position 15 | ~60% |
| Position 20 (end) | ~80% |
The model found the answer 85% of the time when it was first, but only 45% of the time when it was in the middle. That’s almost a coin flip.
The root cause is the softmax function that normalizes attention weights:

$$\alpha_i = \frac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}$$

where $\alpha_i$ is the attention weight and $s_i$ is the raw attention score for the token at position $i$.
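To see the dilution this normalization causes, here is a minimal NumPy sketch (a toy model, not a measurement of any real LLM): one "relevant" token keeps the same raw score while the number of distractor tokens grows, and its normalized weight shrinks anyway.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# One "relevant" token (raw score 3.0) among n - 1 distractors (score 0.0).
# Its normalized weight shrinks as n grows, even though its raw score is fixed.
for n in (10, 100, 1000):
    scores = np.zeros(n)
    scores[0] = 3.0
    w = softmax(scores)
    print(f"n={n:4d}  weight of relevant token = {w[0]:.3f}")
```

The weights always sum to 1, so every extra token in the denominator takes probability mass away from the token that actually matters.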
In practice, attention scores aren’t purely content-based. Position encodings (like RoPE) add a positional signal:

$$s_i = s_i^{\text{content}} + b(i)$$

where $b(i)$ is the positional bias term.
The positional bias typically decays with distance, but in most architectures, it has a recency bias — recent tokens get a boost. Combined with the initial tokens having been through all attention layers (and thus being highly refined), this creates the U-shaped pattern.
We can measure how “spread out” or “focused” the attention distribution is using entropy:

$$H = -\sum_{i=1}^{n} \alpha_i \log_2 \alpha_i$$

Maximum entropy (uniform attention, $\alpha_i = 1/n$ for all $i$):

$$H_{\max} = \log_2 n$$
As context grows, entropy tends toward maximum, meaning attention becomes more uniform — and uniformly diluted.
We can reproduce the U-shaped curve with a toy simulation of this positional bias:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def entropy(probs):
    """Shannon entropy of a probability distribution."""
    probs = probs[probs > 0]  # Avoid log(0)
    return -np.sum(probs * np.log2(probs))

def simulate_lost_in_middle(n_documents=20, n_trials=1000):
    """
    Simulate the 'lost in the middle' effect.

    Model attention with positional bias: tokens at the start
    and end receive higher base scores.
    """
    positions = np.arange(n_documents)
    results = np.zeros(n_documents)

    for trial in range(n_trials):
        for answer_pos in range(n_documents):
            # Generate random relevance scores
            scores = np.random.randn(n_documents) * 0.5
            # The answer document has higher relevance
            scores[answer_pos] += 3.0

            # Add U-shaped positional bias:
            # higher at start (primacy) and end (recency)
            normalized_pos = positions / (n_documents - 1)  # 0 to 1
            positional_bias = 1.5 * (2 * (normalized_pos - 0.5) ** 2)
            scores += positional_bias

            # Compute attention weights
            attn = softmax(scores)

            # "Found" if the answer position has the highest attention
            if np.argmax(attn) == answer_pos:
                results[answer_pos] += 1

    accuracy = results / n_trials * 100
    return accuracy

accuracy = simulate_lost_in_middle()
print("Position  Accuracy")
print("-" * 25)
for i, acc in enumerate(accuracy):
    bar = "█" * int(acc / 2)
    print(f"  {i+1:>2}     {acc:5.1f}%  {bar}")
```

The lost-in-the-middle effect gets worse with longer contexts:
- At 4K context (20 documents × 200 tokens each)
- At 128K context (hundreds of documents)
- At 1M context (thousands of documents)
The mathematical reason is that as the context length $n$ grows, the softmax denominator $\sum_{j=1}^{n} e^{s_j}$ grows with it, so each individual attention weight shrinks and the positional bias signal becomes relatively stronger compared to the content relevance signal.
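This dilution can be checked numerically by reusing the `entropy` helper from the simulation above. In this toy sketch (not a measurement of any real model), one document carries a fixed content signal while the number of distractors grows, and the attention entropy climbs toward the uniform maximum $\log_2 n$:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    probs = probs[probs > 0]  # Avoid log(0)
    return -np.sum(probs * np.log2(probs))

rng = np.random.default_rng(0)
entropies = {}
for n in (20, 200, 2000):
    # Fixed content signal (+3.0 on one document) against n - 1 distractors.
    scores = rng.normal(0.0, 0.5, n)
    scores[0] += 3.0
    entropies[n] = entropy(softmax(scores))
    print(f"n={n:5d}  entropy = {entropies[n]:5.2f} bits  "
          f"(uniform max = {np.log2(n):5.2f})")
```

The same +3.0 relevance boost that dominates attention at 20 documents is nearly invisible at 2,000: the distribution drifts toward uniform, which is exactly what “uniformly diluted” means.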
System prompts, instructions, and key constraints should go at the very beginning of your context:
✅ GOOD:

```
[Critical instructions]
[Context document 1]
[Context document 2]
[User question]
```

❌ BAD:

```
[Context document 1]
[Context document 2]
[Critical instructions]   ← Lost in the middle!
[More context]
[User question]
```

For long contexts, repeat your most important instructions near the end:
```
[System prompt with instructions]
[Long context: documents, code, conversation history]
[Reminder: "Remember to follow the formatting rules from the system prompt"]
[User question]
```

Instead of putting 50 documents into context and hoping the model finds the right one, use retrieval to select the 3–5 most relevant documents:
```python
def smart_context_construction(
    question: str,
    retriever,
    max_docs: int = 5,
) -> str:
    """
    Use retrieval to select relevant documents instead of
    stuffing everything into context.

    The retriever is assumed to be pre-built over the document corpus.
    """
    # Retrieve only the most relevant documents
    relevant_docs = retriever.search(question, top_k=max_docs)

    # Place most relevant at start and end (U-shaped optimization)
    if len(relevant_docs) >= 3:
        # Most relevant first, second-most relevant last
        reordered = [relevant_docs[0]]
        reordered.extend(relevant_docs[2:])
        reordered.append(relevant_docs[1])
    else:
        reordered = relevant_docs

    context = "\n\n".join(
        f"Document {i+1}:\n{doc}" for i, doc in enumerate(reordered)
    )
    return context
```

Clear headers and section markers help the model’s attention “anchor” to structure:
```markdown
## Section 1: API Authentication
[content about auth]

## Section 2: Rate Limits
[content about rate limits]

## Section 3: Error Handling
[content about errors]
```

Even when a section sits in the middle of the context, its header helps the model locate it.

The lost-in-the-middle effect is universal but varies in severity:
| Model | Start Accuracy | Middle Accuracy | End Accuracy |
|---|---|---|---|
| GPT-4 | 87% | 52% | 83% |
| Claude 3 | 89% | 58% | 85% |
| Llama 3 70B | 82% | 43% | 78% |
| Gemini 1.5 | 85% | 55% | 82% |
No model is immune. The effect is a structural property of softmax attention, not a training limitation. Better training can reduce the severity but cannot eliminate it.
The lost-in-the-middle problem reveals a deep truth about AI context management: having information available is not the same as being able to use it effectively.
A model with 1M tokens of context capacity and your answer buried in the middle might perform worse than a model with 4K tokens of context that contains only the relevant information.
This is why the future of AI isn’t just bigger context windows — it’s smarter context selection.
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai