Have you ever had a long conversation with an AI where it started out brilliant — giving precise, thoughtful answers — and by message 30, it was repeating itself, contradicting earlier statements, or ignoring your instructions?
You’re not imagining it. And it’s not because the AI is “getting tired.”
Research has demonstrated this empirically: when scientists tested 18 frontier models (including GPT-4, Claude 3, Gemini, and Llama), every single one showed measurable quality degradation as context length increased. No exceptions.
Let’s understand why.
Think of the AI’s attention like a teacher in a classroom. With 5 students, the teacher can give each student significant individual attention — about 20% each.

With 30 students, each student gets only about 3% of the teacher's attention.

With 200 students, each gets just 0.5% — barely a glance.
Each token in the context window is like a student demanding the AI’s attention. The more tokens there are, the less attention each one gets. This isn’t about the AI being lazy — it’s a mathematical constraint of how transformer models work.
Inside every transformer model, attention weights are computed using the softmax function:

$$a_i = \mathrm{softmax}(s)_i = \frac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}, \qquad s_i = \frac{q \cdot k_i}{\sqrt{d}}$$

Where:

- $a_i$ is the attention weight given to token $i$
- $s_i$ is token $i$'s relevance score (the scaled dot product of the query $q$ with that token's key $k_i$)
- $d$ is the key dimension, and $n$ is the number of tokens in the context

The crucial property of softmax is that all weights must sum to 1:

$$\sum_{i=1}^{n} a_i = 1$$

This means attention is a zero-sum game. If a new token gains attention, some other token must lose attention. As $n$ grows, even the most relevant token’s attention weight decreases, because the denominator keeps growing.
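The zero-sum property is easy to check numerically. A minimal sketch (the scores here are made-up illustrations, not taken from a real model):

```python
import numpy as np

def softmax(scores):
    """Convert raw relevance scores into attention weights."""
    exp = np.exp(scores - np.max(scores))  # shift for numerical stability
    return exp / exp.sum()

# Three tokens: one highly relevant (score 5.0), two background tokens
weights = softmax(np.array([5.0, 1.0, 1.0]))
print(weights.sum())  # the weights always sum to 1: a fixed attention budget

# Append two more background tokens: the relevant token's share shrinks,
# even though its own score never changed
weights_bigger = softmax(np.array([5.0, 1.0, 1.0, 1.0, 1.0]))
print(weights[0], weights_bigger[0])
```

Nothing about token 1 changed between the two calls; it lost attention purely because the denominator grew.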
Let’s say you have 3 tokens, with scores $s_1 = 5.0$ (the relevant token) and $s_2 = s_3 = 1.0$ for simplicity:

$$a_1 = \frac{e^{5}}{e^{5} + 2e^{1}} \approx 0.965$$

Token 1 gets 96.5% of the attention. Great — the model finds the relevant token easily.

Now add 997 irrelevant tokens (each with score 1.0), making $n = 1{,}000$:

$$a_1 = \frac{e^{5}}{e^{5} + 999\,e^{1}} \approx 0.052$$

Token 1’s attention dropped from 96.5% to 5.2% — simply by adding irrelevant tokens to the context.

With $n = 1{,}000{,}000$:

$$a_1 = \frac{e^{5}}{e^{5} + 999{,}999\,e^{1}} \approx 0.000055$$

Down to 0.0055%. The model can barely find the relevant token anymore.
Here’s a Python simulation that demonstrates attention dilution. The scores are used directly as final logits, matching the worked example above:

import numpy as np

def attention_weight(relevance_score, noise_score, n_total):
    """
    Compute the softmax attention weight of a single relevant token
    among (n_total - 1) identical noise tokens. Scores are used as
    final logits (any scaling by sqrt(d) is assumed already applied).
    """
    relevant = np.exp(relevance_score)
    noise = (n_total - 1) * np.exp(noise_score)
    return relevant / (relevant + noise)

# Simulate attention dilution as context grows
context_sizes = [10, 100, 1_000, 10_000, 100_000, 1_000_000]

print(f"{'Context Size':>15} {'Attention Weight':>20} {'Percent':>10}")
print("-" * 50)
for n in context_sizes:
    weight = attention_weight(
        relevance_score=5.0,  # the important token
        noise_score=1.0,      # background noise
        n_total=n,
    )
    print(f"{n:>15,} {weight:>20.6f} {weight * 100:>9.4f}%")

Output:

   Context Size     Attention Weight    Percent
--------------------------------------------------
             10             0.858486   85.8486%
            100             0.355461   35.5461%
          1,000             0.051821    5.1821%
         10,000             0.005431    0.5431%
        100,000             0.000546    0.0546%
      1,000,000             0.000055    0.0055%

With a million tokens, the model can devote only about 0.005% of its attention to the most relevant token. That’s why it “forgets” your instructions.
The attention dilution problem gets worse due to a phenomenon called “Lost in the Middle” (Liu et al., 2023). Researchers found that LLMs don’t distribute their diminished attention uniformly — they show a strong U-shaped bias:
import numpy as np

def simulate_positional_attention(n_tokens: int) -> np.ndarray:
    """
    Simulate the U-shaped attention curve:
    tokens at the start and end get more attention than the middle.
    """
    positions = np.arange(n_tokens)
    normalized = positions / (n_tokens - 1)  # 0 to 1
    # U-shaped curve: high at the edges, low in the middle,
    # using a simple quadratic model: f(x) = 4(x - 0.5)^2
    positional_bias = 4 * (normalized - 0.5) ** 2
    # Add base attention and normalize to a probability distribution
    attention = positional_bias + 0.2
    attention = attention / attention.sum()
    return attention

# Show attention at key positions for a 1000-token context
n = 1000
attention = simulate_positional_attention(n)
positions = [0, 100, 250, 500, 750, 900, 999]

print(f"{'Position':>10} {'Relative Attention':>20}")
print("-" * 35)
for pos in positions:
    print(f"{pos:>10} {attention[pos] / attention.max():>20.4f}")

This means:

- Tokens at the very start of the context (e.g., the system prompt) get disproportionately high attention (primacy bias).
- Tokens at the very end (the most recent messages) also get high attention (recency bias).
- Tokens in the middle get the least attention, which is exactly where long conversations bury important details.
As conversations grow longer, multiple degradation effects compound:

- Attention dilution: the softmax budget is split across ever more tokens.
- Positional bias: mid-context information is systematically under-attended (“lost in the middle”).
- Distractor accumulation: irrelevant tangents, stale error logs, and outdated code compete with the tokens that matter.
The combined effect is what engineers call “context rot” — a gradual decay in response quality that worsens monotonically with context length.
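To get a feel for how the compounding plays out, here is a toy sketch (my own illustration, not a measurement of any real model) that multiplies the softmax dilution from earlier by the U-shaped positional factor:

```python
import numpy as np

def combined_attention(n_tokens: int, position: int,
                       relevant_score: float = 5.0,
                       noise_score: float = 1.0) -> float:
    """
    Toy model of context rot: softmax dilution multiplied by a
    U-shaped positional penalty (worst in the middle of the context).
    """
    # Softmax dilution: one relevant token among uniform noise
    dilution = np.exp(relevant_score) / (
        np.exp(relevant_score) + (n_tokens - 1) * np.exp(noise_score)
    )
    # Positional factor in (0.16, 1.0]: lowest mid-context
    x = position / (n_tokens - 1)
    positional = (4 * (x - 0.5) ** 2 + 0.2) / 1.2
    return dilution * positional

# The same fact, placed at the start vs. buried in the middle
for n in (1_000, 100_000):
    start = combined_attention(n, position=0)
    middle = combined_attention(n, position=n // 2)
    print(f"n={n:>7,}  start={start:.6f}  middle={middle:.6f}")
```

Burying a fact mid-context in a long conversation loses attention twice: once to dilution, once to positional bias.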
Research has shown retrieval accuracy decays roughly exponentially with context length. A simple way to model the pattern:

$$\text{Accuracy}(n) \approx A_0 \, e^{-\lambda (n - n_0)}$$

Where:

- $A_0$ is the accuracy at a reference context length $n_0$
- $n$ is the current context length in tokens
- $\lambda$ is a model-specific decay rate

For a model with 80% accuracy at 128K tokens and $\lambda \approx 4.7 \times 10^{-7}$ per token, stretching the context to 1M tokens gives:

$$\text{Accuracy}(10^6) \approx 0.8 \cdot e^{-4.7 \times 10^{-7} \times 872{,}000} \approx 0.53$$

Accuracy drops from 80% to 53% — a massive degradation.
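As a sanity check, the decay curve can be traced in a few lines of Python. The exponential form and the decay constant below are assumptions back-derived from the 80%-at-128K and 53%-at-1M figures, not measured values:

```python
import numpy as np

# Illustrative exponential decay: Accuracy(n) = A0 * exp(-lam * (n - n0)),
# calibrated (as an assumption) so that 80% at 128K tokens falls to 53% at 1M.
A0, n0 = 0.80, 128_000
n1, a1 = 1_000_000, 0.53
lam = -np.log(a1 / A0) / (n1 - n0)

def accuracy(n_tokens: float) -> float:
    """Modelled retrieval accuracy at a given context length."""
    return A0 * np.exp(-lam * (n_tokens - n0))

for n in (128_000, 256_000, 512_000, 1_000_000):
    print(f"{n:>9,} tokens -> {accuracy(n):.1%}")
```

The shape matters more than the exact constant: every doubling of context chips away a roughly constant fraction of accuracy.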
The simplest fix: don’t let conversations get too long. When you notice quality degrading, start a new conversation and re-state your key context.
Due to the U-shaped attention curve, place your most critical instructions at the very beginning (system prompt) and repeat key details near the end of your context.
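A sketch of that placement strategy; `build_prompt` and its arguments are hypothetical names, not a real API:

```python
def build_prompt(system_rules: str, history: list[str], user_query: str) -> str:
    """
    Exploit the U-shaped attention curve: critical instructions go
    first (primacy) and are restated just before the query (recency).
    """
    parts = [system_rules]            # start of context: high attention
    parts.extend(history)             # middle: lowest attention, keep it lean
    parts.append(f"Reminder of the key rules:\n{system_rules}")  # end: high attention
    parts.append(user_query)
    return "\n\n".join(parts)

prompt = build_prompt(
    system_rules="Answer in JSON. Cite a source for every claim.",
    history=["...earlier turns..."],
    user_query="Summarize the findings.",
)
print(prompt)
```

The duplication costs a few tokens, but it places the rules in both high-attention regions instead of letting them sink into the middle.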
If your conversation includes irrelevant tangents, long error logs, or outdated code snippets that are no longer relevant — remove them. Every unnecessary token dilutes attention from the tokens that matter.
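A minimal pruning pass might look like the sketch below; the `stale` flag stands in for whatever relevance heuristic fits your application:

```python
def prune_context(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """
    Drop messages flagged as noise, then keep only the most recent turns.
    Each message is a dict like {"role": ..., "content": ..., "stale": bool}.
    """
    def is_stale(msg: dict) -> bool:
        # Placeholder heuristic: e.g., superseded code, resolved error logs
        return msg.get("stale", False)

    kept = [m for m in messages if not is_stale(m)]
    return kept[-max_messages:]  # recency matters: keep the latest turns

history = [
    {"role": "user", "content": "old traceback...", "stale": True},
    {"role": "assistant", "content": "fixed version of the function"},
    {"role": "user", "content": "now add tests"},
]
print(prune_context(history))  # the stale traceback is gone
```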
Instead of carrying everything in context, use a retrieval system that pulls in only the relevant information for each query. This keeps context small and focused.
# Bad: Stuff everything in context
context = entire_codebase + all_docs + full_conversation_history
# Good: Retrieve only what's relevant
relevant_code = retriever.search(user_query, top_k=5)
relevant_docs = doc_search.search(user_query, top_k=3)
recent_history = conversation[-5:] # Last 5 turns only
context = relevant_code + relevant_docs + recent_history

There’s an inherent tension in AI systems:

- More context gives the model more information to ground its answers in.

But also:

- More context dilutes attention, slows responses, and increases cost.
The optimal point isn’t “maximum context” — it’s “just enough context.” That’s the point where you have all the information the model needs, but nothing more.
Finding that optimal point is the fundamental challenge of context management.
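One way to operationalize “just enough context” is a token budget: rank candidate snippets by relevance and pack them until the budget is spent. The relevance scores below are hypothetical stand-ins for a real retriever’s output:

```python
def fit_to_budget(snippets: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """
    Greedily pack the highest-relevance snippets into a fixed token budget.
    `snippets` is a list of (relevance_score, text); tokens are approximated
    as whitespace-separated words for illustration.
    """
    chosen, used = [], 0
    for score, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = len(text.split())  # crude token estimate
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen

snippets = [
    (0.9, "def retry(fn): ..."),
    (0.2, "changelog entry from 2019 about an unrelated module"),
    (0.7, "docs: retry uses exponential backoff"),
]
print(fit_to_budget(snippets, budget_tokens=12))
```

The budget forces an explicit trade-off: low-relevance material is the first thing to go, which is exactly what attention dilution punishes.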
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai