Benchmarking Context Windows: What Each Test Actually Measures
The School Exam Analogy
Imagine testing a student’s knowledge with three different exams:
- Exam 1 (NIAH): Hide one fact in a textbook. Ask the student to find it. Tests: “Can you find a specific thing?”
- Exam 2 (RULER): Ask the student to find, count, sort, and cross-reference multiple facts. Tests: “Can you work with information?”
- Exam 3 (MRCR): Hide multiple related facts and ask the student to connect them. Tests: “Can you reason across scattered information?”
A student who passes Exam 1 might fail Exams 2 and 3. Similarly, a model that aces NIAH might struggle with real-world long-context tasks.
NIAH: Needle in a Haystack
What It Is
The simplest long-context benchmark. Insert a specific fact (“needle”) into a large body of irrelevant text (“haystack”) and ask the model to retrieve it.
Example:
[10,000 tokens of random essays about geography]
The secret password for the treasure is: GOLDEN ELEPHANT
[90,000 tokens of random essays about history]
Question: What is the secret password for the treasure?
What It Measures
- Basic retrieval capability at various context lengths
- Positional bias (does accuracy depend on where the needle is placed?)
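Positional bias becomes visible once per-test outcomes are grouped by needle position. The following is a minimal sketch, assuming results are stored as a dictionary keyed by (context length, needle position) with a boolean outcome. The function name `accuracy_by_position` and the sample results are illustrative, not measurements.

```python
from collections import defaultdict


def accuracy_by_position(results: dict[tuple[int, float], bool]) -> dict[float, float]:
    """Average correctness over all context lengths for each needle position."""
    hits: dict[float, list[bool]] = defaultdict(list)
    for (_length, position), correct in results.items():
        hits[position].append(correct)
    return {pos: sum(vals) / len(vals) for pos, vals in sorted(hits.items())}


# Made-up outcomes for illustration only: (context_length, position) -> correct?
example_results = {
    (32_000, 0.0): True, (32_000, 0.5): True, (32_000, 1.0): True,
    (128_000, 0.0): True, (128_000, 0.5): False, (128_000, 1.0): True,
}

by_position = accuracy_by_position(example_results)
for pos, acc in by_position.items():
    print(f"position {pos:.0%}: accuracy {acc:.0%}")
```

A dip in the middle positions relative to the start and end would suggest the model attends less reliably to mid-context content.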
Implementation
import random


def create_niah_test(
    needle: str,
    haystack_tokens: int = 100_000,
    needle_position: float = 0.5,  # 0.0 = start, 1.0 = end
) -> dict:
    """
    Create a Needle in a Haystack test case.

    Args:
        needle: The fact to hide
        haystack_tokens: Approximate total context length in tokens
        needle_position: Where to place the needle (0-1)
    """
    # Generate haystack text (repeated filler sentences).
    # In practice, use real documents (Paul Graham essays, Wikipedia, etc.)
    # Rough heuristic: about 0.75 English words per token.
    words_needed = int(haystack_tokens * 0.75)
    filler_topics = [
        "The history of maritime navigation spans thousands of years.",
        "Ancient civilizations developed sophisticated irrigation systems.",
        "The study of celestial bodies has fascinated humanity.",
        "Philosophical debates about consciousness continue today.",
    ]
    words = []
    while len(words) < words_needed:
        words.extend(random.choice(filler_topics).split())
    words = words[:words_needed]

    # Insert the needle at the specified relative position
    insert_idx = int(len(words) * needle_position)
    words.insert(insert_idx, needle)

    # Turn "The secret code for X is: VALUE" into "what is the secret code for X?"
    subject = needle.split(":")[0].strip().removesuffix(" is")
    subject = subject[:1].lower() + subject[1:]
    return {
        "context": " ".join(words),
        "needle": needle,
        "position": needle_position,
        "total_tokens": haystack_tokens,
        "question": f"Based on the context above, what is {subject}?",
    }
def run_niah_suite(model_fn, context_lengths, positions):
    """
    Run a full NIAH evaluation across lengths and positions.
    Returns a dict mapping (context_length, position) to True/False.
    """
    needle = "The secret code for Project Alpha is: CRIMSON-7749"
    question = "What is the secret code for Project Alpha?"
    results = {}
    for length in context_lengths:
        for pos in positions:
            test = create_niah_test(needle, length, pos)
            # Call model
            response = model_fn(test["context"] + f"\n\nQuestion: {question}")
            # Check if needle content is in response
            correct = "CRIMSON-7749" in response
            results[(length, pos)] = correct

    # Display results as a grid: rows = context lengths, columns = positions
    print(f"{'':>11}", end="")
    for pos in positions:
        print(f"{pos:>8.1%}", end="")
    print()
    for length in context_lengths:
        print(f"{length:>10,}t", end="")
        for pos in positions:
            result = "✓" if results.get((length, pos)) else "✗"
            print(f"{result:>8}", end="")
        print()
    return results
What It Misses
NIAH is the easiest possible long-context test. It requires only finding one exact string, which means:
- No reasoning or synthesis
- No handling of contradictory information
- No multi-step retrieval
A model can score 100% on NIAH and still fail at real-world tasks.
RULER: Beyond Simple Retrieval
What It Is
RULER (Hsieh et al., 2024) extends NIAH with four categories of tasks: retrieval, aggregation, multi-hop tracing, and question answering. The retrieval tasks are NIAH variants:
1. Single NIAH: Standard needle retrieval (baseline)
2. Multi-Key NIAH: Find the right needle among several similar distractor needles
3. Multi-Value NIAH: Find needles with multiple associated values
4. Multi-Query NIAH: Answer multiple questions about different needles
Plus aggregation tasks:
5. Common Words: Find the most common words across a long text
6. Frequent Words: Return the most frequent words when word frequencies are heavily skewed
And tracking tasks:
7. Variable Tracking: Track variable assignments through code-like text
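To make the Variable Tracking task concrete, here is a small generator sketch. It is not RULER's actual generator: the function name `make_variable_tracking_case`, the `VAR_i` naming, and the single repeated noise sentence are all assumptions chosen for brevity.

```python
import random


def make_variable_tracking_case(
    chain_length: int = 4, n_noise_lines: int = 20, seed: int = 0
) -> dict:
    """Build a chain VAR_0 = value, VAR_1 = VAR_0, ... scattered through noise."""
    rng = random.Random(seed)
    value = rng.randint(10_000, 99_999)
    names = [f"VAR_{i}" for i in range(chain_length)]

    # The first variable gets the literal value; each later one copies the previous.
    statements = [f"{names[0]} = {value}"]
    statements += [f"{names[i]} = {names[i - 1]}" for i in range(1, chain_length)]

    lines = ["The weather report mentioned light rain in the afternoon."] * n_noise_lines
    # Pick distinct slots in the noise; sorting keeps the assignments in order.
    slots = sorted(rng.sample(range(n_noise_lines + 1), chain_length))
    for offset, (slot, stmt) in enumerate(zip(slots, statements)):
        lines.insert(slot + offset, stmt)  # offset accounts for earlier inserts

    return {
        "context": "\n".join(lines),
        "question": f"Which variables hold the value {value}?",
        "expected": names,
    }


case = make_variable_tracking_case(chain_length=4, n_noise_lines=10, seed=1)
print(case["question"])
```

Answering correctly requires following the whole chain, so a model that only retrieves the line containing the literal value will name one variable and miss the rest.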
Scoring
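RULER uses a composite score across all tasks, weighted by difficulty:
$$\text{RULER score} = \frac{\sum_i w_i \, s_i}{\sum_i w_i}$$
Where $s_i$ is the score on task $i$ and $w_i$ are task weights (harder tasks weighted more).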
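The weighted composite described above can be sketched in a few lines. The task names, scores, and weights here are made-up placeholders, not RULER's actual values, and `composite_score` is a hypothetical helper that normalizes by the total weight.

```python
def composite_score(task_scores: dict[str, float], task_weights: dict[str, float]) -> float:
    """Weighted average of per-task scores; harder tasks can carry larger weights."""
    total_weight = sum(task_weights[task] for task in task_scores)
    if total_weight == 0:
        raise ValueError("task weights must not sum to zero")
    weighted_sum = sum(score * task_weights[task] for task, score in task_scores.items())
    return weighted_sum / total_weight


# Illustrative numbers only
example_scores = {"single_niah": 100.0, "multi_key_niah": 80.0, "variable_tracking": 60.0}
example_weights = {"single_niah": 1.0, "multi_key_niah": 2.0, "variable_tracking": 3.0}

overall = composite_score(example_scores, example_weights)
print(f"Composite score: {overall:.1f}")
```

With weights like these, a perfect score on the easiest task does little to offset weaker results on the harder ones.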
Results
| Model | RULER at 128K | RULER at 512K | RULER at 1M |
|---|---|---|---|
| GPT-4.1 | 84.7 | 62.3 | 37.1 |
| Claude 3.5 | 87.2 | 71.4 | 52.3 |
| Gemini 1.5 Pro | 82.1 | 65.8 | 48.6 |
| Llama 3.1 70B | 79.3 | 52.1 | 28.4 |
Key observation: all models degrade significantly beyond 128K. The gap between NIAH (near-100%) and RULER (50-85%) reveals the difference between “can find a needle” and “can work with long context.”
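One way to read the table is to express each score as a fraction of the model's own 128K score. The sketch below copies the numbers from the table above; the helper name `retention` is an assumption made for this example.

```python
# Scores copied from the results table above
ruler_scores = {
    "GPT-4.1": {"128K": 84.7, "512K": 62.3, "1M": 37.1},
    "Claude 3.5": {"128K": 87.2, "512K": 71.4, "1M": 52.3},
    "Gemini 1.5 Pro": {"128K": 82.1, "512K": 65.8, "1M": 48.6},
    "Llama 3.1 70B": {"128K": 79.3, "512K": 52.1, "1M": 28.4},
}


def retention(scores_by_length: dict[str, float], baseline: str = "128K") -> dict[str, float]:
    """Express each score as a fraction of the score at the baseline length."""
    base = scores_by_length[baseline]
    return {length: score / base for length, score in scores_by_length.items()}


retained = {model: retention(scores) for model, scores in ruler_scores.items()}
for model, r in retained.items():
    print(f"{model:<15} 512K: {r['512K']:.0%} of 128K score   1M: {r['1M']:.0%} of 128K score")
```

Comparing retention rather than raw scores separates "starts strong" from "degrades slowly", which are different properties.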
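MRCR: Multi-Round Co-reference Resolution
What It Is
MRCR (2024) stands for Multi-Round Co-reference Resolution. The original benchmark hides several near-identical requests in a long multi-turn conversation and asks the model to reproduce the response to a specific one. The example below is a simplified multi-fact variant in the same spirit: it tests whether a model can retrieve and reason across multiple pieces of scattered information: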
[Context: 50K tokens of project documentation]
Scattered throughout:
- "The API rate limit is 1000 requests per minute" (position 12%)
- "Each request consumes 50 compute units" (position 45%)
- "The monthly compute budget is 5,000,000 units" (position 78%)
Question: "How many requests can we make per month
before exceeding the compute budget?"
Required reasoning: 5,000,000 / 50 = 100,000 requests/month
At 1000/min: 100 minutes of max throughput
This requires:
- Finding all three facts (multi-needle retrieval)
- Connecting them logically (reasoning)
- Computing the answer (arithmetic)
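Because these are separate skills, it can help to score retrieval and reasoning separately. The sketch below is one possible approach, not part of any official MRCR harness: `extract_numbers` and `score_multi_fact_response` are hypothetical helpers, and matching on whole numbers is a deliberate choice so that "1000" is not counted as found just because "100,000" appears.

```python
import re


def extract_numbers(text: str) -> set[str]:
    """Return every number in the text with thousands separators removed."""
    return {m.replace(",", "") for m in re.findall(r"\d[\d,]*(?:\.\d+)?", text)}


def score_multi_fact_response(
    response: str, fact_values: list[str], expected_answer: str
) -> dict:
    """Score how many fact values were cited and whether the final answer appears."""
    found = extract_numbers(response)
    cited = [value.replace(",", "") in found for value in fact_values]
    return {
        "retrieval": sum(cited) / len(cited),
        "reasoning": expected_answer.replace(",", "") in found,
    }


# Made-up model response for illustration
response = (
    "The budget is 5,000,000 units and each request costs 50 units, "
    "so 5,000,000 / 50 = 100,000 requests per month."
)
result = score_multi_fact_response(
    response, fact_values=["1000", "50", "5,000,000"], expected_answer="100,000"
)
print(result)
```

A response can reach the right answer while citing only some of the facts, or cite every fact and still compute the wrong result, so the two numbers are worth tracking independently.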
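def generate_filler_text(n_tokens: int) -> str:
    """Placeholder filler: repeat one sentence up to roughly n_tokens tokens."""
    sentence_words = "The committee reviewed the quarterly report without further comment.".split()
    n_words = int(n_tokens * 0.75)  # rough words-per-token heuristic
    repeats = n_words // len(sentence_words) + 1
    return " ".join((sentence_words * repeats)[:n_words])


def create_mrcr_test(
    facts: list[tuple[str, float]],  # (fact_text, position)
    question: str,
    answer: str,
    context_tokens: int = 100_000,
) -> dict:
    """Create an MRCR test case with multiple scattered facts."""
    # Generate filler text
    filler = generate_filler_text(context_tokens)
    words = filler.split()
    # Insert facts from last to first so earlier insert positions are not shifted
    for fact, position in sorted(facts, key=lambda x: x[1], reverse=True):
        insert_idx = int(len(words) * position)
        # Wrap fact to blend in
        wrapped = f"Important note: {fact}"
        words.insert(insert_idx, wrapped)
    return {
        "context": " ".join(words),
        "question": question,
        "expected_answer": answer,
        "n_facts": len(facts),
        "fact_positions": [pos for _, pos in facts],
    }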
def evaluate_mrcr(model_fn, test_cases):
    """Evaluate model on MRCR test suite. Returns accuracy as a percentage."""
    results = {"reasoning": 0, "total": 0}
    for test in test_cases:
        response = model_fn(
            test["context"] + f"\n\nQuestion: {test['question']}"
        )
        # Check if the expected answer appears in the response
        correct = test["expected_answer"].lower() in response.lower()
        results["total"] += 1
        if correct:
            results["reasoning"] += 1
    if results["total"] == 0:
        raise ValueError("test_cases must not be empty")
    accuracy = results["reasoning"] / results["total"] * 100
    print(f"MRCR Accuracy: {accuracy:.1f}% ({results['reasoning']}/{results['total']})")
    return accuracy
What Benchmarks Miss: Context Rot
All three benchmarks are built around retrieval: can the model find information? Real-world long-context usage also depends on sustained quality: do the model’s reasoning, instruction following, and coherence degrade as the context grows?
Context rot manifests as:
- Forgetting earlier instructions
- Contradicting previous statements
- Repeating itself
- Decreasing code quality
- Ignoring formatting requirements
No current benchmark adequately measures context rot because it’s a gradual, multi-dimensional degradation rather than a binary pass/fail.
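A single dimension of context rot can still be probed roughly. The sketch below tracks one formatting rule across the turns of a long session and compares early and late adherence. It is only an illustration of the idea, not a benchmark: `adherence_by_segment`, the "ANSWER:" rule, and the sample responses are all made up.

```python
from typing import Callable


def adherence_by_segment(
    responses: list[str], follows_rule: Callable[[str], bool], n_segments: int = 2
) -> list[float]:
    """Split responses into consecutive segments and return the pass rate of each."""
    if not responses:
        raise ValueError("responses must not be empty")
    n_segments = min(n_segments, len(responses))
    total = len(responses)
    rates = []
    for s in range(n_segments):
        # Integer boundaries guarantee every segment has at least one response
        chunk = responses[s * total // n_segments : (s + 1) * total // n_segments]
        rates.append(sum(follows_rule(r) for r in chunk) / len(chunk))
    return rates


# Made-up session: the instruction was "start every reply with 'ANSWER:'"
session = [
    "ANSWER: 4", "ANSWER: Paris", "ANSWER: blue",
    "ANSWER: 12", "The result is 7", "It is probably green",
]
rates = adherence_by_segment(session, lambda r: r.startswith("ANSWER:"))
print(f"early adherence: {rates[0]:.0%}, late adherence: {rates[1]:.0%}")
```

A falling pass rate in later segments is one observable symptom; other dimensions such as contradiction or repetition would each need their own check.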
Building Your Own Evaluation
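CONTEXT_LENGTHS = [4_000, 32_000, 128_000]


def prepare_context(documents, target_tokens):
    """Placeholder: join documents, then repeat or truncate to roughly target_tokens."""
    words = " ".join(documents).split()
    if not words:
        raise ValueError("your_documents must contain some text")
    target_words = int(target_tokens * 0.75)  # rough words-per-token heuristic
    repeats = -(-target_words // len(words))  # ceiling division
    return " ".join((words * repeats)[:target_words])


def evaluate_response(response, expected):
    """Placeholder scorer: 1.0 if the expected answer appears in the response."""
    return 1.0 if expected.lower() in response.lower() else 0.0


def custom_context_evaluation(model_fn, your_documents, your_questions):
    """
    Build a task-specific context evaluation.
    Key: test with YOUR data and YOUR tasks, not synthetic benchmarks.
    """
    results = []
    for ctx_len in CONTEXT_LENGTHS:
        # Pad or truncate documents to target length
        context = prepare_context(your_documents, ctx_len)
        for question, expected in your_questions:
            response = model_fn(context + f"\n\n{question}")
            score = evaluate_response(response, expected)
            results.append({
                "context_length": ctx_len,
                "question": question,
                "score": score,
            })
    # Analyze degradation
    for ctx_len in CONTEXT_LENGTHS:
        scores = [r["score"] for r in results if r["context_length"] == ctx_len]
        if not scores:
            continue  # no questions were supplied
        avg = sum(scores) / len(scores)
        print(f"Context {ctx_len:>8,}: avg score = {avg:.2f}")
    return results
The Bottom Line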
| Benchmark | Tests | Difficulty | Real-World Relevance |
|---|---|---|---|
| NIAH | Single fact retrieval | Easy | Low |
| RULER | Multi-fact retrieval + aggregation | Medium | Medium |
| MRCR | Multi-fact reasoning | Hard | Medium-High |
| Custom eval | Your actual use case | Varies | Highest |
Don’t trust a model’s context window claims based on NIAH alone. A model that finds a needle at 1M tokens may still fail at reasoning over 50K tokens of real code.
The best benchmark is always your own data, your own tasks, at your target context length.