Needle in a Haystack tests one thing. RULER tests another. MRCR tests yet another. Here's what each benchmark actually measures and what it misses.
Imagine testing a student’s knowledge with three different exams:

1. Find one highlighted sentence in a textbook.
2. Find several scattered facts and tally them up.
3. Combine scattered facts into a multi-step calculation.

A student who passes Exam 1 might fail Exams 2 and 3. Similarly, a model that aces NIAH might struggle with real-world long-context tasks.
Needle in a Haystack (NIAH) is the simplest long-context benchmark: insert a specific fact (the “needle”) into a large body of irrelevant text (the “haystack”) and ask the model to retrieve it.
Example:

```
[10,000 tokens of random essays about geography]

The secret password for the treasure is: GOLDEN ELEPHANT

[90,000 tokens of random essays about history]

Question: What is the secret password for the treasure?
```

```python
import random


def create_niah_test(
    needle: str,
    haystack_tokens: int = 100_000,
    needle_position: float = 0.5,  # 0.0 = start, 1.0 = end
) -> dict:
    """
    Create a Needle in a Haystack test case.

    Args:
        needle: The fact to hide
        haystack_tokens: Total context length in tokens
        needle_position: Where to place the needle (0-1)
    """
    # Generate haystack text (random coherent-looking text).
    # In practice, use real documents (Paul Graham essays, Wikipedia, etc.)
    # ~0.75 words per token is a rough heuristic for English prose.
    words_needed = int(haystack_tokens * 0.75)
    filler_topics = [
        "The history of maritime navigation spans thousands of years.",
        "Ancient civilizations developed sophisticated irrigation systems.",
        "The study of celestial bodies has fascinated humanity.",
        "Philosophical debates about consciousness continue today.",
    ]
    haystack_words = []
    while len(haystack_words) < words_needed:
        haystack_words.extend(random.choice(filler_topics).split())

    # Insert the needle at the specified position
    words = haystack_words[:words_needed]
    insert_idx = int(len(words) * needle_position)
    words.insert(insert_idx, needle)

    return {
        "context": " ".join(words),
        "needle": needle,
        "position": needle_position,
        "total_tokens": haystack_tokens,
        "question": f"Based on the context above, {needle.split(':')[0].lower().strip()}?",
    }
```
```python
def run_niah_suite(model_fn, context_lengths, positions):
    """
    Run a full NIAH evaluation across lengths and positions.

    Returns a dict mapping (context_length, position) -> bool.
    """
    needle = "The secret code for Project Alpha is: CRIMSON-7749"
    question = "What is the secret code for Project Alpha?"

    results = {}
    for length in context_lengths:
        for pos in positions:
            test = create_niah_test(needle, length, pos)
            # Call the model
            response = model_fn(test["context"] + f"\n\nQuestion: {question}")
            # Check whether the needle content appears in the response
            results[(length, pos)] = "CRIMSON-7749" in response

    # Display results as a positions × context_lengths grid
    print(f"{'':>12}", end="")
    for pos in positions:
        print(f"{pos:>8.1%}", end="")
    print()
    for length in context_lengths:
        print(f"{length:>10,}t", end="")
        for pos in positions:
            result = "✓" if results.get((length, pos)) else "✗"
            print(f"{result:>8}", end="")
        print()

    return results
```

NIAH is the easiest possible long-context test. It only requires:

- locating a single, verbatim fact,
- with no aggregation across multiple facts,
- and no reasoning beyond copying the answer out.
A model can score 100% on NIAH and still fail at real-world tasks.
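To see how low the bar is, here is a self-contained miniature NIAH that a trivial regex “model” passes perfectly. The vault-code needle and the digit-scanning stand-in model are illustrative assumptions, not part of any published benchmark:

```python
import re


def tiny_niah(model_fn, needle="The vault code is 4417", filler_words=2_000, position=0.5):
    """Single-needle retrieval check; True if the model surfaces the code."""
    words = ("lorem ipsum dolor " * (filler_words // 3 + 1)).split()[:filler_words]
    words.insert(int(len(words) * position), needle)
    prompt = " ".join(words) + "\n\nQuestion: What is the vault code?"
    return "4417" in model_fn(prompt)


# A "model" that just greps for digits scores 100% -- no understanding required
digit_scanner = lambda prompt: " ".join(re.findall(r"\d+", prompt))
print(tiny_niah(digit_scanner))  # → True
```

If a regex can pass your benchmark, the benchmark is measuring search, not comprehension.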
RULER (Hsieh et al., 2024) extends NIAH with four categories of tasks:

Retrieval tasks:
1. Single NIAH: the classic single-needle setup
2. Multi-key NIAH: several needles, retrieve the right one among distractors
3. Multi-value NIAH: one key maps to several values, retrieve all of them
4. Multi-query NIAH: answer several needle queries over the same context

Plus aggregation tasks:
5. Common Words: find the most frequent word across a long text
6. Frequent Words: count occurrences of specific words

And tracking tasks:
7. Variable Tracking: track variable assignments through code-like text
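The tracking category can be generated programmatically. A minimal sketch (the `VAR_` naming and semicolon-joined chain are assumptions for illustration, not RULER’s exact format; real tests scatter the chain through thousands of tokens of filler):

```python
import random


def make_variable_tracking(n_hops: int = 4) -> tuple[str, str, str]:
    """Build an assignment chain like 'VAR_0 = 73311; VAR_1 = VAR_0; ...'.

    Returns (context, question, expected_answer). Resolving the final
    variable requires following every hop in the chain.
    """
    value = str(random.randint(10_000, 99_999))
    lines = [f"VAR_0 = {value}"]
    for i in range(1, n_hops):
        lines.append(f"VAR_{i} = VAR_{i - 1}")
    context = "; ".join(lines)
    question = f"What is the value of VAR_{n_hops - 1}?"
    return context, question, value


ctx, q, answer = make_variable_tracking(3)
# e.g. ctx = "VAR_0 = 48210; VAR_1 = VAR_0; VAR_2 = VAR_1"
```

Unlike plain retrieval, a substring search cannot solve this: the answer string is attached to `VAR_0`, but the question asks about the last variable in the chain.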
RULER uses a composite score across all tasks, weighted by difficulty:

$$\text{RULER} = \frac{\sum_i w_i \cdot s_i}{\sum_i w_i}$$

where $s_i$ is the accuracy on task $i$ and $w_i$ are task weights (harder tasks weighted more).
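A minimal sketch of this weighted composite. The task names and weight values below are illustrative assumptions, not RULER’s official configuration:

```python
def composite_score(task_scores: dict[str, float], task_weights: dict[str, float]) -> float:
    """Weighted average of per-task accuracies; harder tasks count more."""
    total_weight = sum(task_weights.values())
    return sum(task_weights[t] * s for t, s in task_scores.items()) / total_weight


scores = {"single_niah": 0.98, "common_words": 0.71, "variable_tracking": 0.55}
weights = {"single_niah": 1.0, "common_words": 2.0, "variable_tracking": 3.0}
print(round(composite_score(scores, weights), 3))  # → 0.675
```

Note how the weighting pulls the composite well below the near-perfect NIAH sub-score: easy retrieval can no longer mask weak aggregation and tracking.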
| Model | RULER at 128K | RULER at 512K | RULER at 1M |
|---|---|---|---|
| GPT-4.1 | 84.7 | 62.3 | 37.1 |
| Claude 3.5 | 87.2 | 71.4 | 52.3 |
| Gemini 1.5 Pro | 82.1 | 65.8 | 48.6 |
| Llama 3.1 70B | 79.3 | 52.1 | 28.4 |
Key observation: all models degrade significantly beyond 128K. The gap between NIAH (near-100%) and RULER (50-85%) reveals the difference between “can find a needle” and “can work with long context.”
MRCR (2024) tests whether models can retrieve and reason across multiple pieces of scattered information:
```
[Context: 50K tokens of project documentation]

Scattered throughout:
- "The API rate limit is 1000 requests per minute" (position 12%)
- "Each request consumes 50 compute units" (position 45%)
- "The monthly compute budget is 5,000,000 units" (position 78%)

Question: "How many requests can we make per month
before exceeding the compute budget?"

Required reasoning: 5,000,000 / 50 = 100,000 requests/month
At 1000/min: 100 minutes of max throughput
```

This requires:

- retrieving all three facts from widely separated positions,
- recognizing that they bear on the same question,
- and chaining them through multi-step arithmetic.
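The arithmetic the model must reproduce can be checked directly:

```python
# Values from the example above
rate_limit_per_min = 1_000
units_per_request = 50
monthly_budget_units = 5_000_000

max_requests = monthly_budget_units // units_per_request   # 100_000 requests/month
minutes_at_full_rate = max_requests // rate_limit_per_min  # 100 minutes at max throughput
print(max_requests, minutes_at_full_rate)  # → 100000 100
```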
```python
def create_mrcr_test(
    facts: list[tuple[str, float]],  # (fact_text, position in 0-1)
    question: str,
    answer: str,
    context_tokens: int = 100_000,
) -> dict:
    """Create an MRCR test case with multiple scattered facts."""
    # Generate filler text (generate_filler_text is a helper you supply,
    # e.g. sampling paragraphs from real documents)
    filler = generate_filler_text(context_tokens)
    words = filler.split()

    # Insert facts back-to-front so earlier insertions don't shift
    # the indices of later ones
    for fact, position in sorted(facts, key=lambda x: x[1], reverse=True):
        insert_idx = int(len(words) * position)
        # Wrap the fact so it blends into the surrounding text
        words.insert(insert_idx, f"Important note: {fact}")

    return {
        "context": " ".join(words),
        "question": question,
        "expected_answer": answer,
        "n_facts": len(facts),
        "fact_positions": [pos for _, pos in facts],
    }


def evaluate_mrcr(model_fn, test_cases):
    """Evaluate a model on an MRCR test suite."""
    results = {"correct": 0, "total": 0}
    for test in test_cases:
        response = model_fn(
            test["context"] + f"\n\nQuestion: {test['question']}"
        )
        # Substring check; for numeric answers, normalize separators
        # ("100,000" vs "100000") before comparing
        if test["expected_answer"].lower() in response.lower():
            results["correct"] += 1
        results["total"] += 1

    accuracy = results["correct"] / results["total"] * 100
    print(f"MRCR Accuracy: {accuracy:.1f}% ({results['correct']}/{results['total']})")
    return accuracy
```

All three benchmarks test retrieval — can the model find information? But real-world long-context usage involves sustained quality — does the model’s reasoning, instruction following, and coherence degrade over time?
Context rot manifests as:

- gradually rising hallucination rates as the context fills up
- drift away from instructions given early in the conversation
- weaker reasoning over facts buried in the middle of the context
- loss of coherence across long multi-turn sessions

No current benchmark adequately measures context rot because it’s a gradual, multi-dimensional degradation rather than a binary pass/fail.
```python
def custom_context_evaluation(model_fn, your_documents, your_questions):
    """
    Build a task-specific context evaluation.

    Key: test with YOUR data and YOUR tasks, not synthetic benchmarks.

    Assumes two helpers you supply: prepare_context (pads or truncates
    your documents to a target token count) and evaluate_response
    (scores a response against the expected answer, e.g. 0-1).
    """
    results = []
    for ctx_len in [4_000, 32_000, 128_000]:
        for question, expected in your_questions:
            # Pad or truncate documents to the target length
            context = prepare_context(your_documents, ctx_len)
            response = model_fn(context + f"\n\n{question}")
            score = evaluate_response(response, expected)
            results.append({
                "context_length": ctx_len,
                "question": question,
                "score": score,
            })

    # Analyze degradation as context grows
    for ctx_len in [4_000, 32_000, 128_000]:
        scores = [r["score"] for r in results if r["context_length"] == ctx_len]
        avg = sum(scores) / len(scores)
        print(f"Context {ctx_len:>8,}: avg score = {avg:.2f}")

    return results
```

| Benchmark | Tests | Difficulty | Real-World Relevance |
|---|---|---|---|
| NIAH | Single fact retrieval | Easy | Low |
| RULER | Multi-fact retrieval + aggregation | Medium | Medium |
| MRCR | Multi-fact reasoning | Hard | Medium-High |
| Custom eval | Your actual use case | Varies | Highest |
Don’t trust a model’s context window claims based on NIAH alone. A model that finds a needle at 1M tokens may still fail at reasoning over 50K tokens of real code.
The best benchmark is always your own data, your own tasks, at your target context length.
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai