Benchmarking Context Windows: NIAH, RULER, MRCR, and What They Actually Measure

Needle in a Haystack tests one thing. RULER tests another. MRCR tests yet another. Here's what each benchmark actually measures and what it misses.


The School Exam Analogy

Imagine testing a student's knowledge with three different exams:

  1. Exam 1: Find one highlighted fact in an open textbook.
  2. Exam 2: Find several facts scattered across the book and tally them up.
  3. Exam 3: Find several scattered facts and combine them to solve a new problem.

A student who passes Exam 1 might fail Exams 2 and 3. Similarly, a model that aces NIAH might struggle with real-world long-context tasks.

NIAH: Needle in a Haystack

What It Is

The simplest long-context benchmark. Insert a specific fact (“needle”) into a large body of irrelevant text (“haystack”) and ask the model to retrieve it.

Example:

[10,000 tokens of random essays about geography]
The secret password for the treasure is: GOLDEN ELEPHANT
[90,000 tokens of random essays about history]

Question: What is the secret password for the treasure?

What It Measures

NIAH measures a single narrow capability: whether the model can locate and repeat one distinctive fact, across varying context lengths and needle positions. Results are typically reported as an accuracy grid of context length × needle position.
Implementation

import random
import string

def create_niah_test(
    needle: str,
    haystack_tokens: int = 100_000,
    needle_position: float = 0.5,  # 0.0 = start, 1.0 = end
) -> dict:
    """
    Create a Needle in a Haystack test case.

    Args:
        needle: The fact to hide
        haystack_tokens: Total context length in tokens
        needle_position: Where to place the needle (0-1)
    """
    # Generate haystack text (random coherent-looking text).
    # In practice, use real documents (Paul Graham essays, Wikipedia, etc.)
    words_needed = int(haystack_tokens * 0.75)  # rough heuristic: ~0.75 words per token

    filler_topics = [
        "The history of maritime navigation spans thousands of years.",
        "Ancient civilizations developed sophisticated irrigation systems.",
        "The study of celestial bodies has fascinated humanity.",
        "Philosophical debates about consciousness continue today.",
    ]

    haystack_words = []
    while len(haystack_words) < words_needed:
        haystack_words.extend(random.choice(filler_topics).split())

    haystack = " ".join(haystack_words[:words_needed])

    # Insert needle at specified position
    words = haystack.split()
    insert_idx = int(len(words) * needle_position)
    words.insert(insert_idx, needle)

    return {
        "context": " ".join(words),
        "needle": needle,
        "position": needle_position,
        "total_tokens": haystack_tokens,
        "question": f"Based on the context above, {needle.split(':')[0].lower().strip()}?",
    }


def run_niah_suite(model_fn, context_lengths, positions):
    """
    Run a full NIAH evaluation across lengths and positions.

    Returns accuracy matrix: positions × context_lengths
    """
    needle = "The secret code for Project Alpha is: CRIMSON-7749"
    question = "What is the secret code for Project Alpha?"

    results = {}

    for length in context_lengths:
        for pos in positions:
            test = create_niah_test(needle, length, pos)

            # Call model
            response = model_fn(test["context"] + f"\n\nQuestion: {question}")

            # Check if needle content is in response
            correct = "CRIMSON-7749" in response
            results[(length, pos)] = correct

    # Display results
    print(f"{'':>12}", end="")
    for pos in positions:
        print(f"{pos:>8.1%}", end="")
    print()

    for length in context_lengths:
        print(f"{length:>10,}t", end="")
        for pos in positions:
            result = "✓" if results.get((length, pos)) else "✗"
            print(f"{result:>8}", end="")
        print()

    return results
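To see the harness shape end to end, here is a compressed, self-contained sketch. The "model" is a stub that just scans its prompt; a real run would call an LLM API at that point, and the word counts and filler text are illustrative.

```python
import random

def make_haystack(n_words: int, needle: str, position: float) -> str:
    """Build a filler context of n_words words with the needle inserted."""
    filler = "the study of long context windows continues to evolve rapidly".split()
    words = [random.choice(filler) for _ in range(n_words)]
    words.insert(int(n_words * position), needle)
    return " ".join(words)

def stub_model(prompt: str) -> str:
    # Stand-in for an LLM call: "retrieves" the code iff it appears in the prompt.
    return "CRIMSON-7749" if "CRIMSON-7749" in prompt else "unknown"

results = {}
for n_words in (1_000, 10_000):
    for pos in (0.0, 0.5, 1.0):
        ctx = make_haystack(n_words, "The secret code is: CRIMSON-7749", pos)
        results[(n_words, pos)] = "CRIMSON-7749" in stub_model(ctx)

print(all(results.values()))  # True: the stub finds the needle at every position
```

Swapping the stub for a real model call turns this into the same length × position sweep that `run_niah_suite` performs.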

What It Misses

NIAH is the easiest possible long-context test. It only requires:

  1. Noticing that one sentence is unlike its surroundings
  2. Copying that sentence (or part of it) into the answer

There is no aggregation, no reasoning, and no use of the rest of the context. A model can score 100% on NIAH and still fail at real-world tasks.

RULER: Beyond Simple Retrieval

What It Is

RULER (Hsieh et al., 2024) extends NIAH with four categories of tasks:

  1. Single NIAH: Standard needle retrieval (baseline)
  2. Multi-Key NIAH: Find multiple needles scattered throughout
  3. Multi-Value NIAH: Find needles with multiple associated values
  4. Multi-Query NIAH: Answer multiple questions about different needles

Plus aggregation tasks:

  5. Common Words: Find the most frequent word across a long text
  6. Frequent Words: Count occurrences of specific words

And tracking tasks:

  7. Variable Tracking: Track variable assignments through code-like text

Scoring

RULER uses a composite score across all tasks, weighted by difficulty:

\text{RULER Score} = \sum_{i=1}^{7} w_i \times \text{Accuracy}_i

where w_i are the task weights (harder tasks weighted more).
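As a concrete instance of the formula, here is a sketch with made-up per-task accuracies and weights (not the official RULER values), normalized by total weight so the score stays in [0, 1]:

```python
# Illustrative per-task accuracies and weights -- not the official RULER values.
task_accuracy = {
    "single_niah": 0.98, "multi_key_niah": 0.91, "multi_value_niah": 0.85,
    "multi_query_niah": 0.88, "common_words": 0.72, "frequent_words": 0.69,
    "variable_tracking": 0.64,
}
weights = {  # harder tasks weighted more, per the formula above
    "single_niah": 0.5, "multi_key_niah": 1.0, "multi_value_niah": 1.0,
    "multi_query_niah": 1.0, "common_words": 1.5, "frequent_words": 1.5,
    "variable_tracking": 1.5,
}

# Weighted sum of accuracies, normalized by total weight.
ruler_score = sum(weights[t] * a for t, a in task_accuracy.items()) / sum(weights.values())
print(f"{ruler_score:.3f}")  # 0.776
```

Note how the high single-needle accuracy (0.98) barely moves the composite: the heavily weighted aggregation and tracking tasks dominate the score.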

Results

Model             RULER at 128K   RULER at 512K   RULER at 1M
GPT-4.1                    84.7            62.3          37.1
Claude 3.5                 87.2            71.4          52.3
Gemini 1.5 Pro             82.1            65.8          48.6
Llama 3.1 70B              79.3            52.1          28.4
Key observation: all models degrade significantly beyond 128K. The gap between NIAH (near-100%) and RULER (50-85%) reveals the difference between “can find a needle” and “can work with long context.”

MRCR: Multi-Needle Retrieval with Context Reasoning

What It Is

MRCR (2024) tests whether models can retrieve and reason across multiple pieces of scattered information:

[Context: 50K tokens of project documentation]

Scattered throughout:
- "The API rate limit is 1000 requests per minute" (position 12%)
- "Each request consumes 50 compute units" (position 45%)
- "The monthly compute budget is 5,000,000 units" (position 78%)

Question: "How many requests can we make per month
           before exceeding the compute budget?"

Required reasoning: 5,000,000 / 50 = 100,000 requests/month
                    At 1000/min: 100 minutes of max throughput

This requires:

  1. Finding all three facts (multi-needle retrieval)
  2. Connecting them logically (reasoning)
  3. Computing the answer (arithmetic)
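The chain of reasoning the model must reconstruct, written out as arithmetic:

```python
# The three scattered facts, as numbers the model must extract and combine.
rate_limit_per_min = 1_000        # "1000 requests per minute"   (position 12%)
units_per_request = 50            # "50 compute units"           (position 45%)
monthly_budget_units = 5_000_000  # "5,000,000 units budget"     (position 78%)

requests_per_month = monthly_budget_units // units_per_request
minutes_at_max_throughput = requests_per_month // rate_limit_per_min

print(requests_per_month, minutes_at_max_throughput)  # 100000 100
```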
def generate_filler_text(n_tokens: int) -> str:
    """Illustrative filler generator: ~0.75 words per token of repeated prose."""
    sentence = "The quarterly planning document outlines several ongoing initiatives."
    n_words = int(n_tokens * 0.75)
    words = sentence.split() * (n_words // 8 + 1)
    return " ".join(words[:n_words])


def create_mrcr_test(
    facts: list[tuple[str, float]],  # (fact_text, position)
    question: str,
    answer: str,
    context_tokens: int = 100_000,
) -> dict:
    """Create an MRCR test case with multiple scattered facts."""

    # Generate filler text
    filler = generate_filler_text(context_tokens)
    words = filler.split()

    # Insert facts back-to-front so earlier insertion indices stay valid
    for fact, position in sorted(facts, key=lambda x: x[1], reverse=True):
        insert_idx = int(len(words) * position)
        # Wrap fact to blend in
        wrapped = f"Important note: {fact}"
        words.insert(insert_idx, wrapped)

    return {
        "context": " ".join(words),
        "question": question,
        "expected_answer": answer,
        "n_facts": len(facts),
        "fact_positions": [pos for _, pos in facts],
    }

def evaluate_mrcr(model_fn, test_cases):
    """Evaluate model on MRCR test suite."""
    results = {"reasoning": 0, "total": 0}

    for test in test_cases:
        response = model_fn(
            test["context"] + f"\n\nQuestion: {test['question']}"
        )

        # Check if answer is correct
        correct = test["expected_answer"].lower() in response.lower()
        results["total"] += 1
        if correct:
            results["reasoning"] += 1

    accuracy = results["reasoning"] / results["total"] * 100
    print(f"MRCR Accuracy: {accuracy:.1f}% ({results['reasoning']}/{results['total']})")
    return accuracy

What Benchmarks Miss: Context Rot

All three benchmarks test retrieval — can the model find information? But real-world long-context usage involves sustained quality — does the model’s reasoning, instruction following, and coherence degrade over time?

Context rot manifests as:

  - Instructions from early in the context being ignored
  - Reasoning quality degrading as the window fills
  - Responses drifting off-topic or losing coherence

No current benchmark adequately measures context rot because it's a gradual, multi-dimensional degradation rather than a binary pass/fail.

Building Your Own Evaluation

def custom_context_evaluation(model_fn, your_documents, your_questions):
    """
    Build a task-specific context evaluation.

    Key: test with YOUR data and YOUR tasks, not synthetic benchmarks.
    """
    results = []

    for ctx_len in [4_000, 32_000, 128_000]:
        for question, expected in your_questions:
            # Pad or truncate documents to target length
            context = prepare_context(your_documents, ctx_len)

            response = model_fn(context + f"\n\n{question}")

            score = evaluate_response(response, expected)
            results.append({
                "context_length": ctx_len,
                "question": question,
                "score": score,
            })

    # Analyze degradation
    for ctx_len in [4_000, 32_000, 128_000]:
        scores = [r["score"] for r in results if r["context_length"] == ctx_len]
        avg = sum(scores) / len(scores)
        print(f"Context {ctx_len:>8,}: avg score = {avg:.2f}")

    return results
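The helpers prepare_context and evaluate_response are left open above because they are task-specific. As one minimal, intentionally simplistic choice (whitespace truncation plus substring scoring, both illustrative assumptions):

```python
# Minimal illustrative helpers for the evaluation sketch above.
# prepare_context truncates by whitespace words (~0.75 words per token);
# evaluate_response scores 1.0 for a case-insensitive substring match.
def prepare_context(documents: list[str], target_tokens: int) -> str:
    words = " ".join(documents).split()
    return " ".join(words[: int(target_tokens * 0.75)])

def evaluate_response(response: str, expected: str) -> float:
    return 1.0 if expected.lower() in response.lower() else 0.0

print(prepare_context(["alpha beta gamma delta"], 4))  # alpha beta gamma
print(evaluate_response("The limit is 1000 requests.", "1000 requests"))  # 1.0
```

Substring scoring is a blunt instrument; for free-form answers you would likely want an exact-match normalizer or an LLM judge instead.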

The Bottom Line

Benchmark     Tests                                Difficulty   Real-World Relevance
NIAH          Single fact retrieval                Easy         Low
RULER         Multi-fact retrieval + aggregation   Medium       Medium
MRCR          Multi-fact reasoning                 Hard         Medium-High
Custom eval   Your actual use case                 Varies       Highest

Don’t trust a model’s context window claims based on NIAH alone. A model that finds a needle at 1M tokens may still fail at reasoning over 50K tokens of real code.

The best benchmark is always your own data, your own tasks, at your target context length.


ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai