Existing benchmarks like HumanEval, MBPP, and SWE-Bench assume a single-file, isolated context and cannot evaluate GraphRAG systems that reason across tens of thousands of files, multiple repositories, and evolving services. This post explains the unique failure modes in cross-repository retrieval and what metrics actually matter.

Most coding LLM benchmarks were never designed for the kind of system you’re actually building.
If your model just needs to fill in a function body given a clear signature and docstring, benchmarks like HumanEval and MBPP are reasonable proxies. But if you’re building a GraphRAG-powered assistant that reasons across tens of thousands of files, multiple repositories, and evolving services, those benchmarks will tell you almost nothing about whether your system actually works.
This post explains why—and what you should measure instead.
Benchmarks like HumanEval, MBPP, and SWE-Bench share a common assumption: the problem is self-contained in a small, static context.
HumanEval / MBPP
You get:

- A function signature and a natural-language docstring describing the desired behavior
- Sometimes a few example inputs and outputs

And the model:

- Generates the function body, in a single file, with no external context

The evaluation:

- Runs the completion against hidden unit tests and reports a pass rate (e.g., pass@k)

None of this requires retrieval. The "knowledge" is either:

- Already in the prompt, or
- Baked into the model's parameters
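The evaluation loop behind these benchmarks is simple enough to sketch. This is a hypothetical, minimal version of execution-based scoring (not the official harness, which sandboxes execution and computes pass@k over many samples):

```python
# Minimal sketch of HumanEval/MBPP-style execution-based scoring.
# Real harnesses isolate execution in a sandbox; this toy version
# just runs the completion and its tests in a shared namespace.

def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Return True if the model's completion passes the benchmark's unit tests."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the generated function
        exec(test_code, namespace)       # run the hidden assertions
        return True
    except Exception:
        return False

# Toy problem with a correct completion:
completion = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
```

Note what is absent: there is no retrieval step anywhere in this loop. The entire problem statement fits in the prompt.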
Even SWE-Bench, which uses real GitHub issues and multi-file patches, is still fundamentally a single-repository setting. The model is evaluated on whether it can modify files within a given codebase to pass tests, not on whether it can retrieve and integrate context from other repositories or services.
These benchmarks are essential for measuring core code generation and reasoning, but they are almost blind to retrieval quality, especially in realistic, cross-repo settings.
A GraphRAG system changes the game. Instead of “generate a function in this file,” your system operates more like:
“Given this question about our platform, find and reason over the relevant code, configs, schemas, and policies scattered across dozens of repositories.”
The assumptions behind standard benchmarks fall apart:
| Benchmark Assumption | GraphRAG Reality |
|---|---|
| Single file context | 25 repositories, 50,000+ files |
| Self-contained problems | Cross-service dependencies |
| Language-homogeneous | Polyglot (Python, TypeScript, SQL, YAML, etc.) |
| Static evaluation | Repositories evolve over time |
| Function-level generation | Multi-file, sometimes multi-repo modifications |
A GraphRAG architecture (like Microsoft’s GraphRAG designs) organizes code and artifacts into a graph of entities and relationships—services, APIs, database tables, configuration, policies, and more—then uses retrieval over that graph as the first-class operation. Generation is conditioned on what’s retrieved.
That’s a fundamentally different task from “complete this function in foo.py.”
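One way to picture the difference: the retrieval substrate is a typed graph over heterogeneous artifacts, not a file. A minimal sketch, with node and edge names invented for illustration (a real GraphRAG index would be built by static analysis and parsing, not written by hand):

```python
from collections import defaultdict

# Hypothetical toy graph: nodes are code/infra artifacts, edges are
# typed relationships between them.
edges: defaultdict[str, list[tuple[str, str]]] = defaultdict(list)

def relate(src: str, relation: str, dst: str) -> None:
    edges[src].append((relation, dst))

relate("web-frontend", "calls", "user-service:/login")
relate("user-service:/login", "reads", "db:users")
relate("db:users", "defined_in", "migrations/001_users.sql")
relate("user-service:/login", "governed_by", "compliance:auth-policy.md")

def neighbors(node: str) -> list[tuple[str, str]]:
    """Retrieval primitive: which artifacts relate directly to this one?"""
    return edges[node]
```

Generation is then conditioned on subgraphs returned by queries like `neighbors(...)`, not on a single file's contents.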
You might think: OK, but what about SWE-Bench and other repository-level benchmarks? Those are more realistic, right?
They’re a step forward, but still not enough.
Work like CodeRAG-Bench has shown a roughly 9-point performance gap between generation conditioned on gold (oracle) documents and generation conditioned on what the retriever actually returns.
In other words: even at single-repository scale, retrieval is hard. Once you move to cross-repository retrieval, everything gets worse: the search space is far larger, the same names mean different things in different repos, and the dependencies that matter often cross repository boundaries where no single file mentions them.
Your GraphRAG context graph and retrieval pipeline are built precisely to navigate this complexity. But no existing benchmark is designed to test whether your system actually does this well.
Cross-repository queries introduce failure modes that standard benchmarks simply cannot surface.
The same conceptual entity—User, Authentication, Payment, etc.—shows up in many places:
- `user-service` for core identity and auth
- `web-frontend` managing sessions and tokens
- `billing-service` tying users to subscriptions or invoices
- a `compliance` repository encoding security constraints

A query like "explain the user authentication flow" isn't just about one function or one service. A good GraphRAG pipeline must resolve which of these occurrences refer to the same underlying entity, and how they relate to one another.
Standard benchmarks have no notion of entity resolution across heterogeneous codebases. They assume a single, local context where entities live in one place.
Real systems are layered and interconnected: a frontend calls an API, the API reads from a database, the database schema lives in a migrations repository, and the whole path is subject to policies defined somewhere else entirely.
A GraphRAG system builds (or approximates) a dependency graph that captures exactly these relationships: which service calls which endpoint, which endpoint touches which table, which config governs which deployment.
Correct retrieval requires walking these paths end to end: starting from the entity the question names and following call, data, and policy edges until the relevant context is assembled.
Standard benchmarks have no mechanism to test whether the system walks the right edges in this graph. At best, they test "did you edit the right file in this one repo?"
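"Walking the right edges" is just a constrained graph traversal. A minimal BFS sketch over a hypothetical typed edge list (all names invented; a real index would be extracted from code, configs, and schemas):

```python
from collections import deque

# Hypothetical typed edges: (source, relation, destination).
EDGES = [
    ("web-frontend", "calls", "auth-api"),
    ("auth-api", "reads", "users-table"),
    ("users-table", "defined_in", "schema-repo/users.sql"),
    ("auth-api", "configured_by", "infra/auth.yaml"),
]

def walk(start: str, allowed: set[str], max_hops: int = 3) -> set[str]:
    """Collect every artifact reachable via allowed relation types."""
    adj: dict[str, list[tuple[str, str]]] = {}
    for src, rel, dst in EDGES:
        adj.setdefault(src, []).append((rel, dst))
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # hop budget exhausted along this path
        for rel, nxt in adj.get(node, []):
            if rel in allowed and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen
```

The evaluation question a cross-repo benchmark would need to ask is whether the traversal reaches the artifacts a human expert would consider necessary, and nothing irrelevant, for the given query.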
In a multi-repo, actively developed system, the repositories move at different speeds: one service is several releases ahead of another, feature branches diverge from `main`, and the code that is actually deployed is rarely the code at the tip of every default branch.
A GraphRAG system that naïvely pulls whatever each repository's `main` (or last-indexed snapshot) happens to contain will produce an internally incoherent context, leading to incorrect reasoning or suggestions.
Evaluating version coherence means asking: do all retrieved artifacts come from a mutually consistent snapshot? Does the retrieved configuration match the code version that is actually deployed? Does the answer mix branches that have never coexisted?
Standard benchmarks almost never model time, branches, or deployment states. They treat the repository as a fixed snapshot.
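A version-coherence check can be as simple as asserting that every retrieved artifact comes from the deployed ref of its repository. A toy sketch, with the manifest and field names invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Retrieved:
    repo: str
    path: str
    ref: str  # the git ref this chunk was indexed from

# Hypothetical deployment manifest: which ref of each repo is live.
DEPLOYED = {"user-service": "v2.3.1", "billing-service": "v1.9.0"}

def incoherent(chunks: list[Retrieved]) -> list[Retrieved]:
    """Flag chunks whose source ref disagrees with the deployed ref."""
    return [c for c in chunks if DEPLOYED.get(c.repo) not in (None, c.ref)]
```

A real system would need a richer notion of compatibility than exact ref equality (e.g., schema-compatible ranges), but even this trivial check is something no standard benchmark exercises.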
Given all of this, standard benchmarks play at most a supporting role.
They are good sanity checks: if your base model's HumanEval or SWE-Bench scores crater, something is wrong with core code generation, and no amount of retrieval will compensate.
They are not meaningful primary metrics, because they exercise none of the things your system is actually responsible for: entity resolution, cross-repo dependency traversal, or version coherence.
Relying on HumanEval, MBPP, or SWE-Bench scores to measure a GraphRAG system is like using a single-unit test to validate a distributed system: it might fail if things are badly broken, but it won’t tell you if the real system behavior is correct.
To evaluate a cross-repository GraphRAG system, you need metrics that align with its actual responsibilities.
At a minimum:

- Retrieval precision and recall against labeled gold sets of artifacts per query
- Entity resolution accuracy: are mentions of the same concept across repositories linked correctly?
- Multi-hop traversal correctness: does retrieval follow the right dependency edges?
- Version coherence: do all retrieved artifacts come from a mutually consistent snapshot?
- End-to-end answer correctness on questions that genuinely require cross-repo context
Beyond these, consider additional evaluative dimensions: latency and cost per query, robustness as repositories evolve (does quality degrade after a large refactor?), and graceful behavior when the graph is incomplete or stale.
Structure your evaluation plan around these axes, and treat standard single-repo benchmarks as ancillary sanity checks rather than primary determinants of success.
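For the retrieval metrics themselves, a labeled set of query-to-gold-artifact pairs plus plain set-based precision and recall goes a long way. A minimal sketch:

```python
def precision_recall(retrieved: set[str], gold: set[str]) -> tuple[float, float]:
    """Set-based retrieval precision and recall against a gold label set."""
    if not retrieved or not gold:
        return 0.0, 0.0
    hits = len(retrieved & gold)
    return hits / len(retrieved), hits / len(gold)
```

The hard part is not the arithmetic but the labeling: for cross-repo queries, the gold set must be assembled by people who understand the whole system, which is exactly why no off-the-shelf benchmark ships with it.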
These references provide the grounding for understanding why cross-repository context changes everything and how a GraphRAG system is architected to navigate it.