A practical guide to measuring retrieval quality in GraphRAG systems operating across multiple repositories. Covers gold-standard design, graded relevance metrics, cross-repository precision, graph traversal evaluation, and version coherence to ensure correct multi-repo retrieval.

How to actually measure whether your graph-based retriever is doing the right thing
If you are building a GraphRAG system, retrieval is where most of the real intelligence lives.
Generation gets all the attention. Retrieval decides whether generation even has a chance to be correct.
This post focuses on one narrow but critical question.
How do you evaluate the retrieval layer of a GraphRAG system in a way that reflects how graph traversal actually works?
Not generic vector search.
Not surface-level recall metrics.
But evaluation that tells you whether the graph was traversed correctly across repositories, versions, and relationships.
Let’s start simple.
In a classic RAG setup, retrieval usually means embedding the query, searching a vector index, and returning the top K chunks.
Evaluation is mostly about ranking quality.
GraphRAG changes the shape of the problem.
Now retrieval involves starting nodes inferred from the query, graph traversal across relationships, multiple repositories and files, version and branch constraints, and finally a set of documents produced by those paths.
So when retrieval fails, you need to know why.
If your metrics cannot tell you that, they are not useful.
At a high level, retrieval evaluation for GraphRAG has three parallel tracks.
First, you evaluate what was retrieved.
Second, you evaluate how the graph was traversed.
Third, you aggregate those signals so you can compare changes over time.
Conceptually, the pipeline looks like this.
Query goes into the GraphRAG retriever.
The retriever produces documents and traversal logs.
Retrieved documents are compared against gold labels.
Traversal logs are compared against expected graph paths.
Metrics are aggregated across queries and query types.
Sounds straightforward. The details are where things get tricky.
Before metrics, you need labels that make sense for graphs.
A flat list of correct documents is not enough.
For each query, you usually want to capture three things.
Essential documents.
These are required for a correct answer. Missing them is a hard failure.
Helpful documents.
Nice-to-have context that improves answers but is not strictly required.
Essential repositories.
This matters in multi-repo systems. Sometimes hitting the right repo is more important than the exact file.
This lets you grade retrieval instead of treating it as binary.
Minimal structure that works well in practice:
```python
from dataclasses import dataclass
from typing import Set

@dataclass
class GoldStandard:
    query_id: str
    essential_docs: Set[str]
    helpful_docs: Set[str]
    essential_repos: Set[str]
```

This alone already improves evaluation quality compared to most setups.
Normalized Discounted Cumulative Gain measures how well ranked results align with relevance.
Retrieving the right document at position one is far more valuable than at position ten.
Each document gets a relevance score. Relevance is discounted as rank increases using a log scale. The result is normalized against an ideal ranking.
In GraphRAG, relevance is rarely binary. Some documents are critical. Others are just helpful.
That maps cleanly to graded relevance.
```python
import math

def calculate_ndcg(retrieved, essential, helpful, k=10):
    def relevance(doc):
        if doc in essential:
            return 2
        if doc in helpful:
            return 1
        return 0

    # DCG: discount each document's relevance by log2 of its rank
    dcg = sum(relevance(d) / math.log2(i + 2) for i, d in enumerate(retrieved[:k]))
    # Ideal DCG: the best achievable ordering of the gold labels
    ideal = sorted([2] * len(essential) + [1] * len(helpful), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Think of searching a codebase for a bug.
The file containing the bug is essential.
The config file explaining behavior is helpful.
A random utility file is noise.
NDCG captures that difference.
Recall answers a simple question.
Did we retrieve what we absolutely needed?
For GraphRAG, recall should usually focus only on essential documents.
Why?
Because retrieving ten helpful files does not compensate for missing the one file that defines the behavior.
```python
def calculate_recall(retrieved, essential, k=20):
    if not essential:
        return 1.0  # nothing required, nothing missed
    retrieved_set = set(retrieved[:k])
    return len(retrieved_set & essential) / len(essential)
```

This metric catches silent failures that NDCG sometimes hides.
Mean Reciprocal Rank answers another important question.
How early did the system encounter something essential?
In graph traversal, this often correlates with path quality.
If the first essential document appears late, traversal likely wandered.
```python
def calculate_mrr(retrieved, essential):
    for i, doc in enumerate(retrieved):
        if doc in essential:
            return 1.0 / (i + 1)
    return 0.0
```

Sounds simple. It is surprisingly diagnostic.
Here is where GraphRAG diverges from normal RAG.
Sometimes the exact file does not matter yet.
Reaching the correct repository does.
Cross-repo precision is the fraction of unique retrieved repositories that are actually relevant.
Wrong-repo traversal is one of the most common GraphRAG failure modes.
Deduplicate repositories in rank order, take the first K, and check how many are essential.
```python
def calculate_cross_repo_precision(retrieved_repos, essential_repos, k=5):
    unique = []
    for r in retrieved_repos:
        if r not in unique:
            unique.append(r)
        if len(unique) >= k:
            break
    if not unique:
        return 0.0  # guard against empty retrieval
    return sum(1 for r in unique if r in essential_repos) / len(unique)
```

If this drops, your graph routing logic is probably off.
Retrieval outputs are only half the story.
GraphRAG systems should log traversal paths.
Not just final results. The actual edges and nodes visited.
Minimal traversal log:
```python
from dataclasses import dataclass
from typing import List

@dataclass
class GraphTraversalLog:
    query_id: str
    start_nodes: List[str]
    traversed_edges: List[tuple]
    final_nodes: List[str]
```

Gold traversal annotation:
```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class GoldGraphPath:
    query_id: str
    expected_edges: List[tuple]
    expected_nodes: Set[str]
```

Now you can measure things that normal RAG never sees.
Edge recall is the fraction of required edges that were actually traversed.
In cross-repo retrieval, the path often matters more than the endpoint.
Low edge recall means embeddings are not the problem. Graph logic is.
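A minimal sketch of edge recall over the traversal log and gold path shapes above; the function name is mine, not from any particular library:

```python
def calculate_edge_recall(traversed_edges, expected_edges):
    # Fraction of gold-standard edges the traversal actually visited.
    if not expected_edges:
        return 1.0  # nothing was required of this query
    traversed = set(traversed_edges)
    return sum(1 for e in expected_edges if e in traversed) / len(expected_edges)
```

Edges here are compared as exact tuples, so the gold annotations must use the same (source, relation, target) encoding as the traversal log.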
Node precision answers a painful question.
How much junk did the graph pull in along the way?
Low precision usually means overly permissive traversal rules, missing depth limits, or poorly weighted relationships.
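Node precision can be computed the same way, comparing visited nodes against the expected node set; again, the name is illustrative:

```python
def calculate_node_precision(visited_nodes, expected_nodes):
    # Fraction of visited nodes that belong to the expected path.
    if not visited_nodes:
        return 0.0
    visited = set(visited_nodes)
    return len(visited & set(expected_nodes)) / len(visited)
```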
GraphRAG systems often mix branches or commits if you are not careful.
Path coherence checks whether all retrieved nodes belong to consistent versions.
If the same repository appears with conflicting versions, the score collapses to zero.
This catches bugs that only show up in production.
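One way to sketch path coherence, assuming each retrieved node carries (repo, version) metadata; the all-or-nothing scoring mirrors the collapse-to-zero rule above:

```python
def calculate_path_coherence(node_versions):
    # node_versions: list of (repo, version) pairs for retrieved nodes.
    # Returns 1.0 when every repository appears with a single version,
    # 0.0 as soon as any repository mixes versions.
    seen = {}
    for repo, version in node_versions:
        if repo in seen and seen[repo] != version:
            return 0.0  # conflicting versions: coherence collapses
        seen[repo] = version
    return 1.0
```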
At this point, you have many metrics. You need orchestration.
A typical evaluator runs retrieval with logging, computes standard metrics, computes graph-specific metrics, and aggregates results by query type.
The key idea.
Always evaluate per query type, not just overall averages.
Cross-repo queries behave very differently from single-repo ones.
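The per-query-type aggregation can be sketched as follows, assuming each evaluated query yields a flat dict of metric scores plus a `query_type` tag; the structure is illustrative, not a fixed schema:

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_query_type(results):
    # results: list of dicts like
    #   {"query_type": "cross_repo", "ndcg": 0.8, "edge_recall": 0.6, ...}
    # Returns {query_type: {metric: mean}} so a regression in one query
    # class is not averaged away by the others.
    buckets = defaultdict(list)
    for r in results:
        buckets[r["query_type"]].append(r)

    report = {}
    for qtype, rows in buckets.items():
        metrics = [k for k in rows[0] if k != "query_type"]
        report[qtype] = {m: mean(row[m] for row in rows) for m in metrics}
    return report
```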
Your final output should answer questions like:
Did edge recall improve for dependency-chain queries?
Did cross-repo precision regress after graph changes?
Did version coherence break after a refactor?
That is how retrieval evaluation becomes actionable instead of decorative.
Healthy GraphRAG systems usually show strong essential-document recall, high cross-repo precision, high edge recall, clean node precision, and version coherence at or near 1.0.
When one of these drops, you know exactly where to look.
And that is the real goal.
Not a single number.
But a clear signal that tells you whether your graph is doing what you think it is doing.
If you are serious about GraphRAG quality, this layer deserves as much attention as generation.