Implementing Retrieval Evaluation for GraphRAG
How to actually measure whether your graph-based retriever is doing the right thing
If you are building a GraphRAG system, retrieval is where most of the real intelligence lives.
Generation gets all the attention. Retrieval decides whether generation even has a chance to be correct.
This post focuses on one narrow but critical question.
How do you evaluate the retrieval layer of a GraphRAG system in a way that reflects how graph traversal actually works?
Not generic vector search.
Not surface-level recall metrics.
But evaluation that tells you whether the graph was traversed correctly across repositories, versions, and relationships.
Why retrieval evaluation is harder in GraphRAG than normal RAG
Let’s start simple.
In a classic RAG setup, retrieval usually means embedding the query, searching a vector index, and returning the top K chunks.
Evaluation is mostly about ranking quality.
GraphRAG changes the shape of the problem.
Now retrieval involves starting nodes inferred from the query, graph traversal across relationships, multiple repositories and files, version and branch constraints, and finally a set of documents produced by those paths.
So when retrieval fails, you need to know why.
- Did it miss the right documents?
- Did it traverse the wrong edges?
- Did it reach the right repo but the wrong version?
- Did it explode into noisy paths?
If your metrics cannot tell you that, they are not useful.
The retrieval evaluation pipeline in practice
At a high level, retrieval evaluation for GraphRAG has three parts.
First, you evaluate what was retrieved.
Second, you evaluate how the graph was traversed.
Third, you aggregate those signals so you can compare changes over time.
Conceptually, the pipeline looks like this.
Query goes into the GraphRAG retriever.
The retriever produces documents and traversal logs.
Retrieved documents are compared against gold labels.
Traversal logs are compared against expected graph paths.
Metrics are aggregated across queries and query types.
Sounds straightforward. The details are where things get tricky.
Step one: defining a gold standard that actually works
Before metrics, you need labels that make sense for graphs.
A flat list of correct documents is not enough.
For each query, you usually want to capture three things.
- Essential documents. These are required for a correct answer. Missing them is a hard failure.
- Helpful documents. Nice-to-have context that improves answers but is not strictly required.
- Essential repositories. This matters in multi-repo systems. Sometimes hitting the right repo is more important than the exact file.
This lets you grade retrieval instead of treating it as binary.
Minimal structure that works well in practice:
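    from dataclasses import dataclass
    from typing import Set

    @dataclass
    class GoldStandard:
        query_id: str
        essential_docs: Set[str]   # required for a correct answer
        helpful_docs: Set[str]     # useful context, not strictly required
        essential_repos: Set[str]  # repositories that must be reached
This alone already improves evaluation quality compared to most setups.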
Measuring ranking quality with graded relevance
What it is
Normalized Discounted Cumulative Gain (NDCG) measures how well ranked results align with relevance.
Why it exists
Retrieving the right document at position one is far more valuable than at position ten.
How it works technically
Each document gets a relevance score. Relevance is discounted as rank increases using a log scale. The result is normalized against an ideal ranking.
In GraphRAG, relevance is rarely binary. Some documents are critical. Others are just helpful.
That maps cleanly to graded relevance.
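Written out, the description above corresponds to the usual linear-gain form of the metric (some implementations use 2^rel - 1 as the gain instead). The symbols are notation introduced here for illustration: rel_i is the relevance grade of the document at rank i, and IDCG@k is the DCG of the best possible ordering of the labeled documents.

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```

As a worked example with grades 2 (essential), 1 (helpful), and 0 (noise): suppose the gold labels contain one essential and one helpful document, and the retriever returns helpful, essential, noise in that order. Then DCG = 1/log2(2) + 2/log2(3) + 0 ≈ 2.26 and IDCG = 2/log2(2) + 1/log2(3) ≈ 2.63, so NDCG ≈ 0.86.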
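    import math

    def calculate_ndcg(retrieved, essential, helpful, k=10):
        def relevance(doc):
            if doc in essential:
                return 2
            if doc in helpful:
                return 1
            return 0

        # DCG of the top-k results, then the ideal DCG (essentials first) to normalize.
        dcg = sum(relevance(d) / math.log2(i + 2) for i, d in enumerate(retrieved[:k]))
        ideal = ([2] * len(essential) + [1] * len(helpful - essential))[:k]
        idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
        return dcg / idcg if idcg else 0.0
Real-world analogy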
Think of searching a codebase for a bug.
The file containing the bug is essential.
The config file explaining behavior is helpful.
A random utility file is noise.
NDCG captures that difference.
Recall still matters, but only for essentials
Recall answers a simple question.
Did we retrieve what we absolutely needed?
For GraphRAG, recall should usually focus only on essential documents.
Why?
Because retrieving ten helpful files does not compensate for missing the one file that defines the behavior.
    def calculate_recall(retrieved, essential, k=20):
        if not essential:
            return 0.0  # no essential documents labeled, so recall is undefined; report 0.0
        retrieved_set = set(retrieved[:k])
        return len(retrieved_set & essential) / len(essential)
This metric catches silent failures that NDCG sometimes hides.
MRR tells you how fast the graph found the truth
Mean Reciprocal Rank answers another important question.
How early did the system encounter something essential?
In graph traversal, this often correlates with path quality.
If the first essential document appears late, traversal likely wandered.
    def calculate_mrr(retrieved, essential):
        # Reciprocal rank for one query; average over all queries to get MRR.
        for i, doc in enumerate(retrieved):
            if doc in essential:
                return 1.0 / (i + 1)
        return 0.0
Sounds simple. It is surprisingly diagnostic.
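The function above scores one query at a time. The "mean" in MRR comes from averaging those scores over the evaluation set. A minimal sketch, assuming results and labels are held in two dictionaries keyed by query id (`retrieved_by_query` and `essential_by_query` are hypothetical names, not part of any library):

```python
from typing import Dict, List, Set


def reciprocal_rank(retrieved: List[str], essential: Set[str]) -> float:
    """Reciprocal rank of the first essential document, or 0.0 if none appears."""
    for i, doc in enumerate(retrieved):
        if doc in essential:
            return 1.0 / (i + 1)
    return 0.0


def mean_reciprocal_rank(
    retrieved_by_query: Dict[str, List[str]],
    essential_by_query: Dict[str, Set[str]],
) -> float:
    """Average reciprocal rank over every query that has gold labels."""
    if not essential_by_query:
        return 0.0
    total = sum(
        # A labeled query with no retrieval output counts as a miss (0.0).
        reciprocal_rank(retrieved_by_query.get(qid, []), essential)
        for qid, essential in essential_by_query.items()
    )
    return total / len(essential_by_query)
```

Counting labeled queries with no retrieval output as misses is a design choice. It keeps a crashed or empty retrieval from quietly inflating the average.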
Cross-repository precision is GraphRAG-specific
Here is where GraphRAG diverges from normal RAG.
Sometimes the exact file does not matter yet.
Reaching the correct repository does.
What it is
The fraction of unique retrieved repositories that are actually relevant.
Why it exists
Wrong-repo traversal is one of the most common GraphRAG failure modes.
How it works
Deduplicate repositories in rank order, take the first K, and check how many are essential.
    def calculate_cross_repo_precision(retrieved_repos, essential_repos, k=5):
        # Deduplicate repositories in rank order and keep the first k.
        unique = []
        for r in retrieved_repos:
            if r not in unique:
                unique.append(r)
            if len(unique) >= k:
                break
        if not unique:
            return 0.0  # nothing retrieved; avoid dividing by zero
        return sum(1 for r in unique if r in essential_repos) / len(unique)
If this drops, your graph routing logic is probably off.
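The function above expects a ranked list of repositories, while the retriever usually returns documents. One way to bridge the two, sketched here under the assumption that document ids look like "<repo>/<path>" (a hypothetical id scheme; adapt the split to however your ids encode the repository):

```python
from typing import List


def repos_in_rank_order(retrieved_docs: List[str]) -> List[str]:
    """Map ranked document ids to their repositories, preserving rank order.

    Assumes ids of the form "<repo>/<path>", for example
    "billing-service/src/invoice.py". Duplicates are kept on purpose so the
    precision function can deduplicate in rank order itself.
    """
    return [doc_id.split("/", 1)[0] for doc_id in retrieved_docs]
```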
Logging and evaluating graph traversal itself
Retrieval outputs are only half the story.
GraphRAG systems should log traversal paths.
Not just final results. The actual edges and nodes visited.
Minimal traversal log:
    from dataclasses import dataclass
    from typing import List, Set

    @dataclass
    class GraphTraversalLog:
        query_id: str
        start_nodes: List[str]
        traversed_edges: List[tuple]
        final_nodes: List[str]
Gold traversal annotation:
    @dataclass
    class GoldGraphPath:
        query_id: str
        expected_edges: List[tuple]
        expected_nodes: Set[str]
Now you can measure things that normal RAG never sees.
Edge recall tells you if the graph walked the right path
What it is
Fraction of required edges that were actually traversed.
Why it exists
In cross-repo retrieval, the path often matters more than the endpoint.
Low edge recall means embeddings are not the problem. Graph logic is.
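A minimal sketch of edge recall, assuming each edge is logged as a (source node, relationship, target node) tuple. That encoding is an assumption made for illustration; any hashable edge representation works as long as the log and the gold annotation use the same one.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str, str]  # assumed shape: (source node, relationship, target node)


def calculate_edge_recall(
    traversed_edges: Iterable[Edge],
    expected_edges: Iterable[Edge],
) -> float:
    """Fraction of expected edges that appear anywhere in the traversal log."""
    expected: Set[Edge] = set(expected_edges)
    if not expected:
        return 0.0  # no annotated path for this query; reported as 0.0 here
    traversed: Set[Edge] = set(traversed_edges)
    return len(expected & traversed) / len(expected)
```

This version ignores the order in which edges were traversed. If order matters for your queries, that would need a separate check.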
Node precision exposes noisy traversal
Node precision answers a painful question.
How much junk did the graph pull in along the way?
Low precision usually means overly permissive traversal rules, missing depth limits, or poorly weighted relationships.
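One way to compute node precision is sketched below. It assumes "visited" means the start nodes plus both endpoints of every traversed edge, and that edges are (source node, relationship, target node) tuples. Both are assumptions for illustration, and `visited_nodes` is a hypothetical helper.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str, str]  # assumed shape: (source node, relationship, target node)


def visited_nodes(start_nodes: Iterable[str], traversed_edges: Iterable[Edge]) -> Set[str]:
    """Every node touched during traversal: start nodes plus both ends of each edge."""
    nodes = set(start_nodes)
    for source, _relationship, target in traversed_edges:
        nodes.add(source)
        nodes.add(target)
    return nodes


def calculate_node_precision(visited: Set[str], expected_nodes: Set[str]) -> float:
    """Fraction of visited nodes that belong to the gold node set."""
    if not visited:
        return 0.0  # nothing visited; reported as 0.0 by convention
    return len(visited & expected_nodes) / len(visited)
```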
Version coherence prevents subtle correctness bugs
GraphRAG systems often mix branches or commits if you are not careful.
Version coherence checks whether all retrieved nodes belong to consistent versions.
If the same repository appears with conflicting versions, the score collapses to zero.
This catches bugs that would otherwise only show up in production.
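The all-or-nothing check described above can be sketched as follows. The input shape is an assumption: one (repository, version) pair per retrieved node, where the version may be a branch name or a commit hash.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def calculate_version_coherence(node_versions: Iterable[Tuple[str, str]]) -> float:
    """Return 1.0 if every repository appears at exactly one version, else 0.0.

    `node_versions` is an assumed input: one (repository, version) pair per
    retrieved node.
    """
    versions_by_repo: Dict[str, Set[str]] = defaultdict(set)
    for repo, version in node_versions:
        versions_by_repo[repo].add(version)
    # A single repository seen at two different versions makes the whole result incoherent.
    if any(len(versions) > 1 for versions in versions_by_repo.values()):
        return 0.0
    return 1.0
```

An empty result is treated as coherent here, since no conflict exists. Different repositories at different versions are also fine; only a conflict within one repository zeroes the score.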
Putting everything together in an evaluation harness
At this point, you have many metrics. You need orchestration.
A typical evaluator runs retrieval with logging, computes standard metrics, computes graph-specific metrics, and aggregates results by query type.
The key idea: always evaluate per query type, not just overall averages.
Cross-repo queries behave very differently from single-repo ones.
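The aggregation step might look like the sketch below. The record layout (a `query_type` string and a `metrics` dictionary per query) and the query type names in the usage example are assumptions for illustration, not a fixed schema.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, Iterable, List, Mapping


def aggregate_by_query_type(
    per_query: Iterable[Mapping[str, Any]],
) -> Dict[str, Dict[str, float]]:
    """Average each metric within each query type.

    Each record is assumed to look like:
    {"query_type": "cross_repo", "metrics": {"recall": 1.0, "edge_recall": 0.5}}
    """
    buckets: Dict[str, Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))
    for record in per_query:
        for name, value in record["metrics"].items():
            buckets[record["query_type"]][name].append(value)
    return {
        qtype: {name: mean(values) for name, values in metrics.items()}
        for qtype, metrics in buckets.items()
    }


# Usage with made-up numbers: two cross-repo queries and one single-repo query.
results = [
    {"query_type": "cross_repo", "metrics": {"recall": 1.0, "edge_recall": 0.5}},
    {"query_type": "cross_repo", "metrics": {"recall": 0.0, "edge_recall": 1.0}},
    {"query_type": "single_repo", "metrics": {"recall": 1.0}},
]
summary = aggregate_by_query_type(results)
```

Metrics that are missing for a query type are simply absent from that type's summary, so graph-specific metrics can be recorded only for queries that have path annotations.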
Your final output should answer questions like:
- Did edge recall improve for dependency-chain queries?
- Did cross-repo precision regress after graph changes?
- Did version coherence break after a refactor?
That is how retrieval evaluation becomes actionable instead of decorative.
What good results actually look like
Healthy GraphRAG systems usually show:
- High recall for essential documents.
- NDCG that improves steadily with traversal tuning.
- Cross-repo precision above random baselines.
- High edge recall on annotated paths.
- Near-perfect version coherence.
When one of these drops, you know exactly where to look.
And that is the real goal.
Not a single number.
But a clear signal that tells you whether your graph is doing what you think it is doing.
If you are serious about GraphRAG quality, this layer deserves as much attention as generation.
