A practical guide to measuring retrieval quality in GraphRAG systems operating across multiple repositories. Covers gold-standard design, graded relevance metrics, cross-repository precision, graph traversal evaluation, and version coherence to ensure correct multi-repo retrieval.

How to actually measure whether your graph-based retriever is doing the right thing
If you are building a GraphRAG system, retrieval is where most of the real intelligence lives.
Generation gets all the attention. Retrieval decides whether generation even has a chance to be correct.
This post focuses on one narrow but critical question.
How do you evaluate the retrieval layer of a GraphRAG system in a way that reflects how graph traversal actually works?
Not generic vector search.
Not surface-level recall metrics.
But evaluation that tells you whether the graph was traversed correctly across repositories, versions, and relationships.
Let’s start simple.
In a classic RAG setup, retrieval usually means embedding the query, searching a vector index, and returning the top K chunks.
Evaluation is mostly about ranking quality.
GraphRAG changes the shape of the problem.
Now retrieval involves starting nodes inferred from the query, graph traversal across relationships, multiple repositories and files, version and branch constraints, and finally a set of documents produced by those paths.
So when retrieval fails, you need to know why.
If your metrics cannot tell you that, they are not useful.
At a high level, retrieval evaluation for GraphRAG has three parallel tracks.
First, you evaluate what was retrieved.
Second, you evaluate how the graph was traversed.
Third, you aggregate those signals so you can compare changes over time.
Conceptually, the pipeline looks like this.
Query goes into the GraphRAG retriever.
The retriever produces documents and traversal logs.
Retrieved documents are compared against gold labels.
Traversal logs are compared against expected graph paths.
Metrics are aggregated across queries and query types.
Sounds straightforward. The details are where things get tricky.
Before metrics, you need labels that make sense for graphs.
A flat list of correct documents is not enough.
For each query, you usually want to capture three things.
Essential documents.
These are required for a correct answer. Missing them is a hard failure.
Helpful documents.
Nice-to-have context that improves answers but is not strictly required.
Essential repositories.
This matters in multi-repo systems. Sometimes hitting the right repo is more important than the exact file.
This lets you grade retrieval instead of treating it as binary.
Minimal structure that works well in practice:
```python
from dataclasses import dataclass
from typing import Set

@dataclass
class GoldStandard:
    query_id: str
    essential_docs: Set[str]
    helpful_docs: Set[str]
    essential_repos: Set[str]
```

This alone already improves evaluation quality compared to most setups.
Normalized Discounted Cumulative Gain measures how well ranked results align with relevance.
Retrieving the right document at position one is far more valuable than at position ten.
Each document gets a relevance score. Relevance is discounted as rank increases using a log scale. The result is normalized against an ideal ranking.
In GraphRAG, relevance is rarely binary. Some documents are critical. Others are just helpful.
That maps cleanly to graded relevance.
```python
import math

def calculate_ndcg(retrieved, essential, helpful, k=10):
    def relevance(doc):
        if doc in essential:
            return 2
        if doc in helpful:
            return 1
        return 0

    # DCG: discount each document's relevance by log2 of its rank
    dcg = sum(relevance(d) / math.log2(i + 2) for i, d in enumerate(retrieved[:k]))
    # Ideal DCG: the best achievable ordering of the gold labels
    ideal = sorted([2] * len(essential) + [1] * len(helpful), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Think of searching a codebase for a bug.
The file containing the bug is essential.
The config file explaining behavior is helpful.
A random utility file is noise.
NDCG captures that difference.
Recall answers a simple question.
Did we retrieve what we absolutely needed?
For GraphRAG, recall should usually focus only on essential documents.
Why?
Because retrieving ten helpful files does not compensate for missing the one file that defines the behavior.
```python
def calculate_recall(retrieved, essential, k=20):
    if not essential:
        return 1.0  # nothing required, nothing missed
    retrieved_set = set(retrieved[:k])
    return len(retrieved_set & essential) / len(essential)
```

This metric catches silent failures that NDCG sometimes hides.
Mean Reciprocal Rank answers another important question.
How early did the system encounter something essential?
In graph traversal, this often correlates with path quality.
If the first essential document appears late, traversal likely wandered.
```python
def calculate_mrr(retrieved, essential):
    for i, doc in enumerate(retrieved):
        if doc in essential:
            return 1.0 / (i + 1)
    return 0.0
```

Sounds simple. It is surprisingly diagnostic.
Here is where GraphRAG diverges from normal RAG.
Sometimes the exact file does not matter yet.
Reaching the correct repository does.
Cross-repo precision is the fraction of unique retrieved repositories that are actually relevant.
Wrong-repo traversal is one of the most common GraphRAG failure modes.
Deduplicate repositories in rank order, take the first K, and check how many are essential.
```python
def calculate_cross_repo_precision(retrieved_repos, essential_repos, k=5):
    unique = []
    for r in retrieved_repos:
        if r not in unique:
            unique.append(r)
        if len(unique) >= k:
            break
    if not unique:
        return 0.0  # guard against empty retrieval
    return sum(1 for r in unique if r in essential_repos) / len(unique)
```

If this drops, your graph routing logic is probably off.
Retrieval outputs are only half the story.
GraphRAG systems should log traversal paths.
Not just final results. The actual edges and nodes visited.
Minimal traversal log:
```python
from dataclasses import dataclass
from typing import List

@dataclass
class GraphTraversalLog:
    query_id: str
    start_nodes: List[str]
    traversed_edges: List[tuple]
    final_nodes: List[str]
```

Gold traversal annotation:
```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class GoldGraphPath:
    query_id: str
    expected_edges: List[tuple]
    expected_nodes: Set[str]
```

Now you can measure things that normal RAG never sees.
Edge recall is the fraction of required edges that were actually traversed.
In cross-repo retrieval, the path often matters more than the endpoint.
Low edge recall means embeddings are not the problem. Graph logic is.
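A minimal sketch of edge recall over the traversal log and gold path shapes above; the function name is mine, not from any particular library:

```python
def calculate_edge_recall(traversed_edges, expected_edges):
    # Fraction of gold-standard edges the traversal actually visited.
    if not expected_edges:
        return 1.0  # nothing was required of this query
    traversed = set(traversed_edges)
    return sum(1 for e in expected_edges if e in traversed) / len(expected_edges)
```

Edges here are compared as exact tuples, so the gold annotations must use the same (source, relation, target) encoding as the traversal log.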
Node precision answers a painful question.
How much junk did the graph pull in along the way?
Low precision usually means overly permissive traversal rules, missing depth limits, or poorly weighted relationships.
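Node precision can be computed the same way, comparing visited nodes against the expected node set; again, the name is illustrative:

```python
def calculate_node_precision(visited_nodes, expected_nodes):
    # Fraction of visited nodes that belong to the expected path.
    if not visited_nodes:
        return 0.0
    visited = set(visited_nodes)
    return len(visited & set(expected_nodes)) / len(visited)
```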
GraphRAG systems often mix branches or commits if you are not careful.
Path coherence checks whether all retrieved nodes belong to consistent versions.
If the same repository appears with conflicting versions, the score collapses to zero.
This catches bugs that only show up in production.
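One way to sketch path coherence, assuming each retrieved node carries (repo, version) metadata; the all-or-nothing scoring mirrors the collapse-to-zero rule above:

```python
def calculate_path_coherence(node_versions):
    # node_versions: list of (repo, version) pairs for retrieved nodes.
    # Returns 1.0 when every repository appears with a single version,
    # 0.0 as soon as any repository mixes versions.
    seen = {}
    for repo, version in node_versions:
        if repo in seen and seen[repo] != version:
            return 0.0  # conflicting versions: coherence collapses
        seen[repo] = version
    return 1.0
```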
At this point, you have many metrics. You need orchestration.
A typical evaluator runs retrieval with logging, computes standard metrics, computes graph-specific metrics, and aggregates results by query type.
The key idea.
Always evaluate per query type, not just overall averages.
Cross-repo queries behave very differently from single-repo ones.
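The per-query-type aggregation can be sketched as follows, assuming each evaluated query yields a flat dict of metric scores plus a `query_type` tag; the structure is illustrative, not a fixed schema:

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_query_type(results):
    # results: list of dicts like
    #   {"query_type": "cross_repo", "ndcg": 0.8, "edge_recall": 0.6, ...}
    # Returns {query_type: {metric: mean}} so a regression in one query
    # class is not averaged away by the others.
    buckets = defaultdict(list)
    for r in results:
        buckets[r["query_type"]].append(r)

    report = {}
    for qtype, rows in buckets.items():
        metrics = [k for k in rows[0] if k != "query_type"]
        report[qtype] = {m: mean(row[m] for row in rows) for m in metrics}
    return report
```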
Your final output should answer questions like:
Did edge recall improve for dependency-chain queries?
Did cross-repo precision regress after graph changes?
Did version coherence break after a refactor?
That is how retrieval evaluation becomes actionable instead of decorative.
Healthy GraphRAG systems usually show strong essential-document recall, high cross-repo precision, high edge recall, clean node precision, and version coherence at or near 1.0.
When one of these drops, you know exactly where to look.
And that is the real goal.
Not a single number.
But a clear signal that tells you whether your graph is doing what you think it is doing.
If you are serious about GraphRAG quality, this layer deserves as much attention as generation.