Evaluating Retrieval Quality in Cross-Repository GraphRAG Systems

A practical guide to measuring retrieval quality in GraphRAG systems operating across multiple repositories. Covers gold-standard design, graded relevance metrics, cross-repository precision, graph traversal evaluation, and version coherence to ensure correct multi-repo retrieval.

Implementing Retrieval Evaluation for GraphRAG

How to actually measure whether your graph-based retriever is doing the right thing

If you are building a GraphRAG system, retrieval is where most of the real intelligence lives.

Generation gets all the attention. Retrieval decides whether generation even has a chance to be correct.

This post focuses on one narrow but critical question.
How do you evaluate the retrieval layer of a GraphRAG system in a way that reflects how graph traversal actually works.

Not generic vector search.
Not surface-level recall metrics.
But evaluation that tells you whether the graph was traversed correctly across repositories, versions, and relationships.


Why retrieval evaluation is harder in GraphRAG than normal RAG

Let’s start simple.

In a classic RAG setup, retrieval usually means embedding the query, searching a vector index, and returning the top K chunks.

Evaluation is mostly about ranking quality.

GraphRAG changes the shape of the problem.

Now retrieval involves:

Starting nodes inferred from the query.
Graph traversal across relationships.
Multiple repositories and files.
Version and branch constraints.
And finally, a set of documents produced by those paths.

So when retrieval fails, you need to know why.

If your metrics cannot tell you that, they are not useful.


The retrieval evaluation pipeline in practice

At a high level, retrieval evaluation for GraphRAG has three parallel tracks.

First, you evaluate what was retrieved.
Second, you evaluate how the graph was traversed.
Third, you aggregate those signals so you can compare changes over time.

Conceptually, the pipeline looks like this.

Query goes into the GraphRAG retriever.
The retriever produces documents and traversal logs.
Retrieved documents are compared against gold labels.
Traversal logs are compared against expected graph paths.
Metrics are aggregated across queries and query types.

Sounds straightforward. The details are where things get tricky.


Step one: defining a gold standard that actually works

Before metrics, you need labels that make sense for graphs.

A flat list of correct documents is not enough.

For each query, you usually want to capture three things:

The documents that are essential to answering it.
The documents that are merely helpful.
The repositories that must be reached.

This lets you grade retrieval instead of treating it as binary.

Minimal structure that works well in practice:

    from dataclasses import dataclass
    from typing import Set

    @dataclass
    class GoldStandard:
        query_id: str
        essential_docs: Set[str]
        helpful_docs: Set[str]
        essential_repos: Set[str]

This alone already improves evaluation quality compared to most setups.


Measuring ranking quality with graded relevance

What it is

Normalized Discounted Cumulative Gain measures how well ranked results align with relevance.

Why it exists

Retrieving the right document at position one is far more valuable than at position ten.

How it works technically

Each document gets a relevance score. Relevance is discounted as rank increases using a log scale. The result is normalized against an ideal ranking.

In GraphRAG, relevance is rarely binary. Some documents are critical. Others are just helpful.

That maps cleanly to graded relevance.

    import math

    def calculate_ndcg(retrieved, essential, helpful, k=10):
        def relevance(doc):
            if doc in essential:
                return 2
            if doc in helpful:
                return 1
            return 0

        # Discount relevance by log2 of rank, then normalize by the ideal ranking.
        dcg = sum(relevance(d) / math.log2(i + 2) for i, d in enumerate(retrieved[:k]))
        ideal = [2] * len(essential) + [1] * len(helpful)
        idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
        return dcg / idcg if idcg > 0 else 0.0

Real-world analogy

Think of searching a codebase for a bug.

The file containing the bug is essential.
The config file explaining behavior is helpful.
A random utility file is noise.

NDCG captures that difference.


Recall still matters, but only for essentials

Recall answers a simple question.
Did we retrieve what we absolutely needed.

For GraphRAG, recall should usually focus only on essential documents.

Why.
Because retrieving ten helpful files does not compensate for missing the one file that defines the behavior.

    def calculate_recall(retrieved, essential, k=20):
        if not essential:
            return 1.0  # nothing essential to miss
        retrieved_set = set(retrieved[:k])
        return len(retrieved_set & essential) / len(essential)

This metric catches silent failures that NDCG sometimes hides.


MRR tells you how fast the graph found the truth

Mean Reciprocal Rank answers another important question.

How early did the system encounter something essential.

In graph traversal, this often correlates with path quality.

If the first essential document appears late, traversal likely wandered.

    def calculate_mrr(retrieved, essential):
        for i, doc in enumerate(retrieved):
            if doc in essential:
                return 1.0 / (i + 1)
        return 0.0

Sounds simple. It is surprisingly diagnostic.


Cross-repository precision is GraphRAG-specific

Here is where GraphRAG diverges from normal RAG.

Sometimes the exact file does not matter yet.
Reaching the correct repository does.

What it is

The fraction of unique retrieved repositories that are actually relevant.

Why it exists

Wrong-repo traversal is one of the most common GraphRAG failure modes.

How it works

Deduplicate repositories in rank order, take the first K, and check how many are essential.

    def calculate_cross_repo_precision(retrieved_repos, essential_repos, k=5):
        # Deduplicate repositories in rank order, keeping the first k.
        unique = []
        for r in retrieved_repos:
            if r not in unique:
                unique.append(r)
            if len(unique) >= k:
                break
        if not unique:
            return 0.0  # nothing was retrieved
        return sum(1 for r in unique if r in essential_repos) / len(unique)

If this drops, your graph routing logic is probably off.


Logging and evaluating graph traversal itself

Retrieval outputs are only half the story.

GraphRAG systems should log traversal paths.

Not just final results. The actual edges and nodes visited.

Minimal traversal log:

    from dataclasses import dataclass
    from typing import List, Set

    @dataclass
    class GraphTraversalLog:
        query_id: str
        start_nodes: List[str]
        traversed_edges: List[tuple]
        final_nodes: List[str]

Gold traversal annotation:

    @dataclass
    class GoldGraphPath:
        query_id: str
        expected_edges: List[tuple]
        expected_nodes: Set[str]

Now you can measure things that normal RAG never sees.


Edge recall tells you if the graph walked the right path

What it is

Fraction of required edges that were actually traversed.

Why it exists

In cross-repo retrieval, the path often matters more than the endpoint.

Low edge recall means embeddings are not the problem. Graph logic is.
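A minimal sketch of the metric, assuming edges are stored as (source, target) tuples as in the traversal log structures above:

```python
def calculate_edge_recall(traversed_edges, expected_edges):
    """Fraction of required graph edges the retriever actually walked."""
    if not expected_edges:
        return 1.0  # no path was required
    traversed = set(traversed_edges)
    hit = sum(1 for edge in expected_edges if edge in traversed)
    return hit / len(expected_edges)
```

Comparing sets rather than sequences is a deliberate simplification: it checks that the required edges were covered, not the order in which they were visited.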


Node precision exposes noisy traversal

Node precision answers a painful question.

How much junk did the graph pull in along the way.

Low precision usually means overly permissive traversal rules, missing depth limits, or poorly weighted relationships.
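A sketch of the check, using the visited and expected node sets from the log structures above:

```python
def calculate_node_precision(visited_nodes, expected_nodes):
    """Fraction of visited graph nodes that belong to the expected path."""
    if not visited_nodes:
        return 0.0
    visited = set(visited_nodes)
    return len(visited & set(expected_nodes)) / len(visited)
```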


Version coherence prevents subtle correctness bugs

GraphRAG systems often mix branches or commits if you are not careful.

Version coherence checks whether all retrieved nodes belong to consistent versions.

If the same repository appears with conflicting versions, the score collapses to zero.

This catches bugs that only show up in production.
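One way to sketch the check, assuming each retrieved node is tagged with a (repository, version) pair; the tagging scheme is an assumption, not a fixed API:

```python
def calculate_version_coherence(retrieved_nodes):
    """Return 1.0 when every repository appears at exactly one version,
    0.0 as soon as any repository mixes versions."""
    versions_by_repo = {}
    for repo, version in retrieved_nodes:
        versions_by_repo.setdefault(repo, set()).add(version)
    if any(len(versions) > 1 for versions in versions_by_repo.values()):
        return 0.0
    return 1.0
```

The all-or-nothing score is intentional: mixed versions are a correctness bug, not a ranking problem, so partial credit would only hide it.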


Putting everything together in an evaluation harness

At this point, you have many metrics. You need orchestration.

A typical evaluator runs retrieval with logging, computes standard metrics, computes graph-specific metrics, and aggregates results by query type.

The key idea.

Always evaluate per query type, not just overall averages.

Cross-repo queries behave very differently from single-repo ones.
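A minimal sketch of such a harness. The `retrieve` callable, the gold-label mapping, and the query dictionaries are all assumptions about your setup; the point is the per-query-type aggregation:

```python
from collections import defaultdict
from statistics import mean

def evaluate(queries, retrieve, gold_by_id, metric_fns):
    """Aggregate retrieval metrics per query type.

    `retrieve(query)` returns ranked documents; `gold_by_id` maps query
    ids to gold labels; `metric_fns` maps metric names to functions of
    (retrieved, gold). All three are hypothetical interfaces.
    """
    by_type = defaultdict(lambda: defaultdict(list))
    for query in queries:
        retrieved = retrieve(query)
        gold = gold_by_id[query["id"]]
        for name, fn in metric_fns.items():
            by_type[query["type"]][name].append(fn(retrieved, gold))
    # Average each metric within each query type, never only overall.
    return {
        qtype: {name: mean(scores) for name, scores in metrics.items()}
        for qtype, metrics in by_type.items()
    }
```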

Your final output should answer questions like:

Did edge recall improve for dependency-chain queries.
Did cross-repo precision regress after graph changes.
Did version coherence break after a refactor.

That is how retrieval evaluation becomes actionable instead of decorative.


What good results actually look like

Healthy GraphRAG systems usually show:

High NDCG and near-complete recall of essential documents.
Early first essential hits, reflected in MRR.
High cross-repository precision.
High edge recall and node precision in traversal logs.
Consistent version coherence across retrieved nodes.

When one of these drops, you know exactly where to look.

And that is the real goal.

Not a single number.
But a clear signal that tells you whether your graph is doing what you think it is doing.

If you are serious about GraphRAG quality, this layer deserves as much attention as generation.