# Designing a Three-Layer Evaluation Framework for Cross-Repository GraphRAG

Standard evaluation approaches treat retrieval-augmented generation as a monolithic system: query goes in, answer comes out, measure quality. This works poorly for GraphRAG systems operating across multiple repositories, where failures can occur at distinct stages with very different root causes. A retrieval failure looks different from a reasoning failure, which looks different from a generation failure. Conflating them makes debugging nearly impossible and optimization directionless.

This post introduces a three-layer evaluation architecture that separates these concerns:

1. **Retrieval Layer**: Can the system find the right files and chunks across repositories?
2. **Reasoning Layer**: Can the system correctly combine information from multiple sources?
3. **Generation Layer**: Is the final output correct and grounded in retrieved context?

Each layer has distinct metrics, target thresholds, and evaluation methodologies. Together, they provide the diagnostic granularity you need to systematically improve a cross-repository GraphRAG system.

---

## Why Three Layers?

Consider a failure case: the system produces an incorrect explanation of your authentication flow. Without layer separation, you face a debugging maze:

- Did retrieval miss the relevant auth service files?
- Did retrieval find them, but reasoning failed to connect frontend and backend components?
- Did reasoning succeed, but generation hallucinated implementation details?
Each root cause requires a different intervention:

| Failure Layer | Symptom | Intervention |
|---------------|---------|--------------|
| Retrieval | Relevant files never reached the context | Improve embeddings, graph construction, or query expansion |
| Reasoning | Files retrieved but relationships misunderstood | Improve context ordering, add explicit relationship prompts |
| Generation | Context correct but output fabricated details | Adjust generation prompts, add grounding constraints |

A three-layer framework surfaces which layer failed, enabling targeted improvement rather than blind experimentation.

---

## Layer 1: Retrieval Evaluation

### What You're Measuring

Can the GraphRAG system find the correct files and chunks across 25 repositories given a natural language query?

This is the foundation. If retrieval fails, no amount of reasoning or generation sophistication will save you. The system cannot synthesize information it never retrieved.

### Core Metrics

| Metric | Definition | Target for Production |
|--------|------------|----------------------|
| NDCG@k | Normalized discounted cumulative gain at k documents | > 0.75 for k=10 |
| Recall@k | Fraction of relevant documents in top k | > 0.85 for k=20 |
| MRR | Mean reciprocal rank of first relevant document | > 0.6 |
| Cross-Repo Precision | Fraction of retrieved repos that contain relevant docs | > 0.7 |
| Relationship Traversal Accuracy | Did we follow the correct graph edges? | > 0.8 |

**NDCG@k** captures ranking quality: are the most relevant documents at the top? For cross-repository queries, this matters because context windows are limited: you need the best files first.

**Recall@k** measures coverage: did we find all the relevant files within the top k? A system with high precision but low recall will miss critical context.

**MRR** focuses on the first relevant result. For many queries, finding one highly relevant file quickly is more important than perfect ranking of the full list.
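The three ranking metrics described above can be computed per query from a ranked list of document IDs and a set of annotated relevant documents. The following is a minimal sketch using the standard textbook definitions, not the post's own implementation; the function names, the `gains` mapping of graded relevance, and the example file paths are all illustrative assumptions.

```python
import math
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(per_query_rr: List[float]) -> float:
    """MRR is the average reciprocal rank over all evaluated queries."""
    return sum(per_query_rr) / len(per_query_rr) if per_query_rr else 0.0


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """DCG of the top k results divided by the DCG of an ideal ordering."""
    def dcg(values: List[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(v / math.log2(i + 2) for i, v in enumerate(values))

    actual = dcg([gains.get(doc, 0.0) for doc in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Hypothetical query result: file paths are made up for illustration.
ranked = [
    "auth-service/login.py",
    "billing-service/invoice.py",
    "user-frontend/LoginForm.tsx",
]
relevant = {"auth-service/login.py", "user-frontend/LoginForm.tsx"}
gains = {doc: 1.0 for doc in relevant}  # binary relevance as graded gains
```

With binary gains, `ndcg_at_k` penalizes the relevant frontend file for sitting at rank 3 instead of rank 2, while `recall_at_k(ranked, relevant, 2)` only sees that one of the two relevant files made the cut.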
**Cross-Repo Precision** is specific to multi-repository settings. If a query requires files from `auth-service` and `user-frontend`, but retrieval returns files from `billing-service` and `analytics`, you have a repository-level targeting problem distinct from file-level ranking.

### GraphRAG-Specific: Relationship Traversal Accuracy

Standard retrieval metrics assume a flat document collection. Your GraphRAG system uses a knowledge graph, so you need metrics that evaluate graph traversal quality.

Consider a query: "How does user authentication flow from the frontend to the database?"

The correct retrieval path might traverse:

```
[frontend/auth/LoginForm.tsx]
    --calls--> [backend/api/auth.py]
    --imports--> [backend/services/user_service.py]
    --queries--> [database/schemas/users.sql]
```

Did your system follow this path, or did it jump to unrelated nodes?

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (source node, target node)


def count_matching_edges(gold_path: List[str], retrieved_edges: Set[Edge]) -> int:
    """Count the edges of gold_path that the system also traversed."""
    return sum(1 for edge in zip(gold_path, gold_path[1:]) if edge in retrieved_edges)


def graph_traversal_precision(
    retrieved_paths: List[List[str]],  # Paths through the graph
    gold_paths: List[List[str]],       # Annotated correct paths
) -> float:
    """
    Measure whether the system traversed correct relationships.

    Example gold path:
    [user_service/models.py] --imports--> [shared/types.py]
    --defines--> [User type] --referenced_by--> [frontend/api.ts]
    """
    # Pool every traversed edge so a gold edge is counted once,
    # no matter how many retrieved paths contain it.
    retrieved_edges: Set[Edge] = set()
    for retrieved_path in retrieved_paths:
        retrieved_edges.update(zip(retrieved_path, retrieved_path[1:]))

    correct_edges = 0
    total_edges = sum(max(len(p) - 1, 0) for p in gold_paths)
    for gold_path in gold_paths:
        correct_edges += count_matching_edges(gold_path, retrieved_edges)

    # Assumed completion (the source listing breaks off at the return):
    # the fraction of gold edges that the system traversed.
    return correct_edges / total_edges if total_edges else 0.0
```

## Layer 2: Reasoning Evaluation

### Core Metrics

| Metric | Definition | Target |
|--------|------------|--------|
| Context Coherence Score | Are retrieved chunks from consistent versions/branches? | > 0.9 |
| Information Integration | Does the response synthesize multiple sources? | > 0.7 |
| Conflict Resolution | When sources disagree, is the resolution correct? | > 0.6 |
| Completeness | Are all relevant aspects of the query addressed? | > 0.75 |
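The table defines Context Coherence Score only as a question about version and branch consistency, so any formula is a guess at the intended scoring. One possible reading, sketched below, is the fraction of retrieved chunks that come from the most common ref of their own repository. The `RetrievedChunk` type, its fields, and the majority-ref rule are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RetrievedChunk:
    """Hypothetical record for one retrieved chunk."""
    repo: str  # repository name
    path: str  # file path inside the repository
    ref: str   # branch name or commit SHA the chunk was indexed from


def context_coherence(chunks: List[RetrievedChunk]) -> float:
    """Fraction of chunks indexed from the most common ref of their repo."""
    if not chunks:
        return 0.0  # assumption: an empty context has nothing to score
    refs_by_repo: Dict[str, Counter] = defaultdict(Counter)
    for chunk in chunks:
        refs_by_repo[chunk.repo][chunk.ref] += 1
    # Within each repo, chunks on the majority ref count as consistent.
    consistent = sum(max(refs.values()) for refs in refs_by_repo.values())
    return consistent / len(chunks)


# Illustrative context: one auth-service chunk comes from a stale branch.
chunks = [
    RetrievedChunk("auth-service", "api/auth.py", "main"),
    RetrievedChunk("auth-service", "services/user_service.py", "main"),
    RetrievedChunk("auth-service", "models/user.py", "feature-x"),
    RetrievedChunk("user-frontend", "auth/LoginForm.tsx", "main"),
]
score = context_coherence(chunks)
```

Scoring per repository rather than globally is a deliberate choice in this sketch: different repositories legitimately use different default branches, so only mixed refs within one repository are treated as incoherent.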

## Layer 3: Generation Evaluation

### Core Metrics

| Metric | Definition | Target |
|--------|------------|--------|
| Factual Correctness | Does the code/explanation work? | > 0.85 |
| Groundedness | Is every claim traceable to retrieved context? | > 0.9 |
| Hallucination Rate | Fraction of claims not supported by context | < 0.1 |
| Repository Attribution | Are sources correctly cited? | > 0.95 |
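Given claim-level judgments (from human annotators or an automated judge), the claim-based metrics in this table reduce to simple ratios. The sketch below assumes a hypothetical `Claim` record and treats groundedness as the fraction of supported claims, which makes it the complement of the hallucination rate. That reading is an assumption, though it is consistent with the paired targets of > 0.9 and < 0.1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    """Hypothetical record for one claim extracted from a response."""
    text: str
    supported: bool                    # judged traceable to retrieved context
    cited_repo: Optional[str] = None   # repository the response cites
    source_repo: Optional[str] = None  # repository that actually supports it


def hallucination_rate(claims: List[Claim]) -> float:
    """Fraction of claims not supported by the retrieved context."""
    if not claims:
        return 0.0  # assumption: no claims means nothing was hallucinated
    return sum(1 for c in claims if not c.supported) / len(claims)


def groundedness(claims: List[Claim]) -> float:
    """Fraction of claims traceable to context (1 - hallucination rate)."""
    return 1.0 - hallucination_rate(claims)


def attribution_accuracy(claims: List[Claim]) -> float:
    """Among claims that carry a citation, fraction citing the right repo."""
    cited = [c for c in claims if c.cited_repo is not None]
    if not cited:
        return 0.0
    return sum(1 for c in cited if c.cited_repo == c.source_repo) / len(cited)


# Illustrative judgments for a four-claim response.
claims = [
    Claim("Login posts to /auth", True, "auth-service", "auth-service"),
    Claim("Tokens are JWTs", True, "auth-service", "auth-service"),
    Claim("The form validates email", True, "auth-service", "user-frontend"),
    Claim("Sessions expire after 5 minutes", False),
]
```

Here one of four claims is unsupported, and one of the three cited claims names the wrong repository, so the response would miss all three targets in the table.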

## Target Thresholds Summary

| Layer | Metric | Target | Critical Threshold |
|-------|--------|--------|--------------------|
| Retrieval | NDCG@10 | > 0.75 | < 0.5 = severe |
| | Recall@20 | > 0.85 | < 0.6 = severe |
| | MRR | > 0.6 | < 0.3 = severe |
| | Cross-Repo Precision | > 0.7 | < 0.5 = severe |
| | Graph Traversal Accuracy | > 0.8 | < 0.5 = severe |
| Reasoning | Context Coherence | > 0.9 | < 0.7 = severe |
| | Information Integration | > 0.7 | < 0.5 = severe |
| | Completeness | > 0.75 | < 0.5 = severe |
| Generation | Factual Correctness | > 0.85 | < 0.7 = severe |
| | Groundedness | > 0.9 | < 0.8 = severe |
| | Hallucination Rate | < 0.1 | > 0.2 = severe |
| | Attribution Accuracy | > 0.95 | < 0.85 = severe |
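The summary table can be encoded directly as data and used to triage an evaluation run. The sketch below is one way to do that: the threshold values are copied from the table, while the `Threshold` type, the `classify` labels, and the rule of reporting the earliest failing layer are assumptions. The ordering rule follows the earlier point that downstream layers cannot recover from a retrieval failure.

```python
from typing import Dict, NamedTuple, Optional


class Threshold(NamedTuple):
    layer: str
    target: float
    severe: float
    higher_is_better: bool = True


# Values copied from the summary table.
THRESHOLDS: Dict[str, Threshold] = {
    "NDCG@10": Threshold("Retrieval", 0.75, 0.5),
    "Recall@20": Threshold("Retrieval", 0.85, 0.6),
    "MRR": Threshold("Retrieval", 0.6, 0.3),
    "Cross-Repo Precision": Threshold("Retrieval", 0.7, 0.5),
    "Graph Traversal Accuracy": Threshold("Retrieval", 0.8, 0.5),
    "Context Coherence": Threshold("Reasoning", 0.9, 0.7),
    "Information Integration": Threshold("Reasoning", 0.7, 0.5),
    "Completeness": Threshold("Reasoning", 0.75, 0.5),
    "Factual Correctness": Threshold("Generation", 0.85, 0.7),
    "Groundedness": Threshold("Generation", 0.9, 0.8),
    "Hallucination Rate": Threshold("Generation", 0.1, 0.2, higher_is_better=False),
    "Attribution Accuracy": Threshold("Generation", 0.95, 0.85),
}

LAYER_ORDER = ["Retrieval", "Reasoning", "Generation"]


def classify(metric: str, value: float) -> str:
    """Return 'pass', 'below target', or 'severe' for one metric value."""
    t = THRESHOLDS[metric]
    # Flip the sign for lower-is-better metrics so one comparison covers both.
    sign = 1 if t.higher_is_better else -1
    if sign * value > sign * t.target:
        return "pass"
    if sign * value < sign * t.severe:
        return "severe"
    return "below target"


def first_failing_layer(scores: Dict[str, float]) -> Optional[str]:
    """Earliest pipeline layer with any metric that does not pass."""
    failing = {
        THRESHOLDS[m].layer for m, v in scores.items() if classify(m, v) != "pass"
    }
    for layer in LAYER_ORDER:
        if layer in failing:
            return layer
    return None
```

For example, `first_failing_layer({"NDCG@10": 0.8, "Context Coherence": 0.65, "Groundedness": 0.7})` points at the Reasoning layer even though Groundedness is also severe, which suggests fixing context coherence before tuning generation prompts.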
