End-to-End System Evaluation: The Stress Test of GraphRAG

# End to End Evaluation for GraphRAG Systems That Solve Cross Repository Problems

This post shows how to run end to end evaluation for a GraphRAG system and how to read the results so you can improve the system. If your work involves resolving cross repository questions across many repos, this is for you. I will use plain language, code you can copy, and actionable insights you can apply right away.

Here is the short story first. A high score on an isolated benchmark does not mean your system can find the right file across 25 repositories and then synthesize code that actually compiles. The real work is making sure retrieval, traversal, and generation all play nicely together when queries demand context from multiple repositories. That is the cross repository problem and it is the core keyword to optimize for in this post.

## Why end to end evaluation matters for cross repository systems

**What it is** - A full end to end evaluation measures the system from user query to final answer. That means retrieval, any graph traversal that connects repositories, reranking, and generation all get evaluated as a single flow.

**Why it exists** - If retrieval misses a key file, the generator will invent plausible but wrong code. If the retriever is good but traversal picks poor neighbors, answers will reference the wrong API. End to end evaluation shows how these failures compound.

**How it works** - You run the pipeline on a representative dataset. For each test case you store the oracle solution or the expected code paths. The pipeline runs retrieval metrics like Hit Rate and NDCG then runs generation checks such as groundedness and code execution pass rates. Finally you correlate errors across layers.

**Real world example** - Imagine a bug fix that needs a database schema in repo A, an ORM helper in repo B, and a migration script in repo C. A successful end to end test must retrieve and surface those three anchors and then produce a change that compiles against those APIs.

Here is the tricky part.
People often treat retrieval metrics and generation metrics separately. That is useful. But if you do not measure them together, you will not know whether a poor end to end result is a retrieval problem or a generation problem.

## Orchestration code you can use

Below is an end to end evaluation orchestrator in Python. It is designed to plug into your existing GraphRAG system once you supply the two evaluator classes. The code runs retrieval and generation evaluations, then produces a combined report, and supports an ablation study to measure component contributions.

```python
from typing import List, Dict
from dataclasses import dataclass
from datetime import datetime


# Minimal type structures for clarity
@dataclass
class EvaluationReport:
    retrieval: Dict
    generation: Dict
    metadata: Dict


@dataclass
class Insight:
    severity: str
    component: str
    finding: str
    recommendation: str


class GraphRAGEvaluationPipeline:
    """
    Orchestrate complete evaluation across all layers.
    """

    def __init__(self, graphrag_system, evaluation_dataset: List[Dict], config: Dict):
        self.system = graphrag_system
        self.dataset = evaluation_dataset
        self.config = config

        # Initialize evaluators that you must implement or inject
        self.retrieval_evaluator = GraphRAGRetrievalEvaluator(
            graphrag_retriever=graphrag_system.retriever,
            gold_dataset=evaluation_dataset,
            k_values=config.get("k_values", [5, 10, 20]),
        )
        self.generation_evaluator = GraphRAGGenerationEvaluator(
            graphrag_system=graphrag_system,
            gold_dataset=evaluation_dataset,
            repository_mocks=config.get("repository_mocks", {}),
        )

    def run_full_evaluation(self) -> EvaluationReport:
        """Run complete evaluation pipeline."""
        print("Evaluating retrieval layer...")
        retrieval_results = self.retrieval_evaluator.evaluate_all()

        print("Evaluating generation layer...")
        generation_results = self.generation_evaluator.evaluate_all()

        return EvaluationReport(
            retrieval=retrieval_results,
            generation=generation_results,
            # Assumed metadata fields: run timestamp, dataset size, and config
            metadata={
                "timestamp": datetime.now().isoformat(),
                "dataset_size": len(self.dataset),
                "config": self.config,
            },
        )

    def run_ablation_study(self) -> Dict[str, EvaluationReport]:
        """
        Compare system variants to identify component contributions.
        """
        # Variant systems keyed by name. Assumption: ablated variants are
        # supplied via config["ablation_variants"]; the default is the full system only.
        variants = self.config.get("ablation_variants", {"full_system": self.system})
        results = {}

        for name, variant in variants.items():
            print(f"Evaluating variant: {name}")
            pipeline = GraphRAGEvaluationPipeline(variant, self.dataset, self.config)
            results[name] = pipeline.run_full_evaluation()

        return results
```

Notes on wiring this into your infra

* The retrieval evaluator should return both aggregate metrics and per query type metrics
* The generation evaluator must be able to run code checks such as unit tests or lightweight execution
* `repository_mocks` is the place to provide a reproducible code environment for generation tests

## How to interpret the results

I will cover the key metrics and then give a short recipe for turning insights into work items.

### NDCG explained

**What it is** - Normalized Discounted Cumulative Gain, or NDCG, measures ranking quality. It rewards putting highly relevant documents near the top.

**Why it matters for cross repository** - When queries require multi hop context, the most relevant file mig
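
To make the NDCG definition above concrete, here is a minimal sketch of how NDCG@k could be computed for a single query. It assumes graded relevance labels per file and uses the linear-gain form of DCG (some implementations use an exponential gain instead). The function names and the file paths in the example are hypothetical and are not part of the pipeline above.

```python
import math
from typing import Dict, List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k graded relevance scores."""
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids: List[str], gold_relevance: Dict[str, float], k: int) -> float:
    """NDCG@k: DCG of the returned ranking divided by DCG of the ideal ranking."""
    # Files missing from the gold labels count as irrelevant (gain 0).
    gains = [gold_relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal = sorted(gold_relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # No relevant files labeled for this query.
    return dcg_at_k(gains, k) / ideal_dcg


# Hypothetical gold labels for a query that spans three repositories.
gold = {
    "repo_a/schema.sql": 3,
    "repo_b/orm_helper.py": 2,
    "repo_c/migration.py": 1,
}

# The same files, ranked well and ranked with the key file pushed down.
good_ranking = ["repo_a/schema.sql", "repo_b/orm_helper.py", "repo_c/migration.py"]
poor_ranking = ["repo_d/readme.md", "repo_c/migration.py", "repo_a/schema.sql"]

good_score = ndcg_at_k(good_ranking, gold, k=3)
poor_score = ndcg_at_k(poor_ranking, gold, k=3)
```

In this sketch the well-ordered ranking scores higher than the one that buries the schema file at position three, which is the behavior the definition describes: the same relevant files earn less credit the further down they appear.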
