Before creating evaluation datasets for a GraphRAG system, you must understand your codebase topology. This post walks through building repository dependency graphs, classifying repos by role, mining real developer questions, and identifying high-priority code regions that stress cross-repository retrieval.

Before you can evaluate a GraphRAG system on your code, you need to understand your codebase as a system.
If you’re working with 20–30 repositories, each owned by different teams and evolving at different speeds, “the codebase” is really an ecosystem: services, frontends, connectors, schemas, security layers, and infrastructure all woven together.
This post walks through how to map that ecosystem so you can design meaningful evaluation datasets for your cross-repository assistant.
Most evaluation failures trace back to one root cause: the evaluation dataset doesn’t reflect how your system is actually used.
If you randomly sample files or pick a few issues from one repository, you’ll get a dataset of shallow, single-repository questions that never exercise the hard cases.
But in reality, your developers ask things like “How is user authentication implemented across the system?” or “If I change the User schema, what services are affected?”
Answering those questions requires cross-repository retrieval, version awareness, and graph traversal. To build an evaluation set that reflects this, you first need a map: a dependency graph of your organizational codebase and the queries developers actually care about.
Start by extracting how your repositories depend on each other. At a minimum, you want to capture direct code imports, HTTP/RPC calls to other services, database schema references, and shared configuration keys.
Conceptually, you’re building a graph where nodes are repositories (or files within them) and edges are these dependency relationships.
Here’s a simplified pseudocode sketch for Python repos:
```python
from pathlib import Path
import ast

def extract_cross_repo_dependencies(repo_root: Path) -> dict:
    """
    Extract imports, API calls, and configuration references
    that cross repository boundaries.
    """
    dependencies = {
        "imports": [],             # Direct code imports
        "api_calls": [],           # HTTP/RPC calls to other services
        "schema_references": [],   # Database table/column references
        "config_references": [],   # Shared configuration keys
    }
    for py_file in repo_root.rglob("*.py"):
        tree = ast.parse(py_file.read_text())
        for node in ast.walk(tree):
            # Extract import statements
            if isinstance(node, ast.Import):
                dependencies["imports"].extend(a.name for a in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                dependencies["imports"].append(node.module)
            # A full pass would also identify API client instantiations,
            # find SQL query strings, and locate config key accesses.
    return dependencies
```

In a full implementation, you would:
- Detect API client usage (e.g., requests, httpx, gRPC stubs) and extract target service names/URLs.
- Locate config key accesses (e.g., settings["PAYMENTS_API_URL"]).

Normalize these into a graph format (e.g., JSON, Neo4j, NetworkX) so you can query:
- “Which repos depend on user-service?”
- “Which services touch the payments database?”
- “Which repos import auth-middleware?”

Once you can see edges, group repos by the role they play. For a typical 25-repository setup, you might see something like:
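As a minimal sketch of that query layer, the edge list can live in plain Python dicts (NetworkX or Neo4j serve the same purpose at scale); all repo names below are illustrative:

```python
from collections import defaultdict

# Edge list: (source_repo, target_repo, edge_type).
# Repo names are illustrative placeholders.
edges = [
    ("billing-service", "user-service", "api_call"),
    ("billing-service", "auth-middleware", "import"),
    ("payments-service", "payments-db", "schema_reference"),
    ("web-frontend", "billing-service", "api_call"),
]

# Reverse adjacency: target -> set of repos that depend on it.
dependents = defaultdict(set)
for src, dst, _ in edges:
    dependents[dst].add(src)

def who_depends_on(repo: str) -> set:
    """Answer 'which repos depend on X?' from the edge list."""
    return dependents.get(repo, set())

print(who_depends_on("user-service"))  # {'billing-service'}
```

Keeping the edge type on each edge means the same structure later answers schema- and config-specific questions, not just import questions.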
| Repository Type | Count | Cross-Repo Dependency Pattern |
|---|---|---|
| Backend Services | ~8 | Import connectors, call other backends, reference SQL schemas |
| Frontend Apps | ~4 | Call backend APIs, share component libraries |
| Connectors/SDKs | ~5 | Imported by backends and frontends |
| SQL/Migrations | ~3 | Referenced by backends, define data contracts |
| Security/Auth | ~2 | Wrapped around or injected into all service calls |
| Infrastructure | ~3 | Configure deployment, routing, and environment for all of the above |
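One low-effort way to bootstrap this classification is a marker-file heuristic. Every marker below is an assumption about typical project layouts, not a fixed rule, and would need tuning for your organization:

```python
from pathlib import Path

# Heuristic markers per role -- illustrative assumptions, tune for your org.
ROLE_MARKERS = {
    "frontend": ["package.json", "src/App.tsx"],
    "sql_migrations": ["migrations", "alembic.ini"],
    "infrastructure": ["Dockerfile", "terraform"],
    "backend": ["requirements.txt", "pyproject.toml"],
}

def classify_repo(repo_root: Path) -> str:
    """Guess a repo's role from marker files; first matching role wins."""
    for role, markers in ROLE_MARKERS.items():
        if any((repo_root / m).exists() for m in markers):
            return role
    return "unknown"
```

A heuristic like this gets you a first-pass grouping in minutes; repos it labels "unknown" are exactly the ones worth classifying by hand.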
This classification is more than bookkeeping: the roles will later drive your query types and evaluation focus areas.
Next, you need to understand what developers actually ask when they’re stuck.
Useful sources:

- Support and incident Slack channels such as #help-backend, #help-frontend, #incidents, and #oncall

For each question you collect, note where it was asked, which repositories the answer touched, and what kind of answer resolved it.
This corpus of real questions is the raw material for your evaluation query taxonomy.
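To keep that corpus usable later, it helps to store each mined question in a small structured record; the field names and values here are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class DeveloperQuestion:
    """One mined question; fields are illustrative, extend as needed."""
    text: str                                           # The question as asked
    source: str                                         # e.g. Slack channel, PR thread
    repos_involved: list = field(default_factory=list)  # Repos needed to answer
    answer_type: str = ""                               # e.g. "code pointer", "impact list"

# Example record (hypothetical question and repo names):
q = DeveloperQuestion(
    text="Which services call the payments API?",
    source="#help-backend",
    repos_involved=["billing-service", "web-frontend"],
    answer_type="impact list",
)
```

Recording `repos_involved` per question is what later lets you check whether your evaluation set actually exercises cross-repository retrieval.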
From those real questions and your dependency graph, you can define a few core query types that your GraphRAG system must handle. Here’s a practical taxonomy:
“How does the payment processing work in the billing service?”
Characteristics:
Use cases:
“How is user authentication implemented across the system?”
Characteristics:
- Centers on a concept (e.g., User, Auth, Session) that appears in many repos
- Requires entity resolution: Customer, Account, and User may be related

Use cases:
“If I change the User schema, what services are affected?”
Characteristics:
Use cases:
“Why did we change the payment flow in Q3?”
Characteristics:
Use cases:
These four types should cover a large portion of the high-value questions developers ask in a 25-repo environment.
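When tagging mined questions with these types, a small enum keeps labels consistent. The keyword labeler below is a toy stand-in (a real pipeline would label by hand or with an LLM), reusing the example queries above:

```python
from enum import Enum

class QueryType(Enum):
    SINGLE_REPO = "single_repo"           # "How does payment processing work in billing?"
    CROSS_REPO_CONCEPTUAL = "cross_repo"  # "How is auth implemented across the system?"
    IMPACT_ANALYSIS = "impact"            # "If I change the User schema, what is affected?"
    HISTORICAL = "historical"             # "Why did we change the payment flow in Q3?"

def label(question: str) -> QueryType:
    """Toy keyword labeler -- for illustration only, not production logic."""
    q = question.lower()
    if "why" in q or "q3" in q:
        return QueryType.HISTORICAL
    if "what services are affected" in q or "change the" in q:
        return QueryType.IMPACT_ANALYSIS
    if "across the system" in q:
        return QueryType.CROSS_REPO_CONCEPTUAL
    return QueryType.SINGLE_REPO
```

Even a crude labeler like this is useful for a first audit: it shows immediately whether your mined corpus is dominated by one query type.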
Not all code is equally important to evaluate. Some files are essentially “leaf” nodes; others are central arteries.
Focus your evaluation on three kinds of hotspots:
These are files or modules that many others depend on: shared connectors, auth wrappers, and core data models.
Why they matter: questions about these files require traversing many incoming edges, which is exactly what cross-repository retrieval must get right.
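A quick way to surface such hubs, assuming you already have a file-level import edge list like the one extracted earlier, is to count incoming edges and keep the top of the distribution. Module names are illustrative:

```python
from collections import Counter

# Edge list: (importer, imported). Module names are illustrative.
import_edges = [
    ("billing/api.py", "shared/auth_client.py"),
    ("payments/api.py", "shared/auth_client.py"),
    ("web/bff.py", "shared/auth_client.py"),
    ("billing/api.py", "billing/models.py"),
]

# In-degree: how many files depend on each module.
in_degree = Counter(dst for _, dst in import_edges)

# Hubs: modules with at least 2 dependents (threshold is a judgment call).
hubs = [m for m, n in in_degree.most_common() if n >= 2]
print(hubs)  # ['shared/auth_client.py']
```

In-degree is the simplest centrality measure; on a larger graph you might swap in betweenness or PageRank, but raw dependent counts are usually enough to pick evaluation hotspots.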
Look at your Git history over the last six months: the files with the most commits are where developers are actively working and actively asking questions.
Why they matter: evaluation questions about actively changing code test whether the system stays version-aware instead of answering from stale context.
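The churn scan can be sketched with a plain `git log` call. This is a sketch that assumes it runs against a local clone; the log parsing is split out so it works on any captured log output:

```python
import subprocess
from collections import Counter

def churn_from_log(log_output: str, top: int = 20):
    """Count how often each file appears in `git log --name-only` output."""
    counts = Counter(line.strip() for line in log_output.splitlines() if line.strip())
    return counts.most_common(top)

def high_churn_files(repo_path: str, since: str = "6 months ago", top: int = 20):
    """Rank files by number of commits touching them since `since`."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}",
         "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    return churn_from_log(out, top)
```

Adding `--no-merges` or filtering by author count are natural refinements, but commit counts alone already separate hot files from dormant ones.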
These define how different parts of the system talk to each other: API schemas, protobuf definitions, SQL migrations, and shared configuration.
Why they matter: a change to a contract file ripples across every repo that implements or consumes it, so questions about them are inherently cross-repository.
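Collecting such contract files can start as a glob scan; the patterns below are assumptions about a typical stack (protobuf, OpenAPI, SQL migrations) and would need tuning for yours:

```python
from pathlib import Path

# Glob patterns for common interface definitions -- adjust to your stack.
CONTRACT_PATTERNS = ["*.proto", "openapi*.y*ml", "migrations/*.sql", "*.graphql"]

def find_contract_files(repo_root: Path) -> list:
    """Collect files that define cross-service contracts."""
    found = []
    for pattern in CONTRACT_PATTERNS:
        found.extend(sorted(repo_root.rglob(pattern)))
    return found
```

Running this across all repos and cross-referencing the hits against your dependency graph tells you which contracts are actually consumed from other repositories.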
These regions should be overrepresented in your evaluation dataset. That’s where a GraphRAG system either shines or fails.
By the end of this mapping process, you should have three key deliverables.
1. Repository Dependency Graph (Visualized)
2. Query Pattern Taxonomy with Real Examples
3. Priority File List for Evaluation Dataset Construction
These artifacts give you a data-backed foundation for building evaluation datasets that truly reflect your organization’s complexity—and for measuring whether your GraphRAG system is genuinely helping developers navigate it.
By mapping your organizational codebase this way, you turn “25 random repositories” into a coherent graph of roles, dependencies, and real-world queries. Only then can your evaluation meaningfully test what matters: cross-repository retrieval, relationship traversal, version coherence, and grounded generation.