Mapping Your Organizational Codebase for Evaluation

Before creating evaluation datasets for a GraphRAG system, you must understand your codebase topology. This post walks through building repository dependency graphs, classifying repos by role, mining real developer questions, and identifying high-priority code regions that stress cross-repository retrieval.


Before you can evaluate a GraphRAG system on your code, you need to understand your codebase as a system.

If you’re working with 20–30 repositories, each owned by different teams and evolving at different speeds, “the codebase” is really an ecosystem: services, frontends, connectors, schemas, security layers, and infrastructure all woven together.

This post walks through how to map that ecosystem so you can design meaningful evaluation datasets for your cross-repository assistant.


Why Mapping Comes Before Evaluation

Most evaluation failures trace back to one root cause: the evaluation dataset doesn’t reflect how your system is actually used.

If you randomly sample files or pick a few issues from one repository, you’ll get:

• Questions that stay inside one service’s boundaries.
• Shallow lookups a plain code search could already answer.
• No coverage of the dependency chains that make the system hard to navigate.

But in reality, your developers ask things like:

• “How is user authentication implemented across the system?”
• “If I change the User schema, what services are affected?”
• “Why did we change the payment flow in Q3?”

Answering those questions requires cross-repository retrieval, version awareness, and graph traversal. To build an evaluation set that reflects this, you first need a map: a dependency graph of your organizational codebase and the queries developers actually care about.


Step 1: Build a Repository Dependency Graph

Start by extracting how your repositories depend on each other. At a minimum, you want to capture:

• Direct code imports between repositories.
• HTTP/RPC calls to other services.
• Database schema references (tables, columns, migrations).
• Shared configuration keys.

Conceptually, you’re building a graph where:

• Nodes are repositories (or files and modules within them).
• Edges are typed dependencies: imports, API calls, schema references, and config references.

Here’s a simplified sketch for Python repos; only the import extraction is implemented, with the other buckets noted as stubs:

from pathlib import Path
import ast

def extract_cross_repo_dependencies(repo_root: Path) -> dict:
    """
    Extract imports, API calls, and configuration references
    that cross repository boundaries.
    """
    dependencies = {
        "imports": [],           # Direct code imports
        "api_calls": [],         # HTTP/RPC calls to other services
        "schema_references": [], # Database table/column references
        "config_references": []  # Shared configuration keys
    }

    for py_file in repo_root.rglob("*.py"):
        try:
            tree = ast.parse(py_file.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # Skip files that don't parse cleanly
        for node in ast.walk(tree):
            # Extract import statements. The remaining buckets (API client
            # instantiations, SQL query strings, config key accesses) need
            # project-specific pattern matching.
            if isinstance(node, ast.Import):
                dependencies["imports"].extend(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                dependencies["imports"].append(node.module)

    return dependencies

In a full implementation, you would:

• Resolve import names to the repositories that publish them.
• Detect HTTP/RPC client calls and map base URLs to the owning services.
• Parse SQL strings and migration files for table and column references.
• Track shared configuration keys across repos.

Normalize these into a graph format (e.g., JSON, Neo4j, NetworkX) so you can query:

• Which repos depend on a given repo, directly or transitively?
• What is the blast radius of a schema or API change?
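
To make the normalization step concrete, here’s a minimal sketch using NetworkX. It assumes the per-repo dependency dicts from above have already had their targets resolved to repository names; build_dependency_graph and the repo names in the comments are hypothetical:

import networkx as nx

def build_dependency_graph(repo_deps: dict) -> nx.DiGraph:
    # repo_deps maps repo name -> output of extract_cross_repo_dependencies,
    # with dependency targets resolved to repo names.
    graph = nx.DiGraph()
    for repo, deps in repo_deps.items():
        graph.add_node(repo)
        for kind, targets in deps.items():
            for target in targets:
                # One typed edge per dependency (import, api_call, ...)
                graph.add_edge(repo, target, kind=kind)
    return graph

# Example queries:
# list(graph.predecessors("user-schema"))   -> repos that depend on user-schema
# nx.descendants(graph, "billing-service")  -> everything billing transitively uses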


Step 2: Classify Repositories by Role

Once you can see edges, group repos by the role they play. For a typical 25-repository setup, you might see something like:

| Repository Type | Count | Cross-Repo Dependency Pattern |
|---|---|---|
| Backend Services | ~8 | Import connectors, call other backends, reference SQL schemas |
| Frontend Apps | ~4 | Call backend APIs, share component libraries |
| Connectors/SDKs | ~5 | Imported by backends and frontends |
| SQL/Migrations | ~3 | Referenced by backends, define data contracts |
| Security/Auth | ~2 | Wrapped around or injected into all service calls |
| Infrastructure | ~3 | Configure deployment, routing, and environment for all of the above |

This classification helps you:

• Balance evaluation coverage across roles instead of oversampling one repo type.
• Anticipate which repos will co-occur in answers (e.g., backends plus the connectors they import).
• Spot the contract layers (schemas, auth, config) that most cross-repo questions flow through.

These roles will later drive your query types and evaluation focus areas.
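
If you want a starting point in code, a rough heuristic pass over each repo can bootstrap the classification before a human corrects it. The file markers below are illustrative assumptions, not a standard:

from pathlib import Path

def classify_repo(repo_root: Path) -> str:
    # Heuristic markers; adjust to your org's actual conventions.
    if (repo_root / "migrations").is_dir():
        return "sql_migrations"
    if (repo_root / "package.json").exists():
        return "frontend_app"
    if (repo_root / "Dockerfile").exists():
        return "backend_service"
    if (repo_root / "pyproject.toml").exists():
        return "connector_sdk"
    return "unclassified"

A pass like this will misclassify plenty (backends also ship pyproject.toml files), which is exactly why its output should be reviewed rather than trusted.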


Step 3: Mine Real Developer Questions

Next, you need to understand what developers actually ask when they’re stuck.

Useful sources:

• Slack help channels where developers ask each other for pointers.
• Code review comments that ask “where else is this used?”
• Incident retrospectives and postmortems.
• Onboarding questions from new team members.

For each question you collect, note:

• Which repositories and file types are needed to answer it.
• Whether it crosses repository boundaries.
• Whether answering it requires history (git log, PRs, design discussions) rather than just current code.

This corpus of real questions is the raw material for your evaluation query taxonomy.
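
A lightweight record keeps this corpus consistent. Here’s a minimal sketch, with field names that are assumptions rather than any standard schema:

from dataclasses import dataclass, field

@dataclass
class DeveloperQuestion:
    text: str                      # The question as asked
    source: str                    # e.g., "slack", "code_review", "incident"
    repos_involved: list = field(default_factory=list)
    file_types: list = field(default_factory=list)
    needs_history: bool = False    # Needs git history / PRs, not just code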


Step 4: Define Evaluation-Relevant Query Types

From those real questions and your dependency graph, you can define a few core query types that your GraphRAG system must handle. Here’s a practical taxonomy:

Type A: Single-Repository, Multi-File

“How does the payment processing work in the billing service?”

Characteristics:

• Scoped to one repository but spans multiple files and modules.
• Requires assembling a coherent picture of one subsystem’s internal flow.

Use cases:

• Onboarding to an unfamiliar service.
• Understanding a feature before modifying it.

Type B: Cross-Repository, Single Concept

“How is user authentication implemented across the system?”

Characteristics:

• One concept implemented in parallel across several repos: backend services, frontend apps, and the security layer.
• Requires retrieving the parallel implementations and explaining how they fit together.

Use cases:

• System-wide audits (security, logging, error handling).
• Keeping conventions consistent across teams.

Type C: Cross-Repository, Dependency Chain

“If I change the User schema, what services are affected?”

Characteristics:

• Starts from one artifact (a schema, an API) and follows dependency edges outward.
• Requires multi-hop graph traversal, not just similarity search.

Use cases:

• Impact analysis before a breaking change.
• Planning migrations and deprecations.

Type D: Cross-Repository, Temporal

“Why did we change the payment flow in Q3?”

Characteristics:

• Requires git history, pull requests, and design discussion, not just current code.
• The answer spans both repositories and points in time.

Use cases:

• Understanding the rationale behind past decisions.
• Debugging regressions introduced by coordinated changes.

These four types should cover a large portion of the high-value questions developers ask in a 25-repo environment.
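
As a first triage, simple rules over the DeveloperQuestion records sketched earlier can bucket questions into these types. The follows_chain flag is assumed to be a manual annotation, and human review should correct the labels:

def classify_query_type(q: DeveloperQuestion, follows_chain: bool = False) -> str:
    # Rule-of-thumb mapping onto Types A-D, not a real classifier.
    if len(q.repos_involved) <= 1:
        return "A"  # Single-repo, multi-file
    if q.needs_history:
        return "D"  # Cross-repo, temporal
    return "C" if follows_chain else "B"  # Dependency chain vs. single concept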


Step 5: Identify Evaluation-Critical Code Regions

Not all code is equally important to evaluate. Some files are essentially “leaf” nodes; others are central arteries.

Focus your evaluation on three kinds of hotspots:

1. High-Connectivity Nodes

These are files or modules that many others depend on:

• Shared authentication or security modules.
• Core data models and schema definitions.
• Connector/SDK entry points imported by multiple services.

Why they matter:

• A retrieval mistake on a hub file contaminates answers to many different questions.
• They anchor most Type B and Type C queries.
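
If you built the Step 1 graph with NetworkX, ranking hubs is short. This sketch assumes the graph has been extended to file-level nodes:

import networkx as nx

# graph: the file-level dependency DiGraph from Step 1
centrality = nx.in_degree_centrality(graph)
hubs = sorted(centrality, key=centrality.get, reverse=True)[:20]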

2. Recent Change Hotspots

Look at your Git history over the last six months:

• Which files change most frequently?
• Which files change together across repository boundaries?
• Which changes triggered follow-up questions or incidents?

Why they matter:

• Active code attracts the most developer questions.
• They stress version awareness: answers must reflect current code, not a stale index.
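
A small git-log pass surfaces these hotspots. A sketch, assuming each repo is an ordinary git checkout:

import subprocess
from collections import Counter
from pathlib import Path

def change_hotspots(repo_root: Path, top_n: int = 20):
    # Count per-file commit frequency over the last six months.
    log = subprocess.run(
        ["git", "log", "--since=6 months ago", "--name-only", "--pretty=format:"],
        cwd=repo_root, capture_output=True, text=True, check=True,
    ).stdout
    counts = Counter(line for line in log.splitlines() if line.strip())
    return counts.most_common(top_n)

Files that rank highly in two repos over the same window are good candidates for cross-repo, temporal (Type D) evaluation questions.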

3. Architectural Boundaries

These define how different parts of the system talk to each other:

• API contracts (OpenAPI specs, protobuf definitions, client interfaces).
• Database schemas and migrations that act as data contracts.
• Shared configuration and auth layers injected into every service.

Why they matter:

• Most cross-repo questions ultimately route through a boundary.
• Getting a contract wrong in an answer is more costly than getting an internal detail wrong.
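
Locating boundary files can start from filename conventions. The glob patterns below are illustrative assumptions about repo layout, not a standard:

from pathlib import Path

BOUNDARY_PATTERNS = ["openapi*.yaml", "*.proto", "migrations/*.sql"]

def find_boundary_files(repo_root: Path) -> list:
    matches = []
    for pattern in BOUNDARY_PATTERNS:
        # rglob searches the whole repo tree for each pattern.
        matches.extend(repo_root.rglob(pattern))
    return matches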

These regions should be overrepresented in your evaluation dataset. That’s where a GraphRAG system either shines or fails.


Step 6: Produce Concrete Evaluation Artifacts

By the end of this mapping process, you should have three key deliverables.

  1. Repository Dependency Graph (Visualized)

    • A graph view (e.g., in Neo4j, Graphviz, or a custom dashboard) that shows:
      • Repos as nodes.
      • Edges for imports, API calls, schema references, config dependencies.
    • Ability to click on a node and see its inbound/outbound dependencies.
  2. Query Pattern Taxonomy with Real Examples

    • A catalog of questions grouped into:
      • Type A: Single-repo, multi-file.
      • Type B: Cross-repo, single concept.
      • Type C: Cross-repo, dependency chain.
      • Type D: Cross-repo, temporal.
    • For each type:
      • At least 5–10 real examples from your Slack/code reviews/incidents.
      • The set of repos and file types involved.
  3. Priority File List for Evaluation Dataset Construction

    • A ranked list (scored as in the sketch after this list) of:
      • High-connectivity files/modules.
      • Recent change hotspots.
      • Boundary/contract layers.
    • For each file or module:
      • Why it’s important.
      • Example questions it participates in.
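
To turn deliverable 3 into a ranked list, you can combine the Step 5 signals into one score. The weights below are illustrative placeholders to tune against your own judgment, not recommended values:

def priority_score(in_degree: int, commit_count: int, is_boundary: bool) -> float:
    # Blend connectivity, churn, and boundary status into one rank key.
    return 0.5 * in_degree + 0.3 * commit_count + (2.0 if is_boundary else 0.0)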

These artifacts give you a data-backed foundation for building evaluation datasets that truly reflect your organization’s complexity—and for measuring whether your GraphRAG system is genuinely helping developers navigate it.


By mapping your organizational codebase this way, you turn “25 random repositories” into a coherent graph of roles, dependencies, and real-world queries. Only then can your evaluation meaningfully test what matters: cross-repository retrieval, relationship traversal, version coherence, and grounded generation.