Mapping Your Organizational Codebase for Evaluation

Before creating evaluation datasets for a GraphRAG system, you must understand your codebase topology. This post walks through building repository dependency graphs, classifying repos by role, mining real developer questions, and identifying high-priority code regions that stress cross-repository retrieval.

Before you can evaluate a GraphRAG system on your code, you need to understand your codebase as a system.

If you’re working with 20–30 repositories, each owned by different teams and evolving at different speeds, “the codebase” is really an ecosystem: services, frontends, connectors, schemas, security layers, and infrastructure all woven together.

This post walks through how to map that ecosystem so you can design meaningful evaluation datasets for your cross-repository assistant.


Why Mapping Comes Before Evaluation

Many evaluation failures trace back to one root cause: the evaluation dataset doesn’t reflect how your system is actually used.

If you randomly sample files or pick a few issues from one repository, you’ll get:

  • Single-repo, single-file questions.
  • Little to no cross-service reasoning.
  • Very little stress on retrieval.

But in reality, your developers ask things like:

  • “If I change this user field, what breaks?”
  • “Where is authentication actually enforced?”
  • “Why did we change this integration last quarter?”

Answering those questions requires cross-repository retrieval, version awareness, and graph traversal. To build an evaluation set that reflects this, you first need a map: a dependency graph of your organizational codebase and the queries developers actually care about.


Step 1: Build a Repository Dependency Graph

Start by extracting how your repositories depend on each other. At a minimum, you want to capture:

  • Imports between modules and shared libraries.
  • API calls between services (HTTP/REST, GraphQL, gRPC and other RPC, messaging).
  • Schema references (tables, columns, migrations).
  • Configuration references (shared config keys, feature flags, secrets).

Conceptually, you’re building a graph where:

  • Nodes = repositories, services, schemas, key modules.
  • Edges = “calls”, “imports”, “references”, “configured by”.

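Here’s a simplified sketch for Python repos. It collects imports only and leaves the other three categories as stubs:

from pathlib import Path
import ast

def extract_cross_repo_dependencies(repo_root: Path) -> dict:
    """
    Extract imports, API calls, and configuration references
    that cross repository boundaries.
    """
    dependencies = {
        "imports": [],           # Direct code imports
        "api_calls": [],         # HTTP/RPC calls to other services
        "schema_references": [], # Database table/column references
        "config_references": [], # Shared configuration keys
    }

    for py_file in repo_root.rglob("*.py"):
        try:
            tree = ast.parse(py_file.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # Skip files that cannot be read or parsed

        for node in ast.walk(tree):
            # Extract import statements
            if isinstance(node, ast.Import):
                dependencies["imports"] += [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                dependencies["imports"].append(node.module)
            # TODO: Identify API client instantiations
            # TODO: Find SQL query strings
            # TODO: Locate config key accesses

    return dependencies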

In a full implementation, you would:

  • Parse import statements and map them to other internal packages/repos.
  • Look for HTTP/RPC clients (e.g., requests, httpx, gRPC stubs) and extract target service names/URLs.
  • Identify SQL strings and extract referenced tables/columns.
  • Track access to config modules (e.g., settings["PAYMENTS_API_URL"]).
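
As one illustration of the last bullet, here is a minimal sketch of finding config key reads with the standard-library `ast` module. It assumes Python 3.9+ (where a subscript's `slice` is the key expression itself) and assumes the config object is a plain name such as `settings`. The `HTTP_TIMEOUT` key and the `cache` object in the sample are hypothetical.

```python
import ast

def find_config_keys(source: str, config_name: str = "settings") -> list[str]:
    """Return the string keys read via `<config_name>["KEY"]` in the source."""
    keys = set()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Subscript)
            and isinstance(node.value, ast.Name)
            and node.value.id == config_name
            and isinstance(node.slice, ast.Constant)
            and isinstance(node.slice.value, str)
        ):
            keys.add(node.slice.value)
    return sorted(keys)

# Hypothetical sample: two reads from `settings`, one from an unrelated object.
sample = '''
url = settings["PAYMENTS_API_URL"]
timeout = settings["HTTP_TIMEOUT"]
other = cache["PAYMENTS_API_URL"]
'''
print(find_config_keys(sample))
```

This only catches literal string keys on a bare name. Attribute access (`config.settings[...]`), `.get()` calls, and computed keys would need extra cases.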

Normalize these into a graph representation (e.g., a JSON edge list, Neo4j, or NetworkX) so you can query:

  • “Which repos depend on user-service?”
  • “Which services hit the payments database?”
  • “What calls flow through auth-middleware?”
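
Queries like these reduce to reverse lookups over an edge list. The sketch below uses only the standard library. The edge tuples and repository names other than `user-service` and `auth-middleware` are hypothetical, and a graph database or NetworkX would serve equally well.

```python
from collections import defaultdict

# Hypothetical edges: (source repo, relation, target).
EDGES = [
    ("billing-service", "calls", "user-service"),
    ("web-frontend", "calls", "user-service"),
    ("billing-service", "references", "payments-db"),
    ("user-service", "imports", "auth-middleware"),
]

def build_reverse_index(edges):
    """Map each target to the set of (source, relation) pairs pointing at it."""
    index = defaultdict(set)
    for source, relation, target in edges:
        index[target].add((source, relation))
    return index

def dependents_of(index, target, relation=None):
    """Return sorted sources that depend on `target`, optionally filtered by relation."""
    return sorted(
        source
        for source, rel in index.get(target, set())
        if relation is None or rel == relation
    )

index = build_reverse_index(EDGES)
print(dependents_of(index, "user-service"))
print(dependents_of(index, "payments-db", relation="references"))
```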

Step 2: Classify Repositories by Role

Once you can see edges, group repos by the role they play. For a typical 25-repository setup, you might see something like:

Repository Type  | Count | Cross-Repo Dependency Pattern
Backend Services | ~8    | Import connectors, call other backends, reference SQL schemas
Frontend Apps    | ~4    | Call backend APIs, share component libraries
Connectors/SDKs  | ~5    | Imported by backends and frontends
SQL/Migrations   | ~3    | Referenced by backends, define data contracts
Security/Auth    | ~2    | Wrapped around or injected into all service calls
Infrastructure   | ~3    | Configure deployment, routing, and environment for all of the above

This classification helps you:

  • Understand which repos are sources of truth (e.g., schemas, auth).
  • See where contract boundaries live (APIs, types, connectors).
  • Identify central hubs that many others depend on.

These roles will later drive your query types and evaluation focus areas.


Step 3: Mine Real Developer Questions

Next, you need to understand what developers actually ask when they’re stuck.

Useful sources:

  • Slack / Teams channels
    • #help-backend, #help-frontend, #incidents, #oncall
    • Look for threads where someone asks a question and multiple services/repos get discussed.
  • Code review comments
    • “Where else is this called?”
    • “Does this break X integration?”
    • “Is this consistent with the schema in Y?”
  • Onboarding docs and FAQs
    • “How does authentication work end-to-end?”
    • “How does the billing pipeline operate?”
  • Incident post-mortems
    • “What caused this outage?”
    • “Which services were impacted and why?”

For each question you collect, note:

  • Which repositories were referenced in the answer.
  • Whether the resolution required:
    • Following a dependency chain.
    • Understanding a shared concept across multiple repos.
    • Looking at historical changes (commits, PRs).

This corpus of real questions is the raw material for your evaluation query taxonomy.
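
One way to keep those notes consistent is a small record per question. The sketch below is an assumption about shape, not a prescribed schema: the class name, field names, and the repositories in the example are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MinedQuestion:
    text: str
    source: str  # e.g. "slack", "code_review", "onboarding", "postmortem"
    repos: list[str] = field(default_factory=list)  # repos referenced in the answer
    needs_dependency_chain: bool = False  # resolution followed a dependency chain
    needs_shared_concept: bool = False    # resolution spanned a concept across repos
    needs_history: bool = False           # resolution needed commits or PRs

    def is_cross_repo(self) -> bool:
        """True when the answer touched more than one distinct repository."""
        return len(set(self.repos)) > 1

question = MinedQuestion(
    text="If I change this user field, what breaks?",
    source="slack",
    repos=["user-service", "web-frontend", "user-sdk"],  # hypothetical repos
    needs_dependency_chain=True,
)
print(question.is_cross_repo())
```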


Step 4: Define Evaluation-Relevant Query Types

From those real questions and your dependency graph, you can define a few core query types that your GraphRAG system must handle. Here’s a practical taxonomy:

Type A: Single-Repository, Multi-File

“How does the payment processing work in the billing service?”

Characteristics:

  • Stays within one repository, but:
    • Spans multiple modules (e.g., handlers, services, domain logic).
    • May involve config, models, and background jobs.
  • Requires:
    • Retrieval across roughly 5–15 files.
    • Local reasoning about control flow and data flow.

Use cases:

  • Service onboarding.
  • Understanding how a specific feature works end-to-end within a single codebase.

Type B: Cross-Repository, Single Concept

“How is user authentication implemented across the system?”

Characteristics:

  • Focuses on one concept (User, Auth, Session) that appears in many repos:
    • Auth backend.
    • Frontend session management.
    • Database user tables.
    • Security middleware / gateways.
  • Requires:
    • Retrieval from multiple repositories.
    • Entity resolution: understanding that Customer, Account, and User may be related.
    • Producing a coherent concept-level explanation.

Use cases:

  • Security reviews.
  • Cross-service feature design.
  • Architecture documentation.

Type C: Cross-Repository, Dependency Chain

“If I change the User schema, what services are affected?”

Characteristics:

  • Starts at one artifact (e.g., SQL schema) and fans out across dependent systems:
    • SQL schema → backend ORMs/models → API contracts → frontend types → connectors/SDKs.
  • Requires:
    • Multi-hop graph traversal over your dependency graph.
    • Identifying indirect impact (e.g., a connector used by three services).

Use cases:

  • Change impact analysis.
  • Migration planning.
  • Safe refactoring at scale.
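
A minimal sketch of the multi-hop traversal, written as a breadth-first search over "is depended on by" edges. The node names are hypothetical and simply mirror the schema → backend → API contract → frontend/SDK chain described above.

```python
from collections import deque

# Hypothetical "is depended on by" edges.
DEPENDENTS = {
    "user-schema": ["user-service"],
    "user-service": ["user-api-contract"],
    "user-api-contract": ["web-frontend", "user-sdk"],
    "user-sdk": ["billing-service", "reporting-service"],
}

def impacted_by(start, dependents):
    """Breadth-first traversal: map each affected node to its hop distance from `start`."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for downstream in dependents.get(node, []):
            if downstream not in distance:  # also guards against cycles
                distance[downstream] = distance[node] + 1
                queue.append(downstream)
    distance.pop(start)
    return distance

impact = impacted_by("user-schema", DEPENDENTS)
for name, hops in sorted(impact.items(), key=lambda item: item[1]):
    print(hops, name)
```

The hop distance is useful when labeling evaluation examples: nodes at distance 3 or more are the indirect impacts that single-hop retrieval tends to miss.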

Type D: Cross-Repository, Temporal

“Why did we change the payment flow in Q3?”

Characteristics:

  • Spans time as well as repositories:
    • Requires looking at Git history, PR descriptions, commit messages.
    • Often involves documentation updates and incident reports.
  • Requires:
    • Retrieval conditioned on time windows (e.g., Q3).
    • Linking code changes to narrative artifacts (docs, post-mortems, RFCs).

Use cases:

  • Post-mortems.
  • Architectural decision tracking.
  • Audits and compliance.
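
Time-conditioned retrieval can start as a plain date filter over commit metadata. This sketch assumes calendar quarters (your organization's fiscal Q3 may differ) and assumes commits have already been exported as dictionaries. The records, paths, and year shown are hypothetical.

```python
from datetime import date, timedelta

def quarter_bounds(year: int, quarter: int) -> tuple[date, date]:
    """Return the first and last day of a calendar quarter."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    start_month = 3 * (quarter - 1) + 1
    start = date(year, start_month, 1)
    if quarter == 4:
        next_start = date(year + 1, 1, 1)
    else:
        next_start = date(year, start_month + 3, 1)
    return start, next_start - timedelta(days=1)

def commits_in_window(commits, start, end, path_prefix=""):
    """Keep commits dated within [start, end] whose path starts with `path_prefix`."""
    return [
        c for c in commits
        if start <= c["date"] <= end and c["path"].startswith(path_prefix)
    ]

# Hypothetical commit records.
COMMITS = [
    {"repo": "billing-service", "path": "payments/flow.py",
     "date": date(2024, 8, 14), "message": "Rework payment retry flow"},
    {"repo": "billing-service", "path": "payments/flow.py",
     "date": date(2024, 11, 2), "message": "Fix typo"},
    {"repo": "web-frontend", "path": "checkout/form.tsx",
     "date": date(2024, 7, 30), "message": "Update checkout form"},
]

start, end = quarter_bounds(2024, 3)
q3_payment_commits = commits_in_window(COMMITS, start, end, path_prefix="payments/")
print([c["message"] for c in q3_payment_commits])
```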

These four types should cover a large portion of the high-value questions developers ask in a 25-repo environment.
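
To label mined questions with these types, a simple precedence rule is often enough as a first pass. The rule below is an assumption, not part of the taxonomy itself: temporal questions are checked first, then single-repo, then dependency-chain, with single-concept as the fallback. Questions that fit several types still deserve a manual look.

```python
def classify_query(repo_count: int, needs_history: bool, needs_dependency_chain: bool) -> str:
    """Assign a query type (A-D) using a fixed precedence order (an assumption)."""
    if needs_history:
        return "D"  # Cross-repository, temporal
    if repo_count <= 1:
        return "A"  # Single-repository, multi-file
    if needs_dependency_chain:
        return "C"  # Cross-repository, dependency chain
    return "B"      # Cross-repository, single concept

print(classify_query(repo_count=4, needs_history=False, needs_dependency_chain=True))
```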


Step 5: Identify Evaluation-Critical Code Regions

Not all code is equally important to evaluate. Some files are essentially “leaf” nodes; others are central arteries.

Focus your evaluation on three kinds of hotspots:

1. High-Connectivity Nodes

These are files or modules that many others depend on:

  • Files imported by 10+ other files across repositories.
  • API endpoints called by multiple consumers (backends, frontends, connectors).
  • Shared type definitions and DTOs used in contracts.

Why they matter:

  • A retrieval failure here cascades into every answer that depends on these nodes.
  • These are natural choke points for bugs, regressions, and integration issues.
  • Evaluating your system on questions involving these nodes stresses its ability to navigate dense parts of the graph.
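
A minimal sketch of flagging high-connectivity files from import edges, using the 10+ importer threshold mentioned above. The edge format `(importing_file, imported_file)` and the sample paths are hypothetical.

```python
from collections import defaultdict

def high_connectivity(import_edges, min_importers=10):
    """Return {target: importer_count} for targets with at least `min_importers` distinct importers."""
    importers = defaultdict(set)
    for importer, target in import_edges:
        if importer != target:  # ignore self-references
            importers[target].add(importer)
    return {
        target: len(sources)
        for target, sources in importers.items()
        if len(sources) >= min_importers
    }

# Hypothetical edges: 12 files across 3 repos import a shared types module,
# while only 2 files import a small utility.
edges = [(f"repo-{i % 3}/module_{i}.py", "shared/types.py") for i in range(12)]
edges += [
    ("repo-0/module_0.py", "utils/dates.py"),
    ("repo-1/module_1.py", "utils/dates.py"),
]

hubs = high_connectivity(edges)
print(hubs)
```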

2. Recent Change Hotspots

Look at your Git history over the last six months:

  • Files with high commit frequency.
  • Areas with many PR comments or reverts.
  • Code associated with recent incidents or “hot” features.

Why they matter:

  • This is where developers currently spend time and make changes.
  • Evaluation on these areas tests:
    • Version awareness (has the system “seen” the new patterns?).
    • Robustness to churn (can it reason even as files evolve?).
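
Commit frequency can be approximated by counting paths in the output of `git log --since="6 months ago" --name-only --pretty=format:`, which prints the files touched by each commit separated by blank lines. The sketch below parses such output from a string. The sample log is hypothetical, and renamed files are counted under each name they had.

```python
from collections import Counter

def commit_frequency(name_only_log: str) -> Counter:
    """Count how many commits touched each path in `git log --name-only` output."""
    return Counter(
        line.strip() for line in name_only_log.splitlines() if line.strip()
    )

# Hypothetical output covering three commits.
SAMPLE_LOG = """
payments/flow.py
payments/models.py

payments/flow.py

auth/middleware.py
payments/flow.py
"""

frequency = commit_frequency(SAMPLE_LOG)
print(frequency.most_common(1))
```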

3. Architectural Boundaries

These define how different parts of the system talk to each other:

  • Service-to-service communication layers.
  • Database access layers / repositories.
  • Authentication and authorization checks.
  • Gateways, API edges, and public interfaces.

Why they matter:

  • Cross-repository retrieval is most critical at boundaries.
  • Many developer questions are really about “what crosses this boundary and how?”

These regions should be overrepresented in your evaluation dataset. That’s where a GraphRAG system either shines or fails.
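
One simple way to overrepresent these regions is quota-based sampling: fix how many evaluation items come from each region, then sample within it. The sketch below is only an illustration. The quota values, region labels, and file paths are hypothetical, and the seed is fixed so the sample is reproducible.

```python
import random

def sample_with_quotas(candidates, quotas, seed=0):
    """Pick up to `quotas[region]` items from each region's candidate pool."""
    rng = random.Random(seed)
    picked = []
    for region, quota in quotas.items():
        pool = candidates.get(region, [])
        for item in rng.sample(pool, min(quota, len(pool))):
            picked.append((region, item))
    return picked

# Hypothetical candidate files per region.
CANDIDATES = {
    "high_connectivity": ["shared/types.py", "shared/client.py", "auth/tokens.py"],
    "change_hotspot": ["payments/flow.py", "payments/models.py"],
    "boundary": ["gateway/routes.py", "db/repository.py", "auth/middleware.py"],
    "other": [f"misc/file_{i}.py" for i in range(50)],
}
# Hypothetical quotas: hotspots get most of the budget despite being few files.
QUOTAS = {"high_connectivity": 3, "change_hotspot": 2, "boundary": 3, "other": 2}

selection = sample_with_quotas(CANDIDATES, QUOTAS)
print(len(selection))
```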


Step 6: Produce Concrete Evaluation Artifacts

By the end of this mapping process, you should have three key deliverables.

  1. Repository Dependency Graph (Visualized)

    • A graph view (e.g., in Neo4j, Graphviz, or a custom dashboard) that shows:
      • Repos as nodes.
      • Edges for imports, API calls, schema references, config dependencies.
    • Ability to click on a node and see its inbound/outbound dependencies.
  2. Query Pattern Taxonomy with Real Examples

    • A catalog of questions grouped into:
      • Type A: Single-repo, multi-file.
      • Type B: Cross-repo, single concept.
      • Type C: Cross-repo, dependency chain.
      • Type D: Cross-repo, temporal.
    • For each type:
      • At least 5–10 real examples from your Slack/code reviews/incidents.
      • The set of repos and file types involved.
  3. Priority File List for Evaluation Dataset Construction

    • A ranked list of:
      • High-connectivity files/modules.
      • Recent change hotspots.
      • Boundary/contract layers.
    • For each file or module:
      • Why it’s important.
      • Example questions it participates in.

These artifacts give you a data-backed foundation for building evaluation datasets that reflect your organization’s complexity, and for measuring whether your GraphRAG system is genuinely helping developers navigate it.


By mapping your organizational codebase this way, you turn “25 random repositories” into a coherent graph of roles, dependencies, and real-world queries. Only then can your evaluation meaningfully test what matters: cross-repository retrieval, relationship traversal, version coherence, and grounded generation.
