Evaluating Generation and Grounding in Multi-Repo Systems

Retrieving nodes is only half the battle; the LLM must synthesize code that adheres to cross-repo constraints. This post explores measuring faithfulness, checking execution-level correctness against internal SDKs, and using LLM-as-a-Judge to verify that generated code respects the security and type contracts of separate repositories.

Generation Evaluation with Execution and LLM as Judge

Once retrieval is done, the GraphRAG system has one final job.

Turn retrieved context into an answer that engineers can trust.

This blog focuses only on evaluating the generation layer. Not embeddings. Not graph traversal. Not ranking. Just the final output and whether it deserves to exist in a developer workflow.

This is Blog 6 in the series and it builds directly on the evaluation dataset and retrieval evaluation defined earlier.


Why generation evaluation is different in GraphRAG

In classic RAG systems, generation quality is often judged by how fluent or similar the text looks compared to a reference.

That breaks down immediately in GraphRAG.

A GraphRAG system is expected to:

- Call real functions with their real signatures
- Respect the type and security contracts defined in other repositories
- Combine context from multiple repositories without inventing glue code

Metrics like BLEU, ROUGE, or embedding similarity cannot measure any of this.
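To make that concrete, here is a small illustration (using a hypothetical payment-client call): a snippet that invokes a method that does not exist is nearly indistinguishable from the correct one by string similarity, even though only one of them can run.

```python
from difflib import SequenceMatcher

# Hypothetical contract: the client exposes charge(), not bill().
correct = "client.charge(amount_cents=1000, currency='usd')"
wrong = "client.bill(amount_cents=1000, currency='usd')"  # broken call

# Surface similarity is high even though one snippet cannot execute.
similarity = SequenceMatcher(None, correct, wrong).ratio()
print(f"{similarity:.2f}")  # 0.89
```

A similarity-based metric would score the broken snippet as a near-perfect answer; an execution-based check fails it immediately.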

Generation evaluation in GraphRAG must answer two concrete questions:

- Does generated code actually run against the interfaces it claims to use?
- Are non-code answers grounded in the retrieved context rather than invented?

Those questions require two complementary techniques: execution based evaluation for code, and LLM as Judge for everything else.

Either one alone is insufficient.


Part 1: Execution based evaluation for generated code

When the output is code, there is a single source of truth.

The runtime.

If the generated code fails to execute against your expected interfaces, the answer is wrong regardless of how confident it sounds.

Design goals for execution evaluation

A correct execution evaluator must:

- Isolate execution in a sandbox with no network access
- Mock cross repository dependencies so only the generated code is under test
- Enforce time and memory limits
- Return structured pass/fail results with error details

The goal is not fuzz testing. The goal is verifying correctness against known contracts.


Execution sandbox implementation

The implementation below executes generated code inside a Docker container with mocked cross repository imports.

import tempfile
import docker
from pathlib import Path
from typing import Dict, List, Optional
from dataclasses import dataclass

@dataclass
class ExecutionResult:
    passed: bool
    stdout: str
    stderr: str
    execution_time: float
    error_type: Optional[str] = None

class CodeExecutionEvaluator:
    """
    Executes generated code inside a sandboxed container with
    mocked cross repository dependencies.
    """

    def __init__(self, repository_mocks: Dict[str, str]):
        """
        repository_mocks maps import paths to mock implementations.
        Example:
        {
          "payment_service.client": "class PaymentClient: ..."
        }
        """
        self.mocks = repository_mocks
        self.docker = docker.from_env()

    def evaluate(
        self,
        generated_code: str,
        test_cases: List[Dict],
        timeout: int = 30
    ) -> ExecutionResult:
        with tempfile.TemporaryDirectory() as tmpdir:
            self._write_mocks(tmpdir)
            self._write_code(tmpdir, generated_code, test_cases)
            return self._run_container(tmpdir, timeout)

    def _write_mocks(self, tmpdir: str):
        for import_path, mock_code in self.mocks.items():
            parts = import_path.split(".")
            base = Path(tmpdir)

            # Create package directories for every component except the
            # last, which becomes a module file. Creating a directory
            # with the same name as the module would shadow it at
            # import time.
            for i in range(len(parts) - 1):
                pkg = base.joinpath(*parts[: i + 1])
                pkg.mkdir(parents=True, exist_ok=True)
                (pkg / "__init__.py").touch(exist_ok=True)

            file_path = base.joinpath(*parts).with_suffix(".py")
            file_path.write_text(mock_code)

    def _write_code(
        self,
        tmpdir: str,
        generated_code: str,
        test_cases: List[Dict]
    ):
        Path(tmpdir, "generated.py").write_text(generated_code)

        test_code = self._create_test_harness(test_cases)
        Path(tmpdir, "test_generated.py").write_text(test_code)

    def _create_test_harness(self, test_cases: List[Dict]) -> str:
        # python:3.11-slim ships without pytest and the container has no
        # network access, so the harness runs its own assertions and
        # exits nonzero on any failure.
        lines = [
            "import generated",
            "",
        ]
        for i, case in enumerate(test_cases):
            lines.append(f"def test_case_{i}():")
            lines.append(f"    {case['assertion']}")
            lines.append("")
        lines += [
            'if __name__ == "__main__":',
            "    import sys, traceback",
            "    failures = 0",
            "    for name, fn in sorted(list(globals().items())):",
            "        if name.startswith('test_case_') and callable(fn):",
            "            try:",
            "                fn()",
            "            except Exception:",
            "                traceback.print_exc()",
            "                failures += 1",
            "    sys.exit(1 if failures else 0)",
        ]
        return "\n".join(lines)

    def _run_container(self, code_dir: str, timeout: int) -> ExecutionResult:
        import time

        start = time.monotonic()
        try:
            # containers.run() has no timeout parameter; the coreutils
            # `timeout` command inside the container enforces the limit.
            output = self.docker.containers.run(
                image="python:3.11-slim",
                command=["sh", "-c", f"timeout {timeout}s python test_generated.py"],
                volumes={code_dir: {"bind": "/code", "mode": "ro"}},
                working_dir="/code",
                remove=True,
                mem_limit="512m",
                network_disabled=True,
            )
            return ExecutionResult(
                passed=True,
                stdout=output.decode(),
                stderr="",
                execution_time=time.monotonic() - start,
            )
        except docker.errors.ContainerError as e:
            # ContainerError exposes stderr (bytes or str) but no stdout.
            stderr = e.stderr.decode() if isinstance(e.stderr, bytes) else (e.stderr or "")
            return ExecutionResult(
                passed=False,
                stdout="",
                stderr=stderr,
                execution_time=time.monotonic() - start,
                error_type="test_failure",
            )
        except Exception as e:
            return ExecutionResult(
                passed=False,
                stdout="",
                stderr=str(e),
                execution_time=time.monotonic() - start,
                error_type="execution_error",
            )

What this evaluator measures

This evaluator answers one precise question: given correct retrieval, did the model generate code that works in our environment?

It is equivalent to pass@1, but scoped to your organization, your repositories, and your dependency graph.
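In benchmark terms this is pass@1. If you later sample multiple generations per query, the standard unbiased pass@k estimator from the HumanEval evaluation methodology generalizes it (a sketch; `n` generations, `c` of them passing):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations (c of which passed) is correct."""
    if n - c < k:
        return 1.0  # too few failing generations to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 generations, 3 passing: a single draw succeeds 30% of the time.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

With k=1 this reduces to the plain pass rate, which is exactly what the combined evaluator below reports.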


Part 2: LLM as Judge for non code outputs

Many GraphRAG queries do not produce code.

They produce explanations, architectural reasoning, migration guidance, or dependency analysis.

These outputs cannot be executed. They must be judged.

This is where LLM as Judge is appropriate, provided it is used carefully and with structure.


What LLM as Judge should evaluate

Only things that humans would normally evaluate during review:

- Groundedness to retrieved context
- Completeness relative to an expert answer
- Correct integration across repositories

It should never replace execution based evaluation.


LLM Judge implementation

from openai import OpenAI
import json
from typing import Dict

class LLMJudge:
    """
    Structured evaluation of generation quality.
    """

    GROUNDEDNESS_PROMPT = """
        You are evaluating whether an AI assistant response is grounded in the provided context.

        Context:
        {context}

        Question:
        {question}

        Response:
        {response}

        For each factual claim:
        1. Identify the claim
        2. Quote supporting evidence or return null
        3. Rate it SUPPORTED, INFERRED, or UNSUPPORTED

        Return JSON with groundedness_score between 0.0 and 1.0.
        """

    COMPLETENESS_PROMPT = """
        Question:
        {question}

        Gold Answer:
        {gold_answer}

        Response:
        {response}

        Identify missing aspects, incorrect claims, and coverage.
        Return JSON with completeness_score and correctness_score.
        """

    CROSS_REPO_PROMPT = """
        Repositories:
        {repositories}

        Contexts:
        {contexts}

        Question:
        {question}

        Response:
        {response}

        Evaluate attribution accuracy and cross repository integration.
        Return JSON with overall_integration_score.
        """

    def __init__(self, model: str = "gpt-4o"):
        self.client = OpenAI()
        self.model = model

    def _run(self, prompt: str) -> Dict:
        result = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0,  # judging should be as deterministic as possible
        )
        return json.loads(result.choices[0].message.content)

    def evaluate_groundedness(self, question: str, response: str, context: str) -> Dict:
        return self._run(
            self.GROUNDEDNESS_PROMPT.format(
                question=question,
                response=response,
                context=context,
            )
        )

    def evaluate_completeness(self, question: str, response: str, gold_answer: str) -> Dict:
        return self._run(
            self.COMPLETENESS_PROMPT.format(
                question=question,
                response=response,
                gold_answer=gold_answer,
            )
        )

    def evaluate_cross_repo_integration(
        self,
        question: str,
        response: str,
        repository_contexts: Dict[str, str],
    ) -> Dict:
        contexts = "\n\n".join(
            f"{repo}\n{ctx}" for repo, ctx in repository_contexts.items()
        )
        return self._run(
            self.CROSS_REPO_PROMPT.format(
                repositories=", ".join(repository_contexts.keys()),
                contexts=contexts,
                question=question,
                response=response,
            )
        )

Why this works in practice

The judge does not score style or confidence.

It scores evidence, omissions, and attribution.

It enforces the same discipline a senior engineer applies during design review.
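One way to turn the judge's per-claim verdicts into the single groundedness_score the prompt asks for is a weighted average (a hypothetical weighting scheme, not something the prompt itself enforces):

```python
def score_verdicts(verdicts: list[str]) -> float:
    """Collapse per-claim verdicts into a 0.0-1.0 groundedness score.
    SUPPORTED counts fully, INFERRED half, UNSUPPORTED not at all."""
    if not verdicts:
        return 0.0
    weights = {"SUPPORTED": 1.0, "INFERRED": 0.5, "UNSUPPORTED": 0.0}
    return sum(weights.get(v, 0.0) for v in verdicts) / len(verdicts)

print(score_verdicts(["SUPPORTED", "SUPPORTED", "INFERRED", "UNSUPPORTED"]))  # 0.625
```

Computing the score outside the judge keeps the LLM's job narrow (classify claims) and makes the final number auditable.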


Part 3: Combined generation evaluator

The real value comes from combining execution based evaluation and LLM based judgment under one evaluator.

class GraphRAGGenerationEvaluator:
    """
    End to end generation evaluation for GraphRAG.
    """

    def __init__(self, graphrag_system, gold_dataset, repository_mocks):
        self.system = graphrag_system
        self.gold = gold_dataset
        self.code_eval = CodeExecutionEvaluator(repository_mocks)
        self.judge = LLMJudge()

    def evaluate_all(self) -> Dict:
        results = []

        for item in self.gold:
            response = self.system.query(item["query_text"])

            if item.get("expects_code"):
                results.append(self._eval_code(item, response))
            else:
                results.append(self._eval_explanation(item, response))

        return self._aggregate(results)

    def _eval_code(self, item, response) -> Dict:
        exec_result = self.code_eval.evaluate(
            response.generated_code,
            item["test_cases"],
        )

        grounded = self.judge.evaluate_groundedness(
            item["query_text"],
            response.generated_code,
            response.retrieved_context,
        )

        return {
            "query_id": item["query_id"],
            "task": "code",
            "pass_at_1": 1.0 if exec_result.passed else 0.0,
            "groundedness": grounded["groundedness_score"],
            "error": exec_result.error_type,
        }

    def _eval_explanation(self, item, response) -> Dict:
        grounded = self.judge.evaluate_groundedness(
            item["query_text"],
            response.answer,
            response.retrieved_context,
        )

        completeness = None
        if "gold_answer" in item:
            completeness = self.judge.evaluate_completeness(
                item["query_text"],
                response.answer,
                item["gold_answer"],
            )

        integration = None
        if item["query_type"] in ["cross_repo_concept", "dependency_chain"]:
            integration = self.judge.evaluate_cross_repo_integration(
                item["query_text"],
                response.answer,
                response.repository_contexts,
            )

        return {
            "query_id": item["query_id"],
            "task": "explanation",
            "groundedness": grounded["groundedness_score"],
            "completeness": completeness.get("completeness_score") if completeness else None,
            "integration": integration.get("overall_integration_score") if integration else None,
        }

    def _aggregate(self, results: List[Dict]) -> Dict:
        """Mean of each metric over the results that report it."""
        def mean_of(key: str) -> Optional[float]:
            values = [r[key] for r in results if r.get(key) is not None]
            return sum(values) / len(values) if values else None

        return {
            "num_queries": len(results),
            "pass_at_1": mean_of("pass_at_1"),
            "groundedness": mean_of("groundedness"),
            "completeness": mean_of("completeness"),
            "integration": mean_of("integration"),
            "per_query": results,
        }

What this gives you

After this stage, you can measure generation quality in a way that actually reflects developer reality:

- Does generated code run against your real interfaces?
- Are explanations grounded in the retrieved context?
- Do answers integrate information correctly across repositories?

This is where GraphRAG systems either earn trust or lose it permanently.

In the next blog, we will connect retrieval and generation into a single end to end system pass and show how small retrieval errors cascade into large generation failures across repositories.