Evaluating Generation and Grounding in Multi-Repo Systems

# Generation Evaluation with Execution and LLM as Judge

Once retrieval is done, the GraphRAG system has one final job. Turn retrieved context into an answer that engineers can trust.

This blog focuses **only** on evaluating the generation layer. Not embeddings. Not graph traversal. Not ranking. Just the final output and whether it deserves to exist in a developer workflow.

This is Blog 6 in the series and it builds directly on the evaluation dataset and retrieval evaluation defined earlier.

---

## Why generation evaluation is different in GraphRAG

In classic RAG systems, generation quality is often judged by how fluent or similar the text looks compared to a reference. That breaks down immediately in GraphRAG.

**A GraphRAG system is expected to:**

- Generate code that compiles against internal APIs
- Respect cross repository contracts
- Avoid inventing fields, flags, or behaviors
- Correctly explain interactions across repositories

Metrics like BLEU, ROUGE, or embedding similarity cannot measure any of this.

**Generation evaluation in GraphRAG must answer two concrete questions.**

- Does the output actually work
- Is the output faithful to the retrieved context

**Those questions require two complementary techniques.**

- Execution based evaluation for code
- LLM as Judge evaluation for explanations and reasoning

Either one alone is insufficient.

---

## Part 1: Execution based evaluation for generated code

When the output is code, there is a single source of truth. **The runtime.**

If the generated code fails to execute against your expected interfaces, the answer is wrong regardless of how confident it sounds.

### Design goals for execution evaluation

A correct execution evaluator must:

- Run code in isolation
- Mock cross repository dependencies deterministically
- Disable network access
- Apply memory and time limits
- Return structured failure reasons

The goal is not fuzz testing. The goal is verifying correctness against known contracts.
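The last design goal, structured failure reasons, is the easiest one to skip and the one that makes results actionable. As a rough sketch, the captured output of a failed run could be mapped to a small set of categories. The `FailureReason` type, the `classify_failure` function, and the category names below are hypothetical and not part of the evaluator in this series; the patterns only match standard Python exception names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureReason:
    category: str          # hypothetical label, for example "import_error"
    detail: Optional[str]  # the matching text from the captured output, if any


# Ordered (pattern, category) pairs. The first pattern that matches wins,
# so more specific causes are listed before more generic ones.
# Assumption: AttributeError and TypeError usually mean the generated code
# used a field or signature that the mocked contract does not provide.
_PATTERNS = [
    (re.compile(r"(SyntaxError|IndentationError): .*"), "syntax_error"),
    (re.compile(r"(ModuleNotFoundError|ImportError): .*"), "import_error"),
    (re.compile(r"(AttributeError|TypeError): .*"), "contract_violation"),
    (re.compile(r"AssertionError.*"), "assertion_failure"),
    (re.compile(r"MemoryError.*"), "resource_limit"),
]


def classify_failure(output: str) -> FailureReason:
    """Map raw stdout or stderr from a failed run to a coarse category."""
    for pattern, category in _PATTERNS:
        match = pattern.search(output)
        if match:
            return FailureReason(category, match.group(0).strip())
    return FailureReason("unknown", None)


if __name__ == "__main__":
    sample = "E   AttributeError: 'Client' object has no attribute 'retry_flag'"
    print(classify_failure(sample))
```

Grouping failures this way lets you separate "the model invented a field" from "the model got the logic wrong", which usually call for different fixes upstream.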
---

### Execution sandbox implementation

The implementation below executes generated code inside a Docker container with mocked cross repository imports.

```python
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional

import docker


@dataclass
class ExecutionResult:
    passed: bool
    stdout: str
    stderr: str
    execution_time: float
    error_type: Optional[str] = None


class CodeExecutionEvaluator:
    """
    Executes generated code inside a sandboxed container
    with mocked cross repository dependencies.
    """

    def __init__(self, repository_mocks: Dict[str, str]):
        """
        repository_mocks maps dotted import paths to the source
        code of the mock module served at that path.
        """
        self.mocks = repository_mocks
        self.docker = docker.from_env()

    def evaluate(
        self,
        generated_code: str,
        test_cases: List[Dict],
        timeout: int = 30,
    ) -> ExecutionResult:
        with tempfile.TemporaryDirectory() as tmpdir:
            self._write_mocks(tmpdir)
            self._write_code(tmpdir, generated_code, test_cases)
            return self._run_container(tmpdir, timeout)

    def _write_mocks(self, tmpdir: str):
        base = Path(tmpdir)
        for import_path, mock_code in self.mocks.items():
            parts = import_path.split(".")
            # Create parent packages only. The last part is the module file;
            # a directory with the same name would shadow the mock on import.
            for i in range(len(parts) - 1):
                pkg = base.joinpath(*parts[: i + 1])
                pkg.mkdir(parents=True, exist_ok=True)
                (pkg / "__init__.py").touch(exist_ok=True)
            file_path = base.joinpath(*parts).with_suffix(".py")
            file_path.write_text(mock_code)

    def _write_code(
        self,
        tmpdir: str,
        generated_code: str,
        test_cases: List[Dict],
    ):
        Path(tmpdir, "generated.py").write_text(generated_code)
        test_code = self._create_test_harness(test_cases)
        Path(tmpdir, "test_generated.py").write_text(test_code)

    def _create_test_harness(self, test_cases: List[Dict]) -> str:
        lines = ["import generated", ""]
        for i, case in enumerate(test_cases):
            lines.append(f"def test_case_{i}():")
            # Assumption: each test case carries its check as a single
            # Python statement under the "assertion" key.
            lines.append(f"    {case['assertion']}")
            lines.append("")
        return "\n".join(lines)

    def _run_container(self, code_dir: str, timeout: int) -> ExecutionResult:
        start = time.monotonic()
        container = None
        try:
            # Assumption: the image has pytest installed. The stock
            # python:3.11-slim image does not, and the network is disabled,
            # so a derived image is needed in practice.
            container = self.docker.containers.run(
                image="python:3.11-slim",
                command="python -m pytest test_generated.py -q",
                volumes={code_dir: {"bind": "/code", "mode": "rw"}},
                working_dir="/code",
                mem_limit="512m",
                network_disabled=True,
                detach=True,
            )
            # containers.run() takes no timeout argument, so run detached
            # and enforce the time limit while waiting for the exit status.
            status = container.wait(timeout=timeout)
            passed = status["StatusCode"] == 0
            return ExecutionResult(
                passed=passed,
                stdout=container.logs(stdout=True, stderr=False).decode(),
                stderr=container.logs(stdout=False, stderr=True).decode(),
                execution_time=time.monotonic() - start,
                error_type=None if passed else "test_failure",
            )
        except Exception as e:
            # Covers timeouts, a missing image, and Docker daemon errors.
            return ExecutionResult(
                passed=False,
                stdout="",
                stderr=str(e),
                execution_time=time.monotonic() - start,
                error_type="execution_error",
            )
        finally:
            if container is not None:
                container.remove(force=True)
```

### What this evaluator measures

This evaluator answers one precise question. Given correct retrieval, did the model generate code that works in our environment.

It is equivalent to pass@1 but scoped to your organization, your repositories, and your dependency graph.

---

## Part 2: LLM as Judge for non code outputs

Many GraphRAG queries do not produce code. They produce explanations, a

