Generation Evaluation with Execution and LLM as Judge
Once retrieval is done, the GraphRAG system has one final job.
Turn retrieved context into an answer that engineers can trust.
This blog focuses only on evaluating the generation layer. Not embeddings. Not graph traversal. Not ranking. Just the final output and whether it deserves to exist in a developer workflow.
This is Blog 6 in the series and it builds directly on the evaluation dataset and retrieval evaluation defined earlier.
Why generation evaluation is different in GraphRAG
In classic RAG systems, generation quality is often judged by how fluent or similar the text looks compared to a reference.
That breaks down immediately in GraphRAG.
A GraphRAG system is expected to:
- Generate code that compiles against internal APIs
- Respect cross repository contracts
- Avoid inventing fields, flags, or behaviors
- Correctly explain interactions across repositories
Metrics like BLEU, ROUGE, or embedding similarity cannot measure any of this.
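To make this concrete, here is a small sketch. It compares a correct call against one that invents a flag, first with a text similarity score and then by actually running both against a mock. The `PaymentClient` mock, its `charge` signature, and the `retry` flag are hypothetical names chosen for illustration, and `difflib` stands in for heavier similarity metrics.

```python
from difflib import SequenceMatcher


class PaymentClient:
    """Hypothetical mock of an internal API: only amount and currency exist."""

    def charge(self, amount: float, currency: str) -> dict:
        return {"status": "ok", "amount": amount, "currency": currency}


reference = "client.charge(amount=total, currency='USD')"
# Same call, plus a flag the API does not have.
generated = "client.charge(amount=total, currency='USD', retry=True)"

# Character-level similarity: the two strings differ only by the invented flag.
similarity = SequenceMatcher(None, reference, generated).ratio()


def runs(snippet: str) -> bool:
    """Evaluate the call against the mock and report whether it is accepted."""
    try:
        eval(snippet, {"client": PaymentClient(), "total": 10.0})
        return True
    except TypeError:
        # Raised when the call does not match the mock's signature.
        return False


reference_runs = runs(reference)
generated_runs = runs(generated)
print(f"similarity={similarity:.2f} reference_runs={reference_runs} generated_runs={generated_runs}")
```

The similarity score stays high because the strings are nearly identical, yet only one of the two calls survives contact with the interface.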
Generation evaluation in GraphRAG must answer two concrete questions.
- Does the output actually work?
- Is the output faithful to the retrieved context?
Those questions require two complementary techniques.
- Execution based evaluation for code
- LLM as Judge evaluation for explanations and reasoning
Either one alone is insufficient.
Part 1: Execution based evaluation for generated code
When the output is code, there is a single source of truth.
The runtime.
If the generated code fails to execute against your expected interfaces, the answer is wrong regardless of how confident it sounds.
Design goals for execution evaluation
A correct execution evaluator must:
- Run code in isolation
- Mock cross repository dependencies deterministically
- Disable network access
- Apply memory and time limits
- Return structured failure reasons
The goal is not fuzz testing. The goal is verifying correctness against known contracts.
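As one way to approach the "structured failure reasons" goal, the sketch below maps raw test output to a short label by looking for Python exception names. The label names and the pattern list are assumptions made for illustration; they are not part of any library, and a real evaluator would likely need more categories.

```python
import re
from typing import Optional

# Ordered list of (label, pattern). Labels are hypothetical names for this sketch.
_FAILURE_PATTERNS = [
    ("missing_import", re.compile(r"\b(ModuleNotFoundError|ImportError)\b")),
    ("unknown_attribute", re.compile(r"\bAttributeError\b")),
    ("bad_call_signature", re.compile(r"\bTypeError\b")),
    ("syntax_error", re.compile(r"\b(SyntaxError|IndentationError)\b")),
    ("assertion_failed", re.compile(r"\bAssertionError\b")),
]


def classify_failure(output: str) -> Optional[str]:
    """Return the first matching failure label, or None if nothing matches."""
    for label, pattern in _FAILURE_PATTERNS:
        if pattern.search(output):
            return label
    return None


sample = "E   AttributeError: 'PaymentClient' object has no attribute 'refund_all'"
print(classify_failure(sample))
```

Labels such as `unknown_attribute` and `bad_call_signature` are the ones most likely to point at invented APIs, which makes them worth tracking separately from plain assertion failures.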
Execution sandbox implementation
The implementation below executes generated code inside a Docker container with mocked cross repository imports.

    import tempfile
    import time
    from dataclasses import dataclass
    from pathlib import Path
    from typing import Dict, List, Optional

    import docker
    import requests


    @dataclass
    class ExecutionResult:
        passed: bool
        stdout: str
        stderr: str
        execution_time: float
        error_type: Optional[str] = None


    class CodeExecutionEvaluator:
        """
        Executes generated code inside a sandboxed container with
        mocked cross repository dependencies.
        """

        def __init__(self, repository_mocks: Dict[str, str], image: str = "python:3.11-slim"):
            """
            repository_mocks maps import paths to mock implementations.
            Example:
                {
                    "payment_service.client": "class PaymentClient: ..."
                }

            The image must already contain pytest. The stock python:3.11-slim
            image does not, and with networking disabled it cannot be
            installed at run time.
            """
            self.mocks = repository_mocks
            self.image = image
            self.docker = docker.from_env()

        def evaluate(
            self,
            generated_code: str,
            test_cases: List[Dict],
            timeout: int = 30
        ) -> ExecutionResult:
            with tempfile.TemporaryDirectory() as tmpdir:
                self._write_mocks(tmpdir)
                self._write_code(tmpdir, generated_code, test_cases)
                return self._run_container(tmpdir, timeout)

        def _write_mocks(self, tmpdir: str):
            for import_path, mock_code in self.mocks.items():
                parts = import_path.split(".")
                base = Path(tmpdir)
                # Create parent packages only. A directory named after the last
                # part would shadow the mock module file written below.
                for i in range(len(parts) - 1):
                    pkg = base.joinpath(*parts[: i + 1])
                    pkg.mkdir(parents=True, exist_ok=True)
                    (pkg / "__init__.py").touch(exist_ok=True)
                file_path = base.joinpath(*parts).with_suffix(".py")
                file_path.write_text(mock_code)

        def _write_code(
            self,
            tmpdir: str,
            generated_code: str,
            test_cases: List[Dict]
        ):
            Path(tmpdir, "generated.py").write_text(generated_code)
            test_code = self._create_test_harness(test_cases)
            Path(tmpdir, "test_generated.py").write_text(test_code)

        def _create_test_harness(self, test_cases: List[Dict]) -> str:
            lines = [
                "import generated",
                "",
            ]
            # Each test case contributes one pytest function with one assertion.
            for i, case in enumerate(test_cases):
                lines.append(f"def test_case_{i}():")
                lines.append(f"    {case['assertion']}")
                lines.append("")
            return "\n".join(lines)

        def _run_container(self, code_dir: str, timeout: int) -> ExecutionResult:
            start = time.monotonic()
            container = None
            try:
                # containers.run() has no timeout argument, so start detached
                # and enforce the limit through wait().
                container = self.docker.containers.run(
                    image=self.image,
                    command="python -m pytest test_generated.py -q -p no:cacheprovider",
                    volumes={code_dir: {"bind": "/code", "mode": "ro"}},
                    working_dir="/code",
                    mem_limit="512m",
                    network_disabled=True,
                    detach=True,
                )
                exit_code = container.wait(timeout=timeout)["StatusCode"]
                stdout = container.logs(stdout=True, stderr=False).decode()
                stderr = container.logs(stdout=False, stderr=True).decode()
                if exit_code == 0:
                    error_type = None
                elif exit_code == 1:
                    # pytest exit code 1: tests ran and at least one failed.
                    error_type = "test_failure"
                else:
                    # Collection errors, crashes, or a container killed for memory.
                    error_type = "execution_error"
                return ExecutionResult(
                    passed=exit_code == 0,
                    stdout=stdout,
                    stderr=stderr,
                    execution_time=time.monotonic() - start,
                    error_type=error_type,
                )
            except (requests.exceptions.ReadTimeout, requests.exceptions.ConnectionError):
                # wait() gave up before the container finished.
                return ExecutionResult(
                    passed=False,
                    stdout="",
                    stderr=f"timed out after {timeout}s",
                    execution_time=time.monotonic() - start,
                    error_type="timeout",
                )
            except Exception as e:
                return ExecutionResult(
                    passed=False,
                    stdout="",
                    stderr=str(e),
                    execution_time=time.monotonic() - start,
                    error_type="execution_error",
                )
            finally:
                # Always clean up, including containers still running after a timeout.
                if container is not None:
                    container.remove(force=True)

What this evaluator measures
This evaluator answers one precise question.
Given correct retrieval, did the model generate code that works in our environment?
It is equivalent to pass@1, but scoped to your organization, your repositories, and your dependency graph.
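With one generation per query, pass@1 reduces to the fraction of queries whose generated code passed. The sketch below computes that number together with a breakdown of failure types. `Outcome` is a hypothetical stand-in that carries only the `passed` and `error_type` fields of `ExecutionResult`, so the example can run without Docker.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Outcome:
    """Hypothetical stand-in for the two ExecutionResult fields used here."""
    passed: bool
    error_type: Optional[str] = None


def summarize(outcomes: List[Outcome]) -> Dict:
    """pass@1 with a single sample per query, plus failure counts by type."""
    if not outcomes:
        return {"pass_at_1": 0.0, "failures": {}}
    passed = sum(1 for o in outcomes if o.passed)
    failures = Counter(o.error_type for o in outcomes if not o.passed)
    return {"pass_at_1": passed / len(outcomes), "failures": dict(failures)}


report = summarize([
    Outcome(True),
    Outcome(False, "test_failure"),
    Outcome(False, "execution_error"),
    Outcome(True),
])
print(report)
```

Keeping the failure breakdown next to the headline number helps separate "the code ran but was wrong" from "the code never ran at all".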
Part 2: LLM as Judge for non code outputs
Many GraphRAG queries do not produce code.
They produce explanations, architectural reasoning, migration guidance, or dependency analysis.
These outputs cannot be executed. They must be judged.
This is where LLM as Judge is appropriate when used carefully and with structure.
What LLM as Judge should evaluate
Only things that humans would normally evaluate during review:
- Groundedness to retrieved context
- Completeness relative to an expert answer
- Correct integration across repositories
It should never replace execution based evaluation.
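Using a judge "with structure" also means treating its reply as untrusted model output. A minimal sketch of that idea follows: it checks that a score field exists, is numeric, and lies in the expected range before anything downstream consumes it. The helper name `read_score` is hypothetical, and the key `groundedness_score` is used only because the prompts later in this post ask for it.

```python
import math
from typing import Any, Dict


def read_score(judge_output: Dict[str, Any], key: str) -> float:
    """Return judge_output[key] as a float in [0.0, 1.0], or raise ValueError."""
    if key not in judge_output:
        raise ValueError(f"judge output has no '{key}' field")
    value = judge_output[key]
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"'{key}' must be a number, got {value!r}")
    if math.isnan(value) or not 0.0 <= value <= 1.0:
        raise ValueError(f"'{key}' must be between 0.0 and 1.0, got {value!r}")
    return float(value)


score = read_score({"groundedness_score": 0.8}, "groundedness_score")
print(score)
```

Failing loudly on a malformed reply is a design choice: a silently defaulted score would blend judge errors into the metric itself.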
LLM Judge implementation

    import json
    from typing import Dict

    from openai import OpenAI


    class LLMJudge:
        """
        Structured evaluation of generation quality.
        """

        GROUNDEDNESS_PROMPT = """
    You are evaluating whether an AI assistant response is grounded in the provided context.
    Context:
    {context}
    Question:
    {question}
    Response:
    {response}
    For each factual claim:
    1. Identify the claim
    2. Quote supporting evidence or return null
    3. Rate SUPPORTED INFERRED or UNSUPPORTED
    Return JSON with groundedness_score between 0.0 and 1.0.
    """

        COMPLETENESS_PROMPT = """
    Question:
    {question}
    Gold Answer:
    {gold_answer}
    Response:
    {response}
    Identify missing aspects, incorrect claims, and coverage.
    Return JSON with completeness_score and correctness_score.
    """

        CROSS_REPO_PROMPT = """
    Repositories:
    {repositories}
    Contexts:
    {contexts}
    Question:
    {question}
    Response:
    {response}
    Evaluate attribution accuracy and cross repository integration.
    Return JSON with overall_integration_score.
    """

        def __init__(self, model: str = "gpt-4o"):
            self.client = OpenAI()
            self.model = model

        def _run(self, prompt: str) -> Dict:
            result = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                response_format={"type": "json_object"},
            )
            return json.loads(result.choices[0].message.content)

        def evaluate_groundedness(self, question: str, response: str, context: str) -> Dict:
            return self._run(
                self.GROUNDEDNESS_PROMPT.format(
                    question=question,
                    response=response,
                    context=context,
                )
            )

        def evaluate_completeness(self, question: str, response: str, gold_answer: str) -> Dict:
            return self._run(
                self.COMPLETENESS_PROMPT.format(
                    question=question,
                    response=response,
                    gold_answer=gold_answer,
                )
            )

        def evaluate_cross_repo_integration(
            self,
            question: str,
            response: str,
            repository_contexts: Dict[str, str],
        ) -> Dict:
            contexts = "\n\n".join(
                f"{repo}\n{ctx}" for repo, ctx in repository_contexts.items()
            )
            return self._run(
                self.CROSS_REPO_PROMPT.format(
                    repositories=list(repository_contexts.keys()),
                    contexts=contexts,
                    question=question,
                    response=response,
                )
            )

Why this works in practice
The judge does not score style or confidence.
It scores evidence, omissions, and attribution.
It enforces the same discipline a senior engineer applies during design review.
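One way to hold the judge to that discipline is to recompute the groundedness score from its per-claim verdicts instead of trusting the single number it reports. The sketch below also downgrades any SUPPORTED claim whose quoted evidence does not literally appear in the context. The `claims`, `rating`, and `evidence` field names and the 1.0 / 0.5 / 0.0 weights are assumptions for illustration; the prompts above do not fix an exact JSON shape.

```python
from typing import Dict, List

# Assumed weights for this sketch; tune them to your own review standards.
RATING_WEIGHTS = {"SUPPORTED": 1.0, "INFERRED": 0.5, "UNSUPPORTED": 0.0}


def recompute_groundedness(judge_output: Dict, context: str) -> float:
    """Average per-claim weights, verifying quoted evidence against the context."""
    claims: List[Dict] = judge_output.get("claims", [])
    if not claims:
        return 0.0
    total = 0.0
    for claim in claims:
        rating = str(claim.get("rating", "UNSUPPORTED")).upper()
        evidence = claim.get("evidence")
        # A SUPPORTED claim must quote text that really occurs in the context.
        if rating == "SUPPORTED" and (not evidence or evidence not in context):
            rating = "UNSUPPORTED"
        total += RATING_WEIGHTS.get(rating, 0.0)
    return total / len(claims)


context = "PaymentClient.charge accepts amount and currency."
judge_output = {
    "claims": [
        {"claim": "charge takes amount and currency",
         "evidence": "charge accepts amount and currency", "rating": "SUPPORTED"},
        {"claim": "charge retries automatically",
         "evidence": "charge retries three times", "rating": "SUPPORTED"},
        {"claim": "currency is a string", "evidence": None, "rating": "INFERRED"},
    ]
}
score = recompute_groundedness(judge_output, context)
print(score)
```

The exact substring check is deliberately strict. It will reject paraphrased quotes, so a looser match may be preferable if the judge tends to normalize whitespace or punctuation.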
Part 3: Combined generation evaluator
The real value comes from combining execution based evaluation and LLM based judgment under one evaluator.

    from typing import Dict, List


    class GraphRAGGenerationEvaluator:
        """
        End to end generation evaluation for GraphRAG.
        """

        def __init__(self, graphrag_system, gold_dataset, repository_mocks):
            self.system = graphrag_system
            self.gold = gold_dataset
            self.code_eval = CodeExecutionEvaluator(repository_mocks)
            self.judge = LLMJudge()

        def evaluate_all(self) -> Dict:
            results = []
            for item in self.gold:
                response = self.system.query(item["query_text"])
                # Code answers are executed; everything else goes to the judge.
                if item.get("expects_code"):
                    results.append(self._eval_code(item, response))
                else:
                    results.append(self._eval_explanation(item, response))
            return self._aggregate(results)

        def _eval_code(self, item, response) -> Dict:
            exec_result = self.code_eval.evaluate(
                response.generated_code,
                item["test_cases"],
            )
            grounded = self.judge.evaluate_groundedness(
                item["query_text"],
                response.generated_code,
                response.retrieved_context,
            )
            return {
                "query_id": item["query_id"],
                "task": "code",
                "pass_at_1": 1.0 if exec_result.passed else 0.0,
                "groundedness": grounded["groundedness_score"],
                "error": exec_result.error_type,
            }

        def _eval_explanation(self, item, response) -> Dict:
            grounded = self.judge.evaluate_groundedness(
                item["query_text"],
                response.answer,
                response.retrieved_context,
            )
            completeness = None
            if "gold_answer" in item:
                completeness = self.judge.evaluate_completeness(
                    item["query_text"],
                    response.answer,
                    item["gold_answer"],
                )
            integration = None
            if item["query_type"] in ["cross_repo_concept", "dependency_chain"]:
                integration = self.judge.evaluate_cross_repo_integration(
                    item["query_text"],
                    response.answer,
                    response.repository_contexts,
                )
            return {
                "query_id": item["query_id"],
                "task": "explanation",
                "groundedness": grounded["groundedness_score"],
                "completeness": completeness.get("completeness_score") if completeness else None,
                "integration": integration.get("overall_integration_score") if integration else None,
            }

        def _aggregate(self, results: List[Dict]) -> Dict:
            # A minimal aggregation: the mean of each metric over the queries
            # where it was scored. Metrics that were never scored stay None.
            summary = {"num_queries": len(results)}
            for metric in ("pass_at_1", "groundedness", "completeness", "integration"):
                values = [r[metric] for r in results if r.get(metric) is not None]
                summary[metric] = sum(values) / len(values) if values else None
            return summary

What this gives you
After this stage, you can measure generation quality in a way that actually reflects developer reality.
- Does generated code run?
- Is the model hallucinating APIs?
- Is cross repository reasoning correct?
- Is the output faithful to retrieved context?
This is where GraphRAG systems either earn trust or lose it permanently.
In the next blog, we will connect retrieval and generation into a single end to end system pass and show how small retrieval errors cascade into large generation failures across repositories.
