Retrieving nodes is only half the battle; the LLM must synthesize code that adheres to cross-repo constraints. This post explores measuring faithfulness, checking execution-level correctness against internal SDKs, and using LLM-as-a-Judge to verify that generated code respects the security and type contracts of separate repositories.

Once retrieval is done, the GraphRAG system has one final job.
Turn retrieved context into an answer that engineers can trust.
This blog focuses only on evaluating the generation layer. Not embeddings. Not graph traversal. Not ranking. Just the final output and whether it deserves to exist in a developer workflow.
This is Blog 6 in the series and it builds directly on the evaluation dataset and retrieval evaluation defined earlier.
In classic RAG systems, generation quality is often judged by how fluent the text is or how similar it looks to a reference.
That breaks down immediately in GraphRAG.
A GraphRAG system is expected to:
- generate code that runs against interfaces defined in other repositories
- respect security and type contracts that live outside the file being edited
- ground every claim in the retrieved graph context and attribute it to the right repository
Metrics like BLEU, ROUGE, or embedding similarity cannot measure any of this.
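To see why, compare two near-identical answers that call different cross-repo methods (the client and method names here are hypothetical). Surface similarity stays high even though the behavior is completely different:

```python
from difflib import SequenceMatcher

# Two generated answers that differ only in which cross-repo method
# they call (PaymentClient and its methods are illustrative names).
reference = "client = PaymentClient()\nresult = client.charge(amount)"
generated = "client = PaymentClient()\nresult = client.refund(amount)"

# Character-level similarity, the kind of signal BLEU/ROUGE-style
# metrics reward, scores these as nearly identical.
similarity = SequenceMatcher(None, reference, generated).ratio()
print(similarity)
```

A string-similarity metric happily scores the refund call as a near-perfect match for the charge call; only execution or contract-aware judgment catches the difference.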
Generation evaluation in GraphRAG must answer two concrete questions. Does generated code actually execute against the organization's real interfaces? And are non-code answers grounded, complete, and correctly attributed across repositories?
Those questions require two complementary techniques: execution-based evaluation for code, and structured LLM-as-a-Judge review for everything else.
Either one alone is insufficient.
When the output is code, there is a single source of truth.
The runtime.
If the generated code fails to execute against your expected interfaces, the answer is wrong regardless of how confident it sounds.
A correct execution evaluator must:
- run generated code in an isolated sandbox with no network access
- substitute mocked implementations for cross-repository imports
- execute assertion-based test cases against the generated module
- enforce time and memory limits
The goal is not fuzz testing. The goal is verifying correctness against known contracts.
The implementation below executes generated code inside a Docker container with mocked cross-repository imports.

```python
import tempfile
import time
import docker
from pathlib import Path
from typing import Dict, List, Optional
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    passed: bool
    stdout: str
    stderr: str
    execution_time: float
    error_type: Optional[str] = None


class CodeExecutionEvaluator:
    """
    Executes generated code inside a sandboxed container with
    mocked cross-repository dependencies.
    """

    def __init__(self, repository_mocks: Dict[str, str]):
        """
        repository_mocks maps import paths to mock implementations.
        Example:
            {
                "payment_service.client": "class PaymentClient: ..."
            }
        """
        self.mocks = repository_mocks
        self.docker = docker.from_env()

    def evaluate(
        self,
        generated_code: str,
        test_cases: List[Dict],
        timeout: int = 30
    ) -> ExecutionResult:
        with tempfile.TemporaryDirectory() as tmpdir:
            self._write_mocks(tmpdir)
            self._write_code(tmpdir, generated_code, test_cases)
            return self._run_container(tmpdir, timeout)

    def _write_mocks(self, tmpdir: str):
        for import_path, mock_code in self.mocks.items():
            parts = import_path.split(".")
            base = Path(tmpdir)
            # Create package directories for every segment except the
            # last one, which becomes a module file. Creating a directory
            # for the last segment too would shadow the mock module.
            for i in range(len(parts) - 1):
                pkg = base.joinpath(*parts[: i + 1])
                pkg.mkdir(parents=True, exist_ok=True)
                (pkg / "__init__.py").touch(exist_ok=True)
            file_path = base.joinpath(*parts).with_suffix(".py")
            file_path.write_text(mock_code)

    def _write_code(
        self,
        tmpdir: str,
        generated_code: str,
        test_cases: List[Dict]
    ):
        Path(tmpdir, "generated.py").write_text(generated_code)
        test_code = self._create_test_harness(test_cases)
        Path(tmpdir, "test_generated.py").write_text(test_code)

    def _create_test_harness(self, test_cases: List[Dict]) -> str:
        # One pytest function per gold assertion.
        lines = [
            "import generated",
            "",
        ]
        for i, case in enumerate(test_cases):
            lines.append(f"def test_case_{i}():")
            lines.append(f"    {case['assertion']}")
            lines.append("")
        return "\n".join(lines)

    def _run_container(self, code_dir: str, timeout: int) -> ExecutionResult:
        start = time.perf_counter()
        try:
            # docker-py's containers.run has no timeout argument, so the
            # limit is enforced inside the container via coreutils timeout.
            # Note: python:3.11-slim does not ship pytest; in practice bake
            # it into a custom image.
            output = self.docker.containers.run(
                image="python:3.11-slim",
                command=(
                    f"timeout {timeout} python -m pytest "
                    "test_generated.py -q -p no:cacheprovider"
                ),
                volumes={code_dir: {"bind": "/code", "mode": "ro"}},
                working_dir="/code",
                remove=True,
                mem_limit="512m",
                network_disabled=True,
            )
            return ExecutionResult(
                passed=True,
                stdout=output.decode(),
                stderr="",
                execution_time=time.perf_counter() - start,
            )
        except docker.errors.ContainerError as e:
            # ContainerError exposes stderr only; docker-py does not
            # carry stdout for failed runs.
            stderr = e.stderr.decode() if isinstance(e.stderr, bytes) else str(e.stderr or "")
            return ExecutionResult(
                passed=False,
                stdout="",
                stderr=stderr,
                execution_time=time.perf_counter() - start,
                error_type="test_failure",
            )
        except Exception as e:
            return ExecutionResult(
                passed=False,
                stdout="",
                stderr=str(e),
                execution_time=time.perf_counter() - start,
                error_type="execution_error",
            )
```

This evaluator answers one precise question.
Given correct retrieval, did the model generate code that works in our environment?
It is equivalent to pass@1, but scoped to your organization, your repositories, and your dependency graph.
Many GraphRAG queries do not produce code.
They produce explanations, architectural reasoning, migration guidance, or dependency analysis.
These outputs cannot be executed. They must be judged.
This is where LLM-as-a-Judge is appropriate, when used carefully and with structure.
It should judge only the things that humans would normally evaluate during review:
- groundedness to the retrieved context
- completeness relative to an expert answer
- correct integration across repositories
It should never replace execution-based evaluation.
```python
from openai import OpenAI
import json
from typing import Dict


class LLMJudge:
    """
    Structured evaluation of generation quality.
    """

    GROUNDEDNESS_PROMPT = """
You are evaluating whether an AI assistant response is grounded in the provided context.

Context:
{context}

Question:
{question}

Response:
{response}

For each factual claim:
1. Identify the claim
2. Quote supporting evidence or return null
3. Rate it SUPPORTED, INFERRED, or UNSUPPORTED

Return JSON with groundedness_score between 0.0 and 1.0.
"""

    COMPLETENESS_PROMPT = """
Question:
{question}

Gold Answer:
{gold_answer}

Response:
{response}

Identify missing aspects, incorrect claims, and coverage.
Return JSON with completeness_score and correctness_score.
"""

    CROSS_REPO_PROMPT = """
Repositories:
{repositories}

Contexts:
{contexts}

Question:
{question}

Response:
{response}

Evaluate attribution accuracy and cross-repository integration.
Return JSON with overall_integration_score.
"""

    def __init__(self, model: str = "gpt-4o"):
        self.client = OpenAI()
        self.model = model

    def _run(self, prompt: str) -> Dict:
        # response_format forces the model to emit valid JSON.
        result = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(result.choices[0].message.content)

    def evaluate_groundedness(self, question: str, response: str, context: str) -> Dict:
        return self._run(
            self.GROUNDEDNESS_PROMPT.format(
                question=question,
                response=response,
                context=context,
            )
        )

    def evaluate_completeness(self, question: str, response: str, gold_answer: str) -> Dict:
        return self._run(
            self.COMPLETENESS_PROMPT.format(
                question=question,
                response=response,
                gold_answer=gold_answer,
            )
        )

    def evaluate_cross_repo_integration(
        self,
        question: str,
        response: str,
        repository_contexts: Dict[str, str],
    ) -> Dict:
        contexts = "\n\n".join(
            f"{repo}\n{ctx}" for repo, ctx in repository_contexts.items()
        )
        return self._run(
            self.CROSS_REPO_PROMPT.format(
                repositories=list(repository_contexts.keys()),
                contexts=contexts,
                question=question,
                response=response,
            )
        )
```

The judge does not score style or confidence.
It scores evidence, omissions, and attribution.
It enforces the same discipline a senior engineer applies during design review.
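The judge's JSON can then be gated mechanically. A minimal sketch, assuming the output shape implied by the prompts (a claim list plus `groundedness_score`); the sample claims and the threshold are illustrative:

```python
# Hypothetical judge output; the shape follows the GROUNDEDNESS_PROMPT
# contract (per-claim ratings plus an overall groundedness_score).
judge_result = {
    "claims": [
        {"claim": "OrderService calls PaymentClient.charge",
         "evidence": "def place_order(...): client.charge(...)",
         "rating": "SUPPORTED"},
        {"claim": "Charges are retried three times",
         "evidence": None,
         "rating": "UNSUPPORTED"},
    ],
    "groundedness_score": 0.5,
}

def is_grounded(result, threshold=0.8):
    # Gate the answer on overall groundedness; the 0.8 floor is an
    # illustrative choice, not a recommendation.
    return result["groundedness_score"] >= threshold

print(is_grounded(judge_result))
```

Answers with unsupported claims fall below the floor and get rejected before they reach a developer, exactly as an ungrounded design-review claim would.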
The real value comes from combining execution-based evaluation and LLM-based judgment under one evaluator.
```python
from typing import Dict, List


class GraphRAGGenerationEvaluator:
    """
    End-to-end generation evaluation for GraphRAG.
    """

    def __init__(self, graphrag_system, gold_dataset, repository_mocks):
        self.system = graphrag_system
        self.gold = gold_dataset
        self.code_eval = CodeExecutionEvaluator(repository_mocks)
        self.judge = LLMJudge()

    def evaluate_all(self) -> Dict:
        results = []
        for item in self.gold:
            response = self.system.query(item["query_text"])
            # Route each gold item to the evaluator that fits its output type.
            if item.get("expects_code"):
                results.append(self._eval_code(item, response))
            else:
                results.append(self._eval_explanation(item, response))
        return self._aggregate(results)

    def _eval_code(self, item, response) -> Dict:
        exec_result = self.code_eval.evaluate(
            response.generated_code,
            item["test_cases"],
        )
        grounded = self.judge.evaluate_groundedness(
            item["query_text"],
            response.generated_code,
            response.retrieved_context,
        )
        return {
            "query_id": item["query_id"],
            "task": "code",
            "pass_at_1": 1.0 if exec_result.passed else 0.0,
            "groundedness": grounded["groundedness_score"],
            "error": exec_result.error_type,
        }

    def _eval_explanation(self, item, response) -> Dict:
        grounded = self.judge.evaluate_groundedness(
            item["query_text"],
            response.answer,
            response.retrieved_context,
        )
        completeness = None
        if "gold_answer" in item:
            completeness = self.judge.evaluate_completeness(
                item["query_text"],
                response.answer,
                item["gold_answer"],
            )
        integration = None
        if item["query_type"] in ["cross_repo_concept", "dependency_chain"]:
            integration = self.judge.evaluate_cross_repo_integration(
                item["query_text"],
                response.answer,
                response.repository_contexts,
            )
        return {
            "query_id": item["query_id"],
            "task": "explanation",
            "groundedness": grounded["groundedness_score"],
            "completeness": completeness.get("completeness_score") if completeness else None,
            "integration": integration.get("overall_integration_score") if integration else None,
        }

    def _aggregate(self, results: List[Dict]) -> Dict:
        # Minimal aggregation: mean of each metric, split by task type.
        def mean_of(rows, key):
            vals = [r[key] for r in rows if r.get(key) is not None]
            return sum(vals) / len(vals) if vals else None

        code = [r for r in results if r["task"] == "code"]
        expl = [r for r in results if r["task"] == "explanation"]
        return {
            "code_pass_at_1": mean_of(code, "pass_at_1"),
            "code_groundedness": mean_of(code, "groundedness"),
            "explanation_groundedness": mean_of(expl, "groundedness"),
            "explanation_completeness": mean_of(expl, "completeness"),
            "explanation_integration": mean_of(expl, "integration"),
            "per_query": results,
        }
```

After this stage, you can measure generation quality in a way that actually reflects developer reality.
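In CI, these aggregates can act as a merge gate. A minimal sketch with a hand-written aggregate and hypothetical thresholds (tune the floors against your own baselines):

```python
# Hypothetical aggregate numbers in the shape of the evaluator's output.
aggregate = {
    "code_pass_at_1": 0.85,
    "code_groundedness": 0.92,
    "explanation_groundedness": 0.88,
}

# Illustrative floors; real values should come from your own baselines.
THRESHOLDS = {
    "code_pass_at_1": 0.80,
    "code_groundedness": 0.90,
    "explanation_groundedness": 0.85,
}

# Collect every metric that fell below its floor.
failures = [
    metric for metric, floor in THRESHOLDS.items()
    if aggregate.get(metric) is not None and aggregate[metric] < floor
]
print("PASS" if not failures else f"FAIL: {failures}")
```

A gate like this turns generation quality from a dashboard number into a hard constraint on what ships.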
Does the generated code run? Is the explanation grounded, complete, and correctly attributed?
This is where GraphRAG systems either earn trust or lose it permanently.
In the next blog, we will connect retrieval and generation into a single end-to-end system pass and show how small retrieval errors cascade into large generation failures across repositories.