# Building Your Evaluation Dataset from Organizational Repositories

Getting RAG evaluation right starts with having the right dataset. Not a theoretical one. Not something scraped from public benchmarks. One built from your actual codebase, your real developer questions, and your specific cross-repository patterns.

This is where most teams skip steps and pay for it later. They grab some generic dataset, run evaluations, get numbers that look reasonable, and then wonder why their system fails on production queries. The disconnect happens because public benchmarks don't reflect how your organization's code actually connects.

Let me walk through how to build an evaluation dataset that actually tells you whether your retrieval system works.

## What Makes a Good Evaluation Dataset

Your dataset needs four things to be useful.

**Coverage across query types.** Remember the query taxonomy from earlier: single-repo multi-file, cross-repo single concept, dependency chains, temporal queries? Your evaluation set needs examples of all of them. Heavy coverage of easy queries tells you nothing about where your system breaks.

**Difficulty distribution that matches reality.** Some queries are straightforward. Find a function, return its implementation. Others require tracing through five repositories to understand how a payment flows from frontend to database. Your dataset should reflect this spread.

**Human-verified gold retrievals.** This is non-negotiable. Someone who knows the codebase needs to confirm that yes, these are the files required to answer this question. LLM-generated "gold standards" introduce noise exactly where you need precision.

**Version pinning.** Code changes. If your evaluation queries reference files that have been refactored since annotation, your measurements become meaningless. Pin everything to specific commits.
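The four requirements above can be made concrete as fields on each evaluation record. The sketch below is one possible shape, not a prescribed schema: the `EvalQuery` and `GoldFile` classes, their field names, the string labels in `QUERY_TYPES`, and the two helper functions are all illustrative assumptions. Only the four query types themselves come from the taxonomy.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List

# Labels for the four query types in the taxonomy (the strings are assumed).
QUERY_TYPES = (
    "single_repo_multi_file",
    "cross_repo_single_concept",
    "cross_repo_dependency_chain",
    "cross_repo_temporal",
)


@dataclass(frozen=True)
class GoldFile:
    repo: str
    path: str
    commit: str  # commit SHA the annotation was made against (version pinning)


@dataclass
class EvalQuery:
    text: str
    query_type: str  # one of QUERY_TYPES
    difficulty: str  # assumed scale, e.g. "easy" / "medium" / "hard"
    gold_files: List[GoldFile] = field(default_factory=list)
    verified_by: str = ""  # stays empty until a human reviewer signs off


def coverage_report(queries: List[EvalQuery]) -> Dict[str, int]:
    """Count queries per type, reporting 0 for types with no examples."""
    counts = Counter(q.query_type for q in queries)
    return {query_type: counts[query_type] for query_type in QUERY_TYPES}


def not_ready(queries: List[EvalQuery]) -> List[EvalQuery]:
    """Return queries that lack human verification or a pinned commit."""
    return [
        q
        for q in queries
        if not q.verified_by
        or not q.gold_files
        or any(not g.commit for g in q.gold_files)
    ]
```

Running `coverage_report` over a draft dataset shows at a glance which query types are missing, and `not_ready` lists the records that should not be scored against yet.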
## How Many Queries Do You Actually Need

Here's a rough breakdown that provides statistical reliability without requiring months of annotation work:

| Query Type | Training/Calibration | Evaluation | Total |
|------------|---------------------|------------|-------|
| Single-Repo Multi-File | 50 | 100 | 150 |
| Cross-Repo Single Concept | 30 | 75 | 105 |
| Cross-Repo Dependency Chain | 20 | 50 | 70 |
| Cross-Repo Temporal | 10 | 25 | 35 |
| **Total** | **110** | **250** | **360** |

Start with 250 evaluation queries minimum. Studies like CodeRAG-Bench use 100-500 queries per task type to get reliable signal. Below that threshold, random variation dominates your metrics.

The distribution skews toward easier queries intentionally. You need enough hard queries to measure edge cases, but not so many that annotation becomes prohibitively expensive.

## Mining Real Questions from Your Development History

The best evaluation queries come from questions developers actually asked. They're naturally phrased, they reflect real information needs, and the answers already exist somewhere in your documentation or discussions.

### Pull Request Comments

PR comments are gold. When someone asks "why does this call the payment service here?" during review, they're articulating exactly the kind of question your RAG system should answer.

```python
import re
from datetime import datetime
from typing import Callable, Iterable, List, NamedTuple, Tuple

class Query(NamedTuple):  # minimal stand-in for your own query record
    text: str
    candidate_files: List[str]

# Placeholder for your code host's API client: yields (comment, reviewed_files).
CommentSource = Callable[[str, datetime], Iterable[Tuple[str, List[str]]]]

def extract_pr_questions(repo: str, since: datetime, fetch: CommentSource) -> List[Query]:
    """Find comments that asked for context or explanation."""
    patterns = [
        r"why (did|do|does|is|are|was|were)",
        r"how (does|do|did|is|are)",
        r"what (is|are|does|do|happens)",
        r"where (is|are|does|do)",
        r"can you explain",
        r"I don't understand",
    ]
    question_re = re.compile("|".join(patterns), re.IGNORECASE)
    # Extract matching comments and pair each with the files being
    # reviewed (potential gold retrieval set).
    return [
        Query(text, list(files))
        for text, files in fetch(repo, since)
        if question_re.search(text)
    ]
```

The files being reviewed give you a starting point for gold retrievals. The follow-up discussion often reveals what additional context was needed.

### Code Review Discussions

These are especially valuable for cross-repository queries.
A question like "Why does this call the payment service here?" probably touches:

- The file under review
- The payment service implementation
- Shared contracts or types between them

When the reviewer eventually says "ah, got it", that's your gold answer. The path from confusion to understanding maps directly to what your retrieval system needs to surface.

### Onboarding Documentation Gaps

New developer questions reveal systematic knowledge gaps. These aren't edge cases. They're the questions every new team member asks because the answer spans multiple repositories and isn't documented anywhere obvious.

Common patterns:

- "How do I add a new API endpoint?" (touches backend, frontend, possibly database migrations)
- "Where is authentication handled?" (security service, backend middleware, frontend state)
- "What happens when a user signs up?" (frontend, backend, email service, database)

These make excellent evaluation queries because they're high-value (asked repeatedly) and inherently cross-repository.

## Generating Queries Programmatically

Mining historical questions won't give you complete coverage. You need to generate queries tha

Building Your Evaluation Dataset from Organizational Repositories

What Makes a Good Evaluation Dataset

How Many Queries Do You Actually Need

Mining Real Questions from Your Development History

Pull Request Comments

Code Review Discussions

Onboarding Documentation Gaps

Generating Queries Programmatically

Dependency Chain Queries

Schema Impact Queries

API Contract Queries

The Human Annotation Process

Week 1-2: Query Creation

Week 2-3: Gold Retrieval Annotation

Week 3-4: Gold Answer Generation

Week 4-5: Cross-Validation

The Annotation Schema

Using LLMs to Scale Annotation

Common Pitfalls

What You Should Have After This