Why RAG Exists: Context Windows Are Too Small for Your Data
The Library Analogy
You need to answer a question about marine biology. You have two options:
Option A: Carry the entire library (10,000 books) to your desk. Search through all of them while answering.
Option B: Use the card catalog to find the 3 most relevant books. Bring only those to your desk.
Option A is context stuffing. Option B is RAG.
Even if your desk were big enough for 10,000 books, you’d be slower and more confused than with the 3 right books. This is the fundamental insight behind RAG.
The Three Problems with Context Stuffing
Problem 1: It Doesn’t Fit
Most organizations have far more data than any context window can hold.
Even a 1M-token context window holds only about 100K lines of code, a fraction of a real codebase.
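A rough fit check makes the gap concrete. This is a minimal sketch, not a tokenizer: the `fits_in_context` helper is hypothetical, the ~10 tokens per line ratio is an assumption implied by the 1M-token / 100K-line figure above, and the 2M-line codebase is an invented example.

```python
def fits_in_context(
    lines_of_code: int,
    context_window_tokens: int = 1_000_000,
    tokens_per_line: float = 10.0,  # assumption: 1M tokens ~ 100K lines of code
) -> tuple[bool, float]:
    """Return (fits, fraction of the codebase that fits in the window)."""
    estimated_tokens = lines_of_code * tokens_per_line
    if estimated_tokens == 0:
        return True, 1.0  # an empty codebase trivially fits
    fraction = min(1.0, context_window_tokens / estimated_tokens)
    return estimated_tokens <= context_window_tokens, fraction


# Hypothetical 2M-line codebase against a 1M-token window
fits, fraction = fits_in_context(2_000_000)
print(f"fits={fits}, share of codebase that fits: {fraction:.0%}")
```

Under these assumptions, only about 5% of the hypothetical codebase fits, so something has to choose which 5% the model sees.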
Problem 2: It Degrades Quality
As covered in earlier posts, attention dilution makes the model less accurate as the context grows.
Stuffing 500K tokens when only 5K are relevant means the model is ~40% less accurate than it needs to be.
Problem 3: It’s Expensive
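Cost scales with context length.
Sending 500K tokens at $3 per million input tokens (the price used in the cost crossover code later in this post) costs $1.50 per query. At 100 queries/day, that’s $150/day. With RAG sending only the 5K relevant tokens, it’s $0.015/query, or $1.50/day. 100× cheaper.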
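The arithmetic behind this comparison can be sketched in a few lines. The `query_cost` helper is hypothetical, and the $3 per million input tokens is an assumption taken from the default price in the cost crossover code later in this post. Output tokens and retrieval overhead are ignored here.

```python
def query_cost(input_tokens: int, price_per_million: float = 3.0) -> float:
    """Input-token cost in dollars for one query (assumed flat per-token price)."""
    return input_tokens / 1_000_000 * price_per_million


queries_per_day = 100
stuffing = query_cost(500_000)  # send everything
rag = query_cost(5_000)         # send only the retrieved context

print(f"Context stuffing: ${stuffing:.2f}/query, ${stuffing * queries_per_day:.2f}/day")
print(f"RAG:              ${rag:.3f}/query, ${rag * queries_per_day:.2f}/day")
print(f"Ratio: {stuffing / rag:.0f}x")
```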
How RAG Works

    from dataclasses import dataclass
    from typing import Callable

    import numpy as np


    @dataclass
    class Document:
        content: str
        embedding: list[float]
        metadata: dict


    def rag_pipeline(
        query: str,
        knowledge_base: list[Document],
        embed_fn: Callable[[str], list[float]],  # Function to compute embeddings
        llm_fn: Callable[[str], str],  # Function to call the LLM
        top_k: int = 5,
    ) -> str:
        """
        Complete RAG pipeline.

        1. Embed the query
        2. Find similar documents (retrieval)
        3. Construct prompt with retrieved context
        4. Generate answer with LLM
        """
        # Step 1: Embed the query
        query_embedding = embed_fn(query)

        # Step 2: Retrieve top-k similar documents
        similarities = []
        for doc in knowledge_base:
            # Cosine similarity
            sim = np.dot(query_embedding, doc.embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc.embedding)
            )
            similarities.append((sim, doc))

        # Sort by similarity, take top-k
        similarities.sort(key=lambda x: x[0], reverse=True)
        retrieved = [doc for _, doc in similarities[:top_k]]

        # Step 3: Construct prompt
        context = "\n\n".join([
            f"Source {i+1}:\n{doc.content}"
            for i, doc in enumerate(retrieved)
        ])
        prompt = f"""Use the following sources to answer the question.
    If the sources don't contain the answer, say so.

    {context}

    Question: {query}
    Answer:"""

        # Step 4: Generate with LLM
        return llm_fn(prompt)

When to Use RAG vs. Long Context vs. Fine-Tuning
| Approach | Best For | Data Size | Latency | Cost |
|---|---|---|---|---|
| Short context | Simple Q&A, code generation | < 4K tokens | Low | Low |
| Long context | Document analysis, code review | 4K–200K tokens | Medium | Medium |
| RAG | Large knowledge bases, up-to-date data | Any size | Medium | Low per query |
| Fine-tuning | Teaching new behaviors/domains | N/A (baked in) | Low | High upfront |
Decision Framework

    def recommend_approach(
        total_data_tokens: int,
        query_frequency: int,  # queries per day
        data_update_frequency: str,  # "static", "daily", "realtime"
        task_type: str,  # "retrieval", "reasoning", "generation"
    ) -> str:
        """Recommend the best approach based on use case."""
        if total_data_tokens < 4_000:
            return "Direct: fits in any context window"
        if total_data_tokens < 100_000 and task_type == "reasoning":
            return "Long context: need to reason across all data"
        if data_update_frequency in ("daily", "realtime"):
            return "RAG: data changes too frequently for fine-tuning"
        if total_data_tokens > 500_000:
            return "RAG: data too large for context window"
        if query_frequency > 100:
            return "RAG: cheaper at high query volume"
        return "Long context with selective loading"

The Quality Comparison
For a document retrieval task, here’s how the approaches compare:
| Approach | Accuracy (10 docs) | Accuracy (100 docs) | Accuracy (1000 docs) |
|---|---|---|---|
| Context stuff all | 95% | 72% | 45% |
| RAG (top-5) | 92% | 89% | 87% |
| RAG (top-10) | 94% | 91% | 89% |
| Hybrid (RAG + long context) | 96% | 93% | 91% |
Context stuffing degrades rapidly with more documents. RAG maintains quality because it always retrieves a manageable number of relevant documents.
The Cost Crossover
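At what point does RAG become cheaper than context stuffing? Per query, the two costs are:

$$\text{Cost}_{\text{stuff}} = N \cdot p \qquad \text{Cost}_{\text{RAG}} = k \cdot d \cdot p + c_r$$

Where $N$ = total data tokens, $p$ = input price per token, $k$ = number of retrieved docs, $d$ = average doc size in tokens, $c_r$ = embedding + search cost per query.
The crossover, where the two costs are equal:

$$N^{*} = k \cdot d + \frac{c_r}{p}$$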

    def cost_crossover(
        k: int = 5,
        avg_doc_tokens: int = 1000,
        retrieval_cost: float = 0.0001,  # $ per query
        input_price: float = 3.0,  # $ per M tokens
    ) -> int:
        """Find the context size where RAG becomes cheaper."""
        rag_tokens = k * avg_doc_tokens
        # Convert the per-query retrieval cost into an equivalent number of input tokens
        crossover = rag_tokens + int(retrieval_cost * 1e6 / input_price)
        return crossover


    cross = cost_crossover()
    print(f"RAG becomes cheaper above {cross:,} tokens of total data")
    # Output: RAG becomes cheaper above 5,033 tokens of total data

RAG is cheaper for any dataset larger than ~5K tokens, which includes virtually all real-world use cases.
Building a Simple RAG System

    import numpy as np


    class SimpleRAG:
        """A minimal but complete RAG system."""

        def __init__(self, embed_fn, chunk_size=500, chunk_overlap=50):
            if chunk_overlap >= chunk_size:
                raise ValueError("chunk_overlap must be smaller than chunk_size")
            self.embed_fn = embed_fn
            self.chunk_size = chunk_size
            self.chunk_overlap = chunk_overlap
            self.chunks = []
            self.embeddings = []
            self.embeddings_matrix = None  # Built once a document has been added

        def add_document(self, text: str, metadata: dict | None = None):
            """Chunk and index a document."""
            words = text.split()
            step = self.chunk_size - self.chunk_overlap
            for i in range(0, len(words), step):
                chunk = " ".join(words[i:i + self.chunk_size])
                embedding = self.embed_fn(chunk)
                self.chunks.append({
                    "content": chunk,
                    "metadata": metadata or {},
                    "embedding": embedding,
                })
                self.embeddings.append(embedding)
                # Stop once a chunk reaches the end, so the tail is not indexed twice
                if i + self.chunk_size >= len(words):
                    break
            if self.embeddings:
                self.embeddings_matrix = np.array(self.embeddings)

        def search(self, query: str, top_k: int = 5) -> list[dict]:
            """Retrieve the most relevant chunks for a query."""
            if self.embeddings_matrix is None:
                return []  # Nothing indexed yet
            query_emb = np.array(self.embed_fn(query))
            # Cosine similarity against all chunks
            similarities = self.embeddings_matrix @ query_emb / (
                np.linalg.norm(self.embeddings_matrix, axis=1)
                * np.linalg.norm(query_emb)
            )
            # Top-k indices, highest similarity first
            top_indices = np.argsort(similarities)[-top_k:][::-1]
            return [
                {**self.chunks[i], "score": float(similarities[i])}
                for i in top_indices
            ]

        def query(self, question: str, llm_fn, top_k: int = 5) -> str:
            """Full RAG query: retrieve + generate."""
            results = self.search(question, top_k)
            context = "\n\n".join([
                f"[Relevance: {r['score']:.3f}]\n{r['content']}"
                for r in results
            ])
            prompt = f"""Answer based on the following context:

    {context}

    Question: {question}
    Answer:"""
            return llm_fn(prompt)

The Hybrid Approach
The best systems combine RAG with long context:
- RAG retrieves the 5-10 most relevant documents
- Long context holds the retrieved documents + conversation history
- The model reasons over the retrieved information with full attention
This combines the strengths of both:
- No attention dilution from irrelevant content (RAG filters it out)
- Full reasoning capability over retrieved content (long context enables deep analysis)
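The hybrid idea can be sketched as prompt assembly under a token budget: ranked retrieved documents go in first, then as much recent conversation history as still fits. Everything here is an illustrative assumption rather than a reference implementation: the `build_hybrid_prompt` name, the 8,000-token default budget, the prompt layout, and the word-count stand-in for a real tokenizer. The section labels in the prompt are not counted against the budget.

```python
def build_hybrid_prompt(
    question: str,
    retrieved_docs: list[str],  # already ranked, most relevant first
    history: list[str],         # conversation turns, oldest first
    token_budget: int = 8_000,  # hypothetical budget
    count_tokens=lambda text: len(text.split()),  # crude stand-in for a tokenizer
) -> str:
    """Fill the budget with retrieved docs first, then the newest history turns."""
    remaining = token_budget - count_tokens(question)

    kept_docs = []
    for doc in retrieved_docs:
        cost = count_tokens(doc)
        if cost > remaining:
            break  # lower-ranked docs are dropped once the budget is spent
        kept_docs.append(doc)
        remaining -= cost

    kept_history = []
    for turn in reversed(history):  # newest turns are kept preferentially
        cost = count_tokens(turn)
        if cost > remaining:
            break
        kept_history.append(turn)
        remaining -= cost
    kept_history.reverse()  # restore chronological order

    sources = "\n\n".join(f"Source {i + 1}:\n{d}" for i, d in enumerate(kept_docs))
    conversation = "\n".join(kept_history)
    return (
        f"Conversation so far:\n{conversation}\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_hybrid_prompt(
    question="How do whales sleep?",
    retrieved_docs=["doc one text here", "doc two text"],
    history=["User: hello there", "Assistant: hi"],
    token_budget=30,
)
print(prompt)
```

Filling the budget with retrieved documents before history is one possible priority order. A chat-heavy application might reverse it or reserve a fixed share for each.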
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Private Code Context retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai