If context windows were infinite, free, and didn't degrade, we wouldn't need RAG. Here's why retrieval-augmented generation exists and when to use it.
You need to answer a question about marine biology. You have two options:
Option A: Carry the entire library (10,000 books) to your desk. Search through all of them while answering.
Option B: Use the card catalog to find the 3 most relevant books. Bring only those to your desk.
Option A is context stuffing. Option B is RAG.
Even if your desk were big enough for 10,000 books, you’d be slower and more confused than with the 3 right books. This is the fundamental insight behind RAG.
Most organizations have far more data than any context window:
Even a 1M-token context window can only hold ~100K lines of code — a fraction of a real codebase.
As we covered in earlier blogs, attention dilution means the model gets worse with more context:
Stuffing 500K tokens into the prompt when only 5K are relevant can cost you roughly 40% in accuracy compared with sending just the relevant context.
Cost scales with context length:
Sending 500K tokens at $3 per million input tokens costs $1.50 per query; at 100 queries/day, that’s $150/day. RAG sends only the ~5K relevant tokens: $0.015/query, or $1.50/day, roughly 100× cheaper.
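That back-of-the-envelope math is easy to check in a few lines (the $3 per million input tokens is an assumption, matching the pricing used later in this post):

```python
PRICE_PER_M_TOKENS = 3.0  # assumed $ per million input tokens

def query_cost(tokens: int) -> float:
    """Input cost of a single query, in dollars."""
    return tokens / 1_000_000 * PRICE_PER_M_TOKENS

stuffing = query_cost(500_000)  # send everything
rag = query_cost(5_000)         # send only the ~5K relevant tokens
print(f"stuffing: ${stuffing:.2f}/query, RAG: ${rag:.3f}/query, "
      f"{stuffing / rag:.0f}x cheaper")
# stuffing: $1.50/query, RAG: $0.015/query, 100x cheaper
```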
```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Document:
    content: str
    embedding: list[float]
    metadata: dict


def rag_pipeline(
    query: str,
    knowledge_base: list[Document],
    embed_fn,  # Function to compute embeddings
    llm_fn,    # Function to call the LLM
    top_k: int = 5,
) -> str:
    """
    Complete RAG pipeline.

    1. Embed the query
    2. Find similar documents (retrieval)
    3. Construct prompt with retrieved context
    4. Generate answer with LLM
    """
    # Step 1: Embed the query
    query_embedding = embed_fn(query)

    # Step 2: Retrieve top-k similar documents
    similarities = []
    for doc in knowledge_base:
        # Cosine similarity
        sim = np.dot(query_embedding, doc.embedding) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc.embedding)
        )
        similarities.append((sim, doc))

    # Sort by similarity, take top-k
    similarities.sort(key=lambda x: x[0], reverse=True)
    retrieved = [doc for _, doc in similarities[:top_k]]

    # Step 3: Construct prompt
    context = "\n\n".join(
        f"Source {i+1}:\n{doc.content}"
        for i, doc in enumerate(retrieved)
    )
    prompt = f"""Use the following sources to answer the question.
If the sources don't contain the answer, say so.

{context}

Question: {query}

Answer:"""

    # Step 4: Generate with LLM
    return llm_fn(prompt)
```

| Approach | Best For | Data Size | Latency | Cost |
|---|---|---|---|---|
| Short context | Simple Q&A, code generation | < 4K tokens | Low | Low |
| Long context | Document analysis, code review | 4K–200K tokens | Medium | Medium |
| RAG | Large knowledge bases, up-to-date data | Any size | Medium | Low per query |
| Fine-tuning | Teaching new behaviors/domains | N/A (baked in) | Low | High upfront |
```python
def recommend_approach(
    total_data_tokens: int,
    query_frequency: int,        # queries per day
    data_update_frequency: str,  # "static", "daily", "realtime"
    task_type: str,              # "retrieval", "reasoning", "generation"
) -> str:
    """Recommend the best approach based on use case."""
    if total_data_tokens < 4_000:
        return "Direct: fits in any context window"
    if total_data_tokens < 100_000 and task_type == "reasoning":
        return "Long context: need to reason across all data"
    if data_update_frequency in ("daily", "realtime"):
        return "RAG: data changes too frequently for fine-tuning"
    if total_data_tokens > 500_000:
        return "RAG: data too large for context window"
    if query_frequency > 100:
        return "RAG: cheaper at high query volume"
    return "Long context with selective loading"
```

For a document retrieval task, here’s how approaches compare:
| Approach | Accuracy (10 docs) | Accuracy (100 docs) | Accuracy (1000 docs) |
|---|---|---|---|
| Context stuff all | 95% | 72% | 45% |
| RAG (top-5) | 92% | 89% | 87% |
| RAG (top-10) | 94% | 91% | 89% |
| Hybrid (RAG + long context) | 96% | 93% | 91% |
Context stuffing degrades rapidly with more documents. RAG maintains quality because it always retrieves a manageable number of relevant documents.
At what point does RAG become cheaper than context stuffing? Per query:

- Context stuffing costs `N * p`, where `N` is the total data size in tokens and `p` is the input price per token.
- RAG costs `k * d * p + c`, where `k` = number of retrieved docs, `d` = average doc size in tokens, and `c` = embedding + search cost.

The crossover: RAG wins once `N > k * d + c / p`.
```python
def cost_crossover(
    k: int = 5,
    avg_doc_tokens: int = 1000,
    retrieval_cost: float = 0.0001,  # $ per query
    input_price: float = 3.0,        # $ per M tokens
) -> int:
    """Find the total data size (in tokens) where RAG becomes cheaper."""
    rag_tokens = k * avg_doc_tokens
    crossover = rag_tokens + int(retrieval_cost * 1e6 / input_price)
    return crossover

cross = cost_crossover()
print(f"RAG becomes cheaper above {cross:,} tokens of total data")
# Output: RAG becomes cheaper above 5,033 tokens of total data
```

RAG is cheaper for any dataset larger than ~5K tokens, which includes virtually all real-world use cases.
```python
import numpy as np


class SimpleRAG:
    """A minimal but complete RAG system."""

    def __init__(self, embed_fn, chunk_size=500, chunk_overlap=50):
        self.embed_fn = embed_fn
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.chunks = []
        self.embeddings = []

    def add_document(self, text: str, metadata: dict = None):
        """Chunk and index a document."""
        words = text.split()
        for i in range(0, len(words), self.chunk_size - self.chunk_overlap):
            chunk = " ".join(words[i:i + self.chunk_size])
            embedding = self.embed_fn(chunk)
            self.chunks.append({
                "content": chunk,
                "metadata": metadata or {},
                "embedding": embedding,
            })
            self.embeddings.append(embedding)
        # Rebuild the matrix once per document, not once per chunk
        self.embeddings_matrix = np.array(self.embeddings)

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Retrieve the most relevant chunks for a query."""
        query_emb = np.array(self.embed_fn(query))
        # Cosine similarity against all chunks
        similarities = self.embeddings_matrix @ query_emb / (
            np.linalg.norm(self.embeddings_matrix, axis=1)
            * np.linalg.norm(query_emb)
        )
        # Top-k indices, highest similarity first
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [
            {**self.chunks[i], "score": float(similarities[i])}
            for i in top_indices
        ]

    def query(self, question: str, llm_fn, top_k: int = 5) -> str:
        """Full RAG query: retrieve + generate."""
        results = self.search(question, top_k)
        context = "\n\n".join(
            f"[Relevance: {r['score']:.3f}]\n{r['content']}"
            for r in results
        )
        prompt = f"""Answer based on the following context:

{context}

Question: {question}

Answer:"""
        return llm_fn(prompt)
```

The best systems combine RAG with long context: retrieve generously, then give the model a window large enough to hold everything retrieved.
This avoids the worst of both worlds: the attention dilution of stuffing everything into the prompt, and the risk of a too-aggressive top-k cutting out relevant material.
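A minimal sketch of that hybrid, with stub retrieval and LLM functions standing in for a real index and model call (every name here is illustrative, not part of any library):

```python
def hybrid_query(search_fn, question: str, llm_fn, top_k: int = 20) -> str:
    """Retrieve generously, then let a long-context model read all of it."""
    chunks = search_fn(question, top_k)
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm_fn(prompt)

# Stubs for illustration only
corpus = [f"chunk {i}" for i in range(1000)]
search = lambda q, k: corpus[:k]  # a real system would rank by similarity
llm = lambda p: f"(answer drawn from {p.count('chunk')} retrieved chunks)"
print(hybrid_query(search, "What is RAG?", llm))
# (answer drawn from 20 retrieved chunks)
```

Retrieval keeps the context relevant; the large window keeps recall high.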
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai