Why RAG Exists: Context Windows Are Too Small for Your Data

If context windows were infinite, free, and didn't degrade, we wouldn't need RAG. Here's why retrieval-augmented generation exists and when to use it.

The Library Analogy

You need to answer a question about marine biology. You have two options:

Option A: Carry the entire library (10,000 books) to your desk. Search through all of them while answering.

Option B: Use the card catalog to find the 3 most relevant books. Bring only those to your desk.

Option A is context stuffing. Option B is RAG.

Even if your desk were big enough for 10,000 books, you’d be slower and more confused than with the 3 right books. This is the fundamental insight behind RAG.

The Three Problems with Context Stuffing

Problem 1: It Doesn’t Fit

Most organizations have far more data than any context window:

$$\text{Typical enterprise codebase: } 500{,}000\text{–}5{,}000{,}000 \text{ lines}$$

$$\text{At } {\sim}10 \text{ tokens/line: } 5{,}000{,}000\text{–}50{,}000{,}000 \text{ tokens}$$

Even a 1M-token context window can only hold ~100K lines of code — a fraction of a real codebase.
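The arithmetic above is easy to check in a few lines. This is a minimal sketch: the function name `fraction_that_fits` is hypothetical, and the 10 tokens/line figure is the rough estimate used in this article rather than a measured value.

```python
def fraction_that_fits(
    codebase_lines: int,
    context_window_tokens: int = 1_000_000,
    tokens_per_line: int = 10,  # rough estimate used in this article
) -> float:
    """Return the fraction of a codebase that fits in one context window."""
    codebase_tokens = codebase_lines * tokens_per_line
    if codebase_tokens <= 0:
        return 1.0  # an empty codebase trivially fits
    return min(1.0, context_window_tokens / codebase_tokens)


for lines in (500_000, 5_000_000):
    share = fraction_that_fits(lines)
    print(f"{lines:,} lines: {share:.0%} fits in a 1M-token window")
# 500,000 lines: 20% fits in a 1M-token window
# 5,000,000 lines: 2% fits in a 1M-token window
```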

Problem 2: It Degrades Quality

As covered in earlier posts, attention dilution means the model's accuracy drops as the context grows:

$$\text{Accuracy}(n) \approx A_0 \times \left(\frac{n_0}{n}\right)^{0.15}$$

Under this model, stuffing 500K tokens when only 5K are relevant gives $(5\text{K}/500\text{K})^{0.15} = 0.01^{0.15} \approx 0.50$, so the model is roughly 50% less accurate than it needs to be.
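To get a feel for how the power law above behaves, it can be evaluated directly. The sketch below assumes the formula exactly as written; `relative_accuracy` is a hypothetical helper name, and the 0.15 exponent is taken from the formula rather than verified independently.

```python
def relative_accuracy(
    n_tokens: int,
    n_relevant: int,
    exponent: float = 0.15,  # exponent from the formula above
) -> float:
    """Accuracy(n) / A_0 under the power-law model: (n_0 / n) ** exponent."""
    if n_tokens <= n_relevant:
        return 1.0  # no irrelevant context to dilute attention
    return (n_relevant / n_tokens) ** exponent


# Each 10x increase in context multiplies accuracy by 10 ** -0.15.
print(f"{relative_accuracy(50_000, 5_000):.2f}")  # 0.71
```

The decay is gradual per step but compounds: every additional order of magnitude of irrelevant context applies the same multiplier again.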

Problem 3: It’s Expensive

Cost scales with context length:

$$\text{Cost} = \frac{n_{\text{context}}}{10^6} \times P_{\text{input}}$$

Sending 500K tokens at $3/M input costs $1.50 per query. At 100 queries/day, that’s $150/day. With RAG retrieving 5K relevant tokens: $0.015/query, or $1.50/day. 100× cheaper.
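The per-query and per-day figures follow directly from the cost formula. Here is a minimal sketch of that calculation; `daily_input_cost` is a hypothetical name, the price of 3 dollars per million input tokens is the example rate used in this article, and retrieval overhead is ignored here.

```python
def daily_input_cost(
    context_tokens: int,
    queries_per_day: int,
    price_per_million: float = 3.0,  # example input price in $ per 1M tokens
) -> float:
    """Input-token cost per day: (n_context / 1e6) * P_input * queries."""
    per_query = context_tokens / 1_000_000 * price_per_million
    return per_query * queries_per_day


stuffing = daily_input_cost(500_000, queries_per_day=100)
rag = daily_input_cost(5_000, queries_per_day=100)

print(f"Context stuffing: ${stuffing:.2f}/day")  # $150.00/day
print(f"RAG:              ${rag:.2f}/day")       # $1.50/day
print(f"Ratio:            {stuffing / rag:.0f}x")  # 100x
```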

How RAG Works

from dataclasses import dataclass

@dataclass
class Document:
    content: str
    embedding: list[float]
    metadata: dict

def rag_pipeline(
    query: str,
    knowledge_base: list[Document],
    embed_fn,       # Function to compute embeddings
    llm_fn,         # Function to call the LLM
    top_k: int = 5,
) -> str:
    """
    Complete RAG pipeline.

    1. Embed the query
    2. Find similar documents (retrieval)
    3. Construct prompt with retrieved context
    4. Generate answer with LLM
    """
    import numpy as np

    # Step 1: Embed the query
    query_embedding = embed_fn(query)

    # Step 2: Retrieve top-k similar documents
    similarities = []
    for doc in knowledge_base:
        # Cosine similarity
        sim = np.dot(query_embedding, doc.embedding) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc.embedding)
        )
        similarities.append((sim, doc))

    # Sort by similarity, take top-k
    similarities.sort(key=lambda x: x[0], reverse=True)
    retrieved = [doc for _, doc in similarities[:top_k]]

    # Step 3: Construct prompt
    context = "\n\n".join([
        f"Source {i+1}:\n{doc.content}"
        for i, doc in enumerate(retrieved)
    ])

    prompt = f"""Use the following sources to answer the question.
If the sources don't contain the answer, say so.

{context}

Question: {query}

Answer:"""

    # Step 4: Generate with LLM
    return llm_fn(prompt)

When to Use RAG vs. Long Context vs. Fine-Tuning

| Approach | Best For | Data Size | Latency | Cost |
|---|---|---|---|---|
| Short context | Simple Q&A, code generation | < 4K tokens | Low | Low |
| Long context | Document analysis, code review | 4K–200K tokens | Medium | Medium |
| RAG | Large knowledge bases, up-to-date data | Any size | Medium | Low per query |
| Fine-tuning | Teaching new behaviors/domains | N/A (baked in) | Low | High upfront |

Decision Framework

def recommend_approach(
    total_data_tokens: int,
    query_frequency: int,  # queries per day
    data_update_frequency: str,  # "static", "daily", "realtime"
    task_type: str,  # "retrieval", "reasoning", "generation"
) -> str:
    """Recommend the best approach based on use case."""

    if total_data_tokens < 4_000:
        return "Direct: fits in any context window"

    if total_data_tokens < 100_000 and task_type == "reasoning":
        return "Long context: need to reason across all data"

    if data_update_frequency in ("daily", "realtime"):
        return "RAG: data changes too frequently for fine-tuning"

    if total_data_tokens > 500_000:
        return "RAG: data too large for context window"

    if query_frequency > 100:
        return "RAG: cheaper at high query volume"

    return "Long context with selective loading"

The Quality Comparison

For a document retrieval task, here’s how approaches compare:

| Approach | Accuracy (10 docs) | Accuracy (100 docs) | Accuracy (1000 docs) |
|---|---|---|---|
| Context stuff all | 95% | 72% | 45% |
| RAG (top-5) | 92% | 89% | 87% |
| RAG (top-10) | 94% | 91% | 89% |
| Hybrid (RAG + long context) | 96% | 93% | 91% |

Context stuffing degrades rapidly with more documents. RAG maintains quality because it always retrieves a manageable number of relevant documents.

The Cost Crossover

At what point does RAG become cheaper than context stuffing?

$$\text{Context stuffing cost} = \frac{n_{\text{total}}}{10^6} \times P_{\text{input}}$$

$$\text{RAG cost} = \frac{k \times \bar{d}}{10^6} \times P_{\text{input}} + C_{\text{retrieval}}$$

Where $k$ = number of retrieved docs, $\bar{d}$ = average doc size in tokens, $C_{\text{retrieval}}$ = embedding + search cost per query.

Setting the two costs equal and solving for $n_{\text{total}}$ gives the crossover:

$$n_{\text{crossover}} = k \times \bar{d} + \frac{C_{\text{retrieval}} \times 10^6}{P_{\text{input}}}$$

def cost_crossover(
    k: int = 5,
    avg_doc_tokens: int = 1000,
    retrieval_cost: float = 0.0001,  # $ per query
    input_price: float = 3.0,  # $ per M tokens
) -> int:
    """Find the context size where RAG becomes cheaper."""
    rag_tokens = k * avg_doc_tokens
    crossover = rag_tokens + int(retrieval_cost * 1e6 / input_price)
    return crossover

cross = cost_crossover()
print(f"RAG becomes cheaper above {cross:,} tokens of total data")
# Output: RAG becomes cheaper above 5,033 tokens of total data

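With these defaults (5 retrieved documents of 1,000 tokens each and the prices shown in the code), RAG is cheaper per query for any dataset larger than ~5K tokens, which includes virtually all real-world use cases.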

Building a Simple RAG System

import numpy as np

class SimpleRAG:
    """A minimal but complete RAG system."""

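    def __init__(self, embed_fn, chunk_size=500, chunk_overlap=50):
        # The chunking loop steps by chunk_size - chunk_overlap, which must be positive
        if not 0 <= chunk_overlap < chunk_size:
            raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
        self.embed_fn = embed_fn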
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.chunks = []
        self.embeddings = []

    def add_document(self, text: str, metadata: dict = None):
        """Chunk and index a document."""
        words = text.split()

        for i in range(0, len(words), self.chunk_size - self.chunk_overlap):
            chunk = " ".join(words[i:i + self.chunk_size])
            embedding = self.embed_fn(chunk)
            self.chunks.append({
                "content": chunk,
                "metadata": metadata or {},
                "embedding": embedding,
            })
            self.embeddings.append(embedding)

        self.embeddings_matrix = np.array(self.embeddings)

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Retrieve the most relevant chunks for a query."""
        if not self.chunks:
            return []  # nothing indexed yet, so embeddings_matrix does not exist

        query_emb = np.array(self.embed_fn(query))

        # Cosine similarity against all chunks
        similarities = self.embeddings_matrix @ query_emb / (
            np.linalg.norm(self.embeddings_matrix, axis=1)
            * np.linalg.norm(query_emb)
        )

        # Top-k indices
        top_indices = np.argsort(similarities)[-top_k:][::-1]

        return [
            {**self.chunks[i], "score": float(similarities[i])}
            for i in top_indices
        ]

    def query(self, question: str, llm_fn, top_k: int = 5) -> str:
        """Full RAG query: retrieve + generate."""
        results = self.search(question, top_k)

        context = "\n\n".join([
            f"[Relevance: {r['score']:.3f}]\n{r['content']}"
            for r in results
        ])

        prompt = f"""Answer based on the following context:

{context}

Question: {question}
Answer:"""

        return llm_fn(prompt)
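To try the class above without an embedding API, a stand-in `embed_fn` is enough. The sketch below is a hypothetical hashed bag-of-words embedder; the names `toy_embed` and `cosine` and the 256-dimension default are assumptions for illustration. It captures only word overlap, not meaning, so it suits smoke tests rather than real retrieval.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Hash each word into one of `dim` buckets and count occurrences."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # md5 gives a stable bucket; built-in hash() is salted per process
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


query = toy_embed("how do whales communicate")
related = toy_embed("whales communicate with songs")
unrelated = toy_embed("quarterly revenue forecast spreadsheet")

# Shared words land in the same buckets, so the related text
# should typically score higher (barring hash collisions).
print(f"related:   {cosine(query, related):.2f}")
print(f"unrelated: {cosine(query, unrelated):.2f}")
```

Passing `toy_embed` as `embed_fn` when constructing `SimpleRAG` lets `add_document` and `search` run end to end; swapping in a real embedding model later requires no other changes, since the class only needs a function that maps text to a vector.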

The Hybrid Approach

The best systems combine RAG with long context:

  1. RAG retrieves the 5-10 most relevant documents
  2. Long context holds the retrieved documents + conversation history
  3. The model reasons over the retrieved information with full attention

This avoids the worst of both worlds:

  • No attention dilution from irrelevant content (RAG filters it out)
  • Full reasoning capability over retrieved content (long context enables deep analysis)
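The hybrid flow above can be sketched as a prompt builder that packs retrieved documents first and then fills the remaining budget with the most recent conversation turns. This is one possible arrangement, not a prescribed one: `build_hybrid_prompt`, the 8,000-token default budget, and the 4-characters-per-token estimate are all assumptions for illustration, and a real system would use the model's own tokenizer.

```python
def build_hybrid_prompt(
    question: str,
    retrieved_docs: list[str],  # already ranked, most relevant first
    history: list[str],         # conversation turns, oldest first
    max_context_tokens: int = 8_000,
) -> str:
    """Pack retrieved documents first, then as much recent history as fits."""

    def estimate_tokens(text: str) -> int:
        # Crude assumption: ~4 characters per token
        return max(1, len(text) // 4)

    budget = max_context_tokens

    # Keep documents in rank order until the budget runs out
    kept_docs = []
    for doc in retrieved_docs:
        cost = estimate_tokens(doc)
        if cost > budget:
            break
        kept_docs.append(doc)
        budget -= cost

    # Walk history newest-first so the most recent turns survive
    kept_history = []
    for turn in reversed(history):
        cost = estimate_tokens(turn)
        if cost > budget:
            break
        kept_history.append(turn)
        budget -= cost
    kept_history.reverse()  # restore chronological order

    sources = "\n\n".join(
        f"Source {i + 1}:\n{doc}" for i, doc in enumerate(kept_docs)
    )
    conversation = "\n".join(kept_history)
    return (
        f"Conversation so far:\n{conversation}\n\n"
        f"{sources}\n\n"
        f"Question: {question}\n\nAnswer:"
    )
```

Giving retrieved documents priority over history is a design choice: it protects the filtered, relevant content and lets older conversation turns drop off first when space is tight.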

ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Private Code Context retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai
