Why RAG Exists: Context Windows Are Too Small for Your Data

If context windows were infinite, free, and didn't degrade, we wouldn't need RAG. Here's why retrieval-augmented generation exists and when to use it.

The Library Analogy

You need to answer a question about marine biology. You have two options:

Option A: Carry the entire library (10,000 books) to your desk. Search through all of them while answering.

Option B: Use the card catalog to find the 3 most relevant books. Bring only those to your desk.

Option A is context stuffing. Option B is RAG.

Even if your desk were big enough for 10,000 books, you’d be slower and more confused than with the 3 right books. This is the fundamental insight behind RAG.

The Three Problems with Context Stuffing

Problem 1: It Doesn’t Fit

Most organizations have far more data than any context window:

Typical enterprise codebase: 500,000–5,000,000 lines

At ~10 tokens/line: 5,000,000–50,000,000 tokens

Even a 1M-token context window can only hold ~100K lines of code — a fraction of a real codebase.
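To make that concrete, here is a back-of-envelope sketch using the same assumptions as above (~10 tokens per line of code and a 1M-token window):

```python
# Back-of-envelope: how much of a codebase fits in a context window?
# Assumes ~10 tokens per line of code and a 1M-token window.
TOKENS_PER_LINE = 10
CONTEXT_WINDOW = 1_000_000

for lines in (500_000, 5_000_000):
    tokens = lines * TOKENS_PER_LINE
    fits = CONTEXT_WINDOW / tokens  # fraction of the codebase that fits
    print(f"{lines:>9,} lines -> {tokens:>10,} tokens "
          f"({fits:.0%} fits in a 1M-token window)")
```

Even at the small end of the range, only a fifth of the codebase fits; at the large end, about 2%.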

Problem 2: It Degrades Quality

As we covered in earlier blogs, attention dilution means the model gets worse with more context:

$$\text{Accuracy}(n) \approx A_0 \times \left(\frac{n_0}{n}\right)^{0.15}$$

Stuffing 500K tokens when only 5K are relevant leaves the model roughly half as accurate as it could be, since (5,000/500,000)^0.15 ≈ 0.5.
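Plugging the example's numbers into this power law shows the size of the penalty (a sketch; A_0 and the 0.15 exponent are the estimates from the earlier posts, not measured here):

```python
# Relative accuracy when stuffing 500K tokens vs. the 5K that are relevant,
# per Accuracy(n) ≈ A0 * (n0 / n) ** 0.15
n0, n = 5_000, 500_000
relative = (n0 / n) ** 0.15  # accuracy as a fraction of A0
print(f"Relative accuracy: {relative:.2f}")  # ≈ 0.50, i.e. about half
```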

Problem 3: It’s Expensive

Cost scales with context length:

$$\text{Cost} = \frac{n_{\text{context}}}{10^6} \times P_{\text{input}}$$

Sending 500K tokens at $3/M input costs $1.50 per query. At 100 queries/day, that’s $150/day. With RAG retrieving 5K relevant tokens: $0.015/query, or $1.50/day. 100× cheaper.
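The arithmetic behind those figures, as a quick check (assuming the $3 per million input tokens used above):

```python
PRICE_PER_M = 3.0  # $ per 1M input tokens

def query_cost(tokens: int) -> float:
    """Input cost of a single query at the assumed price."""
    return tokens / 1e6 * PRICE_PER_M

stuffing = query_cost(500_000)  # context stuffing: $1.50 per query
rag = query_cost(5_000)         # RAG with 5K retrieved tokens: $0.015 per query

print(f"Stuffing: ${stuffing:.2f}/query, ${stuffing * 100:.0f}/day at 100 queries")
print(f"RAG:      ${rag:.3f}/query, ${rag * 100:.2f}/day")
print(f"Ratio:    {stuffing / rag:.0f}x")
```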

How RAG Works

from dataclasses import dataclass

@dataclass
class Document:
    content: str
    embedding: list[float]
    metadata: dict

def rag_pipeline(
    query: str,
    knowledge_base: list[Document],
    embed_fn,       # Function to compute embeddings
    llm_fn,         # Function to call the LLM
    top_k: int = 5,
) -> str:
    """
    Complete RAG pipeline.

    1. Embed the query
    2. Find similar documents (retrieval)
    3. Construct prompt with retrieved context
    4. Generate answer with LLM
    """
    import numpy as np

    # Step 1: Embed the query
    query_embedding = embed_fn(query)

    # Step 2: Retrieve top-k similar documents
    similarities = []
    for doc in knowledge_base:
        # Cosine similarity
        sim = np.dot(query_embedding, doc.embedding) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc.embedding)
        )
        similarities.append((sim, doc))

    # Sort by similarity, take top-k
    similarities.sort(key=lambda x: x[0], reverse=True)
    retrieved = [doc for _, doc in similarities[:top_k]]

    # Step 3: Construct prompt
    context = "\n\n".join([
        f"Source {i+1}:\n{doc.content}"
        for i, doc in enumerate(retrieved)
    ])

    prompt = f"""Use the following sources to answer the question.
If the sources don't contain the answer, say so.

{context}

Question: {query}

Answer:"""

    # Step 4: Generate with LLM
    return llm_fn(prompt)

When to Use RAG vs. Long Context vs. Fine-Tuning

| Approach | Best For | Data Size | Latency | Cost |
|---|---|---|---|---|
| Short context | Simple Q&A, code generation | < 4K tokens | Low | Low |
| Long context | Document analysis, code review | 4K–200K tokens | Medium | Medium |
| RAG | Large knowledge bases, up-to-date data | Any size | Medium | Low per query |
| Fine-tuning | Teaching new behaviors/domains | N/A (baked in) | Low | High upfront |

Decision Framework

def recommend_approach(
    total_data_tokens: int,
    query_frequency: int,  # queries per day
    data_update_frequency: str,  # "static", "daily", "realtime"
    task_type: str,  # "retrieval", "reasoning", "generation"
) -> str:
    """Recommend the best approach based on use case."""

    if total_data_tokens < 4_000:
        return "Direct: fits in any context window"

    if total_data_tokens < 100_000 and task_type == "reasoning":
        return "Long context: need to reason across all data"

    if data_update_frequency in ("daily", "realtime"):
        return "RAG: data changes too frequently for fine-tuning"

    if total_data_tokens > 500_000:
        return "RAG: data too large for context window"

    if query_frequency > 100:
        return "RAG: cheaper at high query volume"

    return "Long context with selective loading"

The Quality Comparison

For a document retrieval task, here’s how approaches compare:

| Approach | Accuracy (10 docs) | Accuracy (100 docs) | Accuracy (1000 docs) |
|---|---|---|---|
| Context stuff all | 95% | 72% | 45% |
| RAG (top-5) | 92% | 89% | 87% |
| RAG (top-10) | 94% | 91% | 89% |
| Hybrid (RAG + long context) | 96% | 93% | 91% |

Context stuffing degrades rapidly with more documents. RAG maintains quality because it always retrieves a manageable number of relevant documents.

The Cost Crossover

At what point does RAG become cheaper than context stuffing?

$$\text{Context stuffing cost} = \frac{n_{\text{total}}}{10^6} \times P_{\text{input}}$$

$$\text{RAG cost} = \frac{k \times \bar{d}}{10^6} \times P_{\text{input}} + C_{\text{retrieval}}$$

Where $k$ = number of retrieved docs, $\bar{d}$ = average doc size, $C_{\text{retrieval}}$ = embedding + search cost.

The crossover:

$$n_{\text{crossover}} = k \times \bar{d} + \frac{C_{\text{retrieval}} \times 10^6}{P_{\text{input}}}$$

def cost_crossover(
    k: int = 5,
    avg_doc_tokens: int = 1000,
    retrieval_cost: float = 0.0001,  # $ per query
    input_price: float = 3.0,  # $ per M tokens
) -> int:
    """Find the context size where RAG becomes cheaper."""
    rag_tokens = k * avg_doc_tokens
    crossover = rag_tokens + int(retrieval_cost * 1e6 / input_price)
    return crossover

cross = cost_crossover()
print(f"RAG becomes cheaper above {cross:,} tokens of total data")
# Output: RAG becomes cheaper above 5,033 tokens of total data

Under these assumptions, RAG is cheaper whenever the alternative is stuffing more than ~5K tokens of context per query, which covers virtually all real-world knowledge bases.

Building a Simple RAG System

import numpy as np

class SimpleRAG:
    """A minimal but complete RAG system."""

    def __init__(self, embed_fn, chunk_size=500, chunk_overlap=50):
        self.embed_fn = embed_fn
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.chunks = []
        self.embeddings = []

    def add_document(self, text: str, metadata: dict | None = None):
        """Chunk and index a document."""
        words = text.split()

        for i in range(0, len(words), self.chunk_size - self.chunk_overlap):
            chunk = " ".join(words[i:i + self.chunk_size])
            embedding = self.embed_fn(chunk)
            self.chunks.append({
                "content": chunk,
                "metadata": metadata or {},
                "embedding": embedding,
            })
            self.embeddings.append(embedding)

        self.embeddings_matrix = np.array(self.embeddings)

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Retrieve the most relevant chunks for a query."""
        query_emb = np.array(self.embed_fn(query))

        # Cosine similarity against all chunks
        similarities = self.embeddings_matrix @ query_emb / (
            np.linalg.norm(self.embeddings_matrix, axis=1)
            * np.linalg.norm(query_emb)
        )

        # Top-k indices
        top_indices = np.argsort(similarities)[-top_k:][::-1]

        return [
            {**self.chunks[i], "score": float(similarities[i])}
            for i in top_indices
        ]

    def query(self, question: str, llm_fn, top_k: int = 5) -> str:
        """Full RAG query: retrieve + generate."""
        results = self.search(question, top_k)

        context = "\n\n".join([
            f"[Relevance: {r['score']:.3f}]\n{r['content']}"
            for r in results
        ])

        prompt = f"""Answer based on the following context:

{context}

Question: {question}
Answer:"""

        return llm_fn(prompt)
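To see the retrieval step in action, here is a self-contained toy run with a deliberately crude bag-of-letters embedding. This is illustrative only: `toy_embed` is a stand-in you could pass to `SimpleRAG` as `embed_fn`, whereas a real system would use a learned embedding model.

```python
import numpy as np

def toy_embed(text: str) -> list[float]:
    """Bag-of-letters embedding: counts of 'a'..'z'. Toy stand-in only."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

docs = ["whales are marine mammals", "rust is a systems language"]
embs = np.array([toy_embed(d) for d in docs])

query = "tell me about whales"
q = np.array(toy_embed(query))

# Same cosine-similarity retrieval as SimpleRAG.search
sims = embs @ q / (np.linalg.norm(embs, axis=1) * np.linalg.norm(q))
best = docs[int(np.argmax(sims))]
print(best)  # the whale document scores highest
```

Even this crude embedding ranks the marine-biology document above the programming one, because retrieval only needs relative similarity, not perfect semantics.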

The Hybrid Approach

The best systems combine RAG with long context:

  1. RAG retrieves the 5-10 most relevant documents
  2. Long context holds the retrieved documents + conversation history
  3. The model reasons over the retrieved information with full attention

This avoids the worst of both worlds: retrieval keeps the prompt small and relevant, while long context gives the model room to reason over everything it retrieved alongside the conversation so far.

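The prompt-assembly part of this pattern can be sketched as follows. The `build_hybrid_prompt` helper and its format are assumptions for illustration, not a prescribed API:

```python
def build_hybrid_prompt(
    retrieved: list[str],
    history: list[tuple[str, str]],  # (role, text) conversation turns
    question: str,
) -> str:
    """Sketch of the hybrid pattern: retrieved sources and conversation
    history share one long-context prompt."""
    sources = "\n\n".join(
        f"Source {i + 1}:\n{doc}" for i, doc in enumerate(retrieved)
    )
    dialogue = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        f"Sources:\n{sources}\n\n"
        f"Conversation so far:\n{dialogue}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_hybrid_prompt(
    retrieved=["Whales are mammals."],
    history=[("user", "Let's talk marine biology.")],
    question="Are whales fish?",
)
print(prompt)
```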

ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai