Prompt caching saves 90% on repeated context. For coding agents carrying 20-40K tokens of system prompts, the savings are enormous.
Imagine buying a movie ticket every time you visit the theater. Now imagine buying a season pass — you pay once (slightly more than a single ticket) and then get heavily discounted entry for every subsequent visit.
Prompt caching works the same way. The first time you send a long system prompt, you "cache" it (slightly more expensive). Every subsequent request with the same prefix reads from the cache at a 90% discount.
| Operation | Price (relative to base) |
|---|---|
| Cache write | 1.25× base input price |
| Cache read | 0.10× base input price |
| Normal input | 1.00× base input price |
For Claude Sonnet at $3/M base input, that works out to:

- Cache write: $3.75/M
- Cache read: $0.30/M
- Normal input: $3.00/M

Caching costs more on the first request (1.25×) but saves on every subsequent request (0.10×). The breakeven is easy to compute: without caching, n requests cost n × 1.00 (relative to base price); with caching, they cost 1.25 + 0.10(n − 1). The cached side is cheaper once n > 1.28, so caching pays off after just 2 requests!
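The breakeven arithmetic above can be sketched directly (relative to the base input price, ignoring output tokens):

```python
def breakeven_requests(write_mult: float = 1.25, read_mult: float = 0.10) -> int:
    """Smallest request count n at which caching the prefix wins.

    Without caching, n requests cost n * 1.00 (relative to base price).
    With caching, they cost write_mult + (n - 1) * read_mult.
    """
    n = 1
    while write_mult + (n - 1) * read_mult >= n:
        n += 1
    return n

print(breakeven_requests())  # 2: cached 1.25 + 0.10 = 1.35x vs 2.00x uncached
```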
```python
def caching_savings(
    prefix_tokens: int,
    n_requests: int,
    new_tokens_per_request: int = 500,
    base_input_price: float = 3.0,  # $/M
    cache_write_multiplier: float = 1.25,
    cache_read_multiplier: float = 0.10,
    output_tokens: int = 500,
    output_price: float = 15.0,  # $/M
) -> dict:
    """Compare costs with and without caching."""
    # Without caching
    cost_no_cache = n_requests * (
        (prefix_tokens + new_tokens_per_request) / 1e6 * base_input_price
        + output_tokens / 1e6 * output_price
    )

    # With caching
    write_cost = prefix_tokens / 1e6 * base_input_price * cache_write_multiplier
    read_cost = (n_requests - 1) * prefix_tokens / 1e6 * base_input_price * cache_read_multiplier
    new_cost = n_requests * new_tokens_per_request / 1e6 * base_input_price
    output_cost = n_requests * output_tokens / 1e6 * output_price
    cost_with_cache = write_cost + read_cost + new_cost + output_cost

    savings = cost_no_cache - cost_with_cache
    savings_pct = savings / cost_no_cache * 100
    return {
        "no_cache": cost_no_cache,
        "with_cache": cost_with_cache,
        "savings": savings,
        "savings_pct": savings_pct,
    }


# Typical coding agent scenarios
scenarios = [
    ("Light usage", 5_000, 10),
    ("Coding session", 20_000, 50),
    ("Heavy agent", 40_000, 200),
    ("CI/CD pipeline", 30_000, 1000),
]

print(f"{'Scenario':<20} {'Prefix':>8} {'Requests':>8} "
      f"{'No Cache':>10} {'Cached':>10} {'Savings':>10}")
print("=" * 70)
for name, prefix, n_req in scenarios:
    r = caching_savings(prefix, n_req)
    print(f"{name:<20} {prefix:>7,} {n_req:>8} "
          f"${r['no_cache']:>9.2f} ${r['with_cache']:>9.2f} "
          f"{r['savings_pct']:>9.1f}%")
```

Output:
```
Scenario               Prefix Requests   No Cache     Cached    Savings
======================================================================
Light usage             5,000       10 $     0.24 $     0.12      49.1%
Coding session         20,000       50 $     3.45 $     0.82      76.3%
Heavy agent            40,000      200 $    25.80 $     4.34      83.2%
CI/CD pipeline         30,000     1000 $    99.00 $    18.10      81.7%
```

For a heavy agent making 200 requests with a 40K-token system prompt, caching saves $21.46 per session, an 83% reduction.
The cache key is the exact byte-for-byte prefix of the prompt. Any change — even a single character — invalidates the cache.
Cacheable (stays constant):

- System prompt and instructions
- Tool definitions
- Project context (e.g. a CLAUDE.md file)

Not cacheable (changes each request):

- Conversation history as it grows
- The current user message
- Timestamps and other request-specific metadata
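One way to internalize the byte-for-byte rule is to think of the prefix as hashed. The `cache_key` function below is a hypothetical model, not a provider API, but it shows how a single trailing space yields a different key and therefore a cache miss:

```python
import hashlib

def cache_key(prefix: str) -> str:
    # Hypothetical model of prefix matching: providers compare the exact
    # byte prefix, which behaves like hashing the serialized prompt.
    return hashlib.sha256(prefix.encode()).hexdigest()[:16]

a = cache_key("You are an AI coding assistant.")
b = cache_key("You are an AI coding assistant. ")  # one trailing space
print(a == b)  # False: one byte of difference means a cache miss
```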
```python
# Structure your prompt for maximum cache hits:

def build_prompt(system_prompt, tools, static_context,
                 conversation_history, user_message):
    """
    Place static content first (cacheable prefix),
    dynamic content last (not cached).
    """
    return [
        # ─── CACHEABLE PREFIX (stays identical across requests) ───
        {"role": "system", "content": system_prompt},
        # Tool definitions are part of the API request structure
        # and are also cached if identical
        {"role": "system", "content": static_context},
        # ─── DYNAMIC SUFFIX (changes each request) ───
        *conversation_history,
        {"role": "user", "content": user_message},
    ]
```
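On Anthropic's API, the boundary between cached prefix and dynamic suffix is marked explicitly with a `cache_control` breakpoint on the last static block. The payload below is a sketch based on Anthropic's published prompt-caching docs; the model id is a placeholder, and field names should be checked against the current API reference:

```python
# Sketch of an Anthropic Messages API payload with a cache breakpoint.
# No network call is made; this only shows the request shape.
request = {
    "model": "<model-id>",  # placeholder, not a real model name
    "system": [
        {"type": "text", "text": "You are an AI coding assistant..."},
        {
            "type": "text",
            "text": "[project CLAUDE.md contents]",
            # Everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [{"role": "user", "content": "Fix the failing test."}],
    "max_tokens": 1024,
}
```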
```python
# Example: Claude Code agent
system_prompt = """You are an AI coding assistant..."""          # 3K tokens
tools = """[tool definitions for bash, read, edit, write...]"""  # 10K tokens
claude_md = """[project CLAUDE.md file contents]"""              # 5K tokens

# Total cacheable prefix: ~18K tokens
# With caching at $3/M input:
#   Without cache:   18K × $3/M    = $0.054  per request
#   With cache read: 18K × $0.30/M = $0.0054 per request
#   Savings: $0.0486 per request → $4.86 per 100 requests
```

Multi-turn conversations present a challenge: the conversation history changes each turn, breaking the cache. The solution: cache the static prefix, not the conversation.
```python
def multi_turn_with_caching(
    n_turns: int,
    system_tokens: int = 15_000,  # Cached prefix
    tokens_per_turn: int = 1000,
    base_price: float = 3.0,
):
    """Compare multi-turn costs with and without caching."""
    total_no_cache = 0
    total_with_cache = 0

    for turn in range(1, n_turns + 1):
        conversation_tokens = (turn - 1) * tokens_per_turn
        new_tokens = tokens_per_turn

        # Without caching: pay full price for everything
        turn_tokens = system_tokens + conversation_tokens + new_tokens
        total_no_cache += turn_tokens / 1e6 * base_price

        # With caching: pay cache-read for system, full for rest
        cached_cost = system_tokens / 1e6 * base_price * 0.10  # Cache read
        dynamic_cost = (conversation_tokens + new_tokens) / 1e6 * base_price
        total_with_cache += cached_cost + dynamic_cost

    # The first request pays the 1.25× write price, not the 0.10× read
    # price already counted in the loop, so add the 1.15× difference
    total_with_cache += system_tokens / 1e6 * base_price * 1.15

    savings_pct = (1 - total_with_cache / total_no_cache) * 100
    print(f"\n{n_turns}-turn conversation, {system_tokens:,} token cached prefix:")
    print(f"  Without caching: ${total_no_cache:.4f}")
    print(f"  With caching:    ${total_with_cache:.4f}")
    print(f"  Savings:         {savings_pct:.1f}%")


for n in [5, 10, 20, 50]:
    multi_turn_with_caching(n)
```

**Put static content first.** System prompt, tools, and project context should be the prefix of every request.
**Don't include timestamps in cached content.** A timestamp changes every second, breaking the cache.

**Use consistent formatting.** Even whitespace differences invalidate the cache.

**Monitor cache hit rates.** Most providers offer metrics on cache utilization.

**Design for caching from the start.** Restructuring prompts for caching after deployment is painful.

**Consider the 5-minute TTL.** If requests are more than 5 minutes apart, the cache expires and you pay the write cost again.
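When a session's idle gaps exceed the TTL, one option is a cheap cache-read "ping" to keep the prefix warm. Whether that beats re-paying the write cost depends on the gap length. A sketch of the tradeoff, assuming the 5-minute TTL and the multipliers above, and ignoring the ping's own output tokens:

```python
def keepalive_worth_it(prefix_tokens, idle_minutes, ttl_minutes=5,
                       base_price=3.0, write_mult=1.25, read_mult=0.10):
    """Compare re-writing the cache after expiry vs. sending periodic
    cache-read pings (one per TTL window) to keep it warm while idle."""
    rewrite_cost = prefix_tokens / 1e6 * base_price * write_mult
    n_pings = max(0, int(idle_minutes // ttl_minutes))
    ping_cost = n_pings * prefix_tokens / 1e6 * base_price * read_mult
    return ping_cost < rewrite_cost, ping_cost, rewrite_cost

print(keepalive_worth_it(30_000, idle_minutes=20))
```

For a 30K-token prefix idle for 20 minutes, four read-priced pings cost $0.036 against a $0.1125 rewrite, so keeping the cache warm wins; for very long gaps the comparison eventually flips.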
Monthly savings ≈ N × (prefix_tokens / 1M) × base_price × 0.90, where N = requests per month.

For a coding agent with 30K prefix tokens and 10,000 requests/month:

10,000 × (30,000 / 1M) × $3 × 0.90 = $810

That's $810/month saved by a configuration change that takes 5 minutes to implement.
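The same back-of-envelope estimate as a helper (this ignores the one-time write premium, which is noise at this volume):

```python
def monthly_savings(prefix_tokens: int, requests_per_month: int,
                    base_price: float = 3.0, read_mult: float = 0.10) -> float:
    """Approximate monthly savings from caching a fixed prefix:
    each request pays read_mult instead of 1.0x on the prefix."""
    per_request = prefix_tokens / 1e6 * base_price * (1 - read_mult)
    return per_request * requests_per_month

print(f"${monthly_savings(30_000, 10_000):,.2f}/month")  # $810.00/month
```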
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai