Prompt caching saves 90% on repeated context. For coding agents carrying 20-40K tokens of system prompts, the savings are enormous.
Imagine buying a movie ticket every time you visit the theater. Now imagine buying a season pass — you pay once (slightly more than a single ticket) and then get heavily discounted entry for every subsequent visit.
Prompt caching works the same way. The first time you send a long system prompt, you "cache" it (slightly more expensive). Every subsequent request with the same prefix reads from the cache at a 90% discount.
| Operation | Price (relative to base) |
|---|---|
| Cache write | 1.25× base input price |
| Cache read | 0.10× base input price |
| Normal input | 1.00× base input price |
For Claude Sonnet at $3/M base input, that works out to:

- Cache write: $3.75/M
- Cache read: $0.30/M
- Normal input: $3.00/M

Caching costs more on the first request (1.25×) but saves on every subsequent request (0.10×). The breakeven is easy to compute: without caching, n requests cost n × 1.00 (relative to base price); with caching, they cost 1.25 + 0.10(n − 1). The cached side is cheaper once n > 1.28, so caching pays off after just 2 requests!
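The breakeven arithmetic above can be sketched directly (relative to the base input price, ignoring output tokens):

```python
def breakeven_requests(write_mult: float = 1.25, read_mult: float = 0.10) -> int:
    """Smallest request count n at which caching the prefix wins.

    Without caching, n requests cost n * 1.00 (relative to base price).
    With caching, they cost write_mult + (n - 1) * read_mult.
    """
    n = 1
    while write_mult + (n - 1) * read_mult >= n:
        n += 1
    return n

print(breakeven_requests())  # 2: cached 1.25 + 0.10 = 1.35x vs 2.00x uncached
```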
```python
def caching_savings(
    prefix_tokens: int,
    n_requests: int,
    new_tokens_per_request: int = 500,
    base_input_price: float = 3.0,  # $/M
    cache_write_multiplier: float = 1.25,
    cache_read_multiplier: float = 0.10,
    output_tokens: int = 500,
    output_price: float = 15.0,  # $/M
) -> dict:
    """Compare costs with and without caching."""
    # Without caching
    cost_no_cache = n_requests * (
        (prefix_tokens + new_tokens_per_request) / 1e6 * base_input_price
        + output_tokens / 1e6 * output_price
    )

    # With caching
    write_cost = prefix_tokens / 1e6 * base_input_price * cache_write_multiplier
    read_cost = (n_requests - 1) * prefix_tokens / 1e6 * base_input_price * cache_read_multiplier
    new_cost = n_requests * new_tokens_per_request / 1e6 * base_input_price
    output_cost = n_requests * output_tokens / 1e6 * output_price
    cost_with_cache = write_cost + read_cost + new_cost + output_cost

    savings = cost_no_cache - cost_with_cache
    savings_pct = savings / cost_no_cache * 100
    return {
        "no_cache": cost_no_cache,
        "with_cache": cost_with_cache,
        "savings": savings,
        "savings_pct": savings_pct,
    }


# Typical coding agent scenarios
scenarios = [
    ("Light usage", 5_000, 10),
    ("Coding session", 20_000, 50),
    ("Heavy agent", 40_000, 200),
    ("CI/CD pipeline", 30_000, 1000),
]

print(f"{'Scenario':<20} {'Prefix':>8} {'Requests':>8} "
      f"{'No Cache':>10} {'Cached':>10} {'Savings':>10}")
print("=" * 70)
for name, prefix, n_req in scenarios:
    r = caching_savings(prefix, n_req)
    print(f"{name:<20} {prefix:>7,} {n_req:>8} "
          f"${r['no_cache']:>9.2f} ${r['with_cache']:>9.2f} "
          f"{r['savings_pct']:>9.1f}%")
```

Output:
```
Scenario               Prefix Requests   No Cache     Cached    Savings
======================================================================
Light usage             5,000       10 $     0.24 $     0.12      49.1%
Coding session         20,000       50 $     3.45 $     0.82      76.3%
Heavy agent            40,000      200 $    25.80 $     4.34      83.2%
CI/CD pipeline         30,000     1000 $    99.00 $    18.10      81.7%
```

For a heavy agent making 200 requests with a 40K-token system prompt, caching saves $21.46 per session, an 83% reduction.
The cache key is the exact byte-for-byte prefix of the prompt. Any change — even a single character — invalidates the cache.
Cacheable (stays constant):

- System prompt and instructions
- Tool definitions
- Project context (e.g. a CLAUDE.md file)

Not cacheable (changes each request):

- Conversation history as it grows
- The current user message
- Timestamps and other request-specific metadata
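One way to internalize the byte-for-byte rule is to think of the prefix as hashed. The `cache_key` function below is a hypothetical model, not a provider API, but it shows how a single trailing space yields a different key and therefore a cache miss:

```python
import hashlib

def cache_key(prefix: str) -> str:
    # Hypothetical model of prefix matching: providers compare the exact
    # byte prefix, which behaves like hashing the serialized prompt.
    return hashlib.sha256(prefix.encode()).hexdigest()[:16]

a = cache_key("You are an AI coding assistant.")
b = cache_key("You are an AI coding assistant. ")  # one trailing space
print(a == b)  # False: one byte of difference means a cache miss
```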
```python
# Structure your prompt for maximum cache hits:

def build_prompt(system_prompt, tools, static_context,
                 conversation_history, user_message):
    """
    Place static content first (cacheable prefix),
    dynamic content last (not cached).
    """
    return [
        # ─── CACHEABLE PREFIX (stays identical across requests) ───
        {"role": "system", "content": system_prompt},
        # Tool definitions are part of the API request structure
        # and are also cached if identical
        {"role": "system", "content": static_context},
        # ─── DYNAMIC SUFFIX (changes each request) ───
        *conversation_history,
        {"role": "user", "content": user_message},
    ]
```
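On Anthropic's API, the boundary between cached prefix and dynamic suffix is marked explicitly with a `cache_control` breakpoint on the last static block. The payload below is a sketch based on Anthropic's published prompt-caching docs; the model id is a placeholder, and field names should be checked against the current API reference:

```python
# Sketch of an Anthropic Messages API payload with a cache breakpoint.
# No network call is made; this only shows the request shape.
request = {
    "model": "<model-id>",  # placeholder, not a real model name
    "system": [
        {"type": "text", "text": "You are an AI coding assistant..."},
        {
            "type": "text",
            "text": "[project CLAUDE.md contents]",
            # Everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [{"role": "user", "content": "Fix the failing test."}],
    "max_tokens": 1024,
}
```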
```python
# Example: Claude Code agent
system_prompt = """You are an AI coding assistant..."""          # 3K tokens
tools = """[tool definitions for bash, read, edit, write...]"""  # 10K tokens
claude_md = """[project CLAUDE.md file contents]"""              # 5K tokens

# Total cacheable prefix: ~18K tokens
# With caching at $3/M input:
#   Without cache:   18K × $3/M    = $0.054  per request
#   With cache read: 18K × $0.30/M = $0.0054 per request
#   Savings: $0.0486 per request → $4.86 per 100 requests
```

Multi-turn conversations present a challenge: the conversation history changes each turn, breaking the cache. The solution: cache the static prefix, not the conversation.
```python
def multi_turn_with_caching(
    n_turns: int,
    system_tokens: int = 15_000,  # Cached prefix
    tokens_per_turn: int = 1000,
    base_price: float = 3.0,
):
    """Compare multi-turn costs with and without caching."""
    total_no_cache = 0
    total_with_cache = 0

    for turn in range(1, n_turns + 1):
        conversation_tokens = (turn - 1) * tokens_per_turn
        new_tokens = tokens_per_turn

        # Without caching: pay full price for everything
        turn_tokens = system_tokens + conversation_tokens + new_tokens
        total_no_cache += turn_tokens / 1e6 * base_price

        # With caching: pay cache-read for system, full for rest
        cached_cost = system_tokens / 1e6 * base_price * 0.10  # Cache read
        dynamic_cost = (conversation_tokens + new_tokens) / 1e6 * base_price
        total_with_cache += cached_cost + dynamic_cost

    # The first request pays the 1.25× write price, not the 0.10× read
    # price already counted in the loop, so add the 1.15× difference
    total_with_cache += system_tokens / 1e6 * base_price * 1.15

    savings_pct = (1 - total_with_cache / total_no_cache) * 100
    print(f"\n{n_turns}-turn conversation, {system_tokens:,} token cached prefix:")
    print(f"  Without caching: ${total_no_cache:.4f}")
    print(f"  With caching:    ${total_with_cache:.4f}")
    print(f"  Savings:         {savings_pct:.1f}%")


for n in [5, 10, 20, 50]:
    multi_turn_with_caching(n)
```

**Put static content first.** System prompt, tools, and project context should be the prefix of every request.
**Don't include timestamps in cached content.** A timestamp changes every second, breaking the cache.

**Use consistent formatting.** Even whitespace differences invalidate the cache.

**Monitor cache hit rates.** Most providers offer metrics on cache utilization.

**Design for caching from the start.** Restructuring prompts for caching after deployment is painful.

**Consider the 5-minute TTL.** If requests are more than 5 minutes apart, the cache expires and you pay the write cost again.
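When a session's idle gaps exceed the TTL, one option is a cheap cache-read "ping" to keep the prefix warm. Whether that beats re-paying the write cost depends on the gap length. A sketch of the tradeoff, assuming the 5-minute TTL and the multipliers above, and ignoring the ping's own output tokens:

```python
def keepalive_worth_it(prefix_tokens, idle_minutes, ttl_minutes=5,
                       base_price=3.0, write_mult=1.25, read_mult=0.10):
    """Compare re-writing the cache after expiry vs. sending periodic
    cache-read pings (one per TTL window) to keep it warm while idle."""
    rewrite_cost = prefix_tokens / 1e6 * base_price * write_mult
    n_pings = max(0, int(idle_minutes // ttl_minutes))
    ping_cost = n_pings * prefix_tokens / 1e6 * base_price * read_mult
    return ping_cost < rewrite_cost, ping_cost, rewrite_cost

print(keepalive_worth_it(30_000, idle_minutes=20))
```

For a 30K-token prefix idle for 20 minutes, four read-priced pings cost $0.036 against a $0.1125 rewrite, so keeping the cache warm wins; for very long gaps the comparison eventually flips.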
Monthly savings ≈ N × (prefix_tokens / 1M) × base_price × 0.90, where N = requests per month.

For a coding agent with 30K prefix tokens and 10,000 requests/month:

10,000 × (30,000 / 1M) × $3 × 0.90 = $810

That's $810/month saved by a configuration change that takes 5 minutes to implement.
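The same back-of-envelope estimate as a helper (this ignores the one-time write premium, which is noise at this volume):

```python
def monthly_savings(prefix_tokens: int, requests_per_month: int,
                    base_price: float = 3.0, read_mult: float = 0.10) -> float:
    """Approximate monthly savings from caching a fixed prefix:
    each request pays read_mult instead of 1.0x on the prefix."""
    per_request = prefix_tokens / 1e6 * base_price * (1 - read_mult)
    return per_request * requests_per_month

print(f"${monthly_savings(30_000, 10_000):,.2f}/month")  # $810.00/month
```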
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai