The Cost Curve: How Context Length Affects Your API Bill
The Taxi Meter Analogy
Imagine a taxi where the meter doesn’t just measure distance — it measures total distance traveled since the trip started. Each block costs more than the last because the meter includes all previous blocks. By block 20, you’re paying for the cumulative distance of all 20 blocks, not just the 20th one.
This is how multi-turn AI conversations work. Each turn resends all previous context, so every turn costs more than the one before it and the total bill grows quadratically with the number of turns.
The Basic Cost Formula
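$$C = \frac{T_{\text{in}}}{10^6} \cdot P_{\text{in}} + \frac{T_{\text{out}}}{10^6} \cdot P_{\text{out}}$$
Where:
- $T_{\text{in}}$ = total input tokens
- $T_{\text{out}}$ = total output tokens
- $P_{\text{in}}$ = price per million input tokens
- $P_{\text{out}}$ = price per million output tokens
Simple enough. But in a multi-turn conversation, $T_{\text{in}}$ is not just your messages; it's the entire conversation history.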
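The formula above can be sketched as a one-line helper. The function name `request_cost` and the default prices ($3 and $15 per million tokens) are illustrative assumptions, not any provider's API.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    price_input: float = 3.0,    # assumed $ per million input tokens
    price_output: float = 15.0,  # assumed $ per million output tokens
) -> float:
    """Dollar cost of a single API call."""
    return input_tokens / 1e6 * price_input + output_tokens / 1e6 * price_output


# One call with a 2,200-token prompt and an 800-token reply
print(f"${request_cost(2200, 800):.4f}")  # $0.0186
```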
Multi-Turn Accumulation
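In a conversation where each turn adds $t$ tokens (user message + AI response), turn $i$ sends roughly $i \cdot t$ input tokens:
| Turn | Input Tokens | Cumulative |
|---|---|---|
| 1 | $t$ | $t$ |
| 2 | $2t$ | $3t$ |
| 3 | $3t$ | $6t$ |
| … | … | … |
Total input tokens across $n$ turns:
$$T_{\text{in}} = \sum_{i=1}^{n} i \cdot t = \frac{n(n+1)}{2} \cdot t$$
This is quadratic in $n$! A 20-turn conversation doesn't cost 20× a single turn in input tokens; it costs $\frac{20 \cdot 21}{2} = 210$ single-turn units, which is 10.5× the naive estimate of 20. The fuller cost model in `multi_turn_cost` adds a fixed system prompt and output pricing, which soften that ratio in dollar terms.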
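As a quick sanity check of the quadratic growth, the sketch below compares a brute-force sum with the closed form $t \cdot n(n+1)/2$, under the simplified model where turn $i$ re-sends $i \cdot t$ tokens. The function names and the 1,000-token turn size are illustrative assumptions.

```python
def total_input_simple(n_turns: int, tokens_per_turn: int) -> int:
    """Brute force: add up the input of every turn, where turn i sends i * t tokens."""
    return sum(i * tokens_per_turn for i in range(1, n_turns + 1))


def total_input_closed_form(n_turns: int, tokens_per_turn: int) -> int:
    """Closed form of the same sum: t * n * (n + 1) / 2."""
    return tokens_per_turn * n_turns * (n_turns + 1) // 2


n, t = 20, 1000
assert total_input_simple(n, t) == total_input_closed_form(n, t)
print(total_input_closed_form(n, t))            # 210000
print(total_input_closed_form(n, t) / (n * t))  # 10.5 (vs. the naive n * t)
```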
def multi_turn_cost(
    n_turns: int,
    tokens_per_user_msg: int = 200,
    tokens_per_ai_response: int = 800,
    system_prompt_tokens: int = 2000,
    price_input: float = 3.0,    # $ per M tokens
    price_output: float = 15.0,  # $ per M tokens
) -> dict:
    """
    Calculate the total cost of a multi-turn conversation.

    Each turn's input includes:
    - System prompt (constant)
    - All previous messages (grows linearly)
    - Current user message
    """
    tokens_per_turn = tokens_per_user_msg + tokens_per_ai_response
    total_input = 0
    total_output = 0
    for turn in range(1, n_turns + 1):
        # Input: system prompt + all previous turns + current message
        input_tokens = system_prompt_tokens + (turn - 1) * tokens_per_turn + tokens_per_user_msg
        output_tokens = tokens_per_ai_response
        total_input += input_tokens
        total_output += output_tokens
    input_cost = total_input / 1e6 * price_input
    output_cost = total_output / 1e6 * price_output
    return {
        "turns": n_turns,
        "total_input_tokens": total_input,
        "total_output_tokens": total_output,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
        "cost_per_turn": (input_cost + output_cost) / n_turns,
        # What you would pay if every turn were a fresh single-turn request
        "naive_expectation": n_turns * (
            (system_prompt_tokens + tokens_per_user_msg) / 1e6 * price_input +
            tokens_per_ai_response / 1e6 * price_output
        ),
    }

# Show cost scaling
print(f"{'Turns':>6} {'Input Tokens':>14} {'Total Cost':>12} "
      f"{'Naive Expected':>15} {'Actual/Naive':>12}")
print("=" * 65)
for n in [1, 5, 10, 20, 50, 100]:
    result = multi_turn_cost(n)
    ratio = result["total_cost"] / result["naive_expectation"]
    print(f"{n:>6} {result['total_input_tokens']:>14,} "
          f"${result['total_cost']:>11.4f} "
          f"${result['naive_expectation']:>14.4f} {ratio:>11.1f}×")
Output:
 Turns   Input Tokens   Total Cost  Naive Expected Actual/Naive
=================================================================
     1          2,200 $     0.0186 $        0.0186         1.0×
     5         21,000 $     0.1230 $        0.0930         1.3×
    10         67,000 $     0.3210 $        0.1860         1.7×
    20        234,000 $     0.9420 $        0.3720         2.5×
    50      1,335,000 $     4.6050 $        0.9300         5.0×
   100      5,170,000 $    16.7100 $        1.8600         9.0×
A 100-turn conversation costs about 9× what you'd naively expect!
Provider Cost Comparison
Current API pricing (approximate, 2026):
| Provider | Model | Input ($/M) | Output ($/M) |
|---|---|---|---|
| Anthropic | Claude Sonnet | $3.00 | $15.00 |
| Anthropic | Claude Haiku | $0.25 | $1.25 |
| OpenAI | GPT-4o | $2.50 | $10.00 |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 |
| Google | Gemini 1.5 Pro | $1.25 | $5.00 |
providers = [
    ("Claude Sonnet", 3.00, 15.00),
    ("Claude Haiku", 0.25, 1.25),
    ("GPT-4o", 2.50, 10.00),
    ("GPT-4o-mini", 0.15, 0.60),
    ("Gemini 1.5 Pro", 1.25, 5.00),
]
# Cost for a 20-turn conversation
print("\n20-Turn Conversation Cost Comparison:")
print(f"{'Model':<20} {'Input Cost':>12} {'Output Cost':>12} {'Total':>12}")
print("=" * 60)
for name, p_in, p_out in providers:
    result = multi_turn_cost(20, price_input=p_in, price_output=p_out)
    print(f"{name:<20} ${result['input_cost']:>11.4f} "
          f"${result['output_cost']:>11.4f} ${result['total_cost']:>11.4f}")
Cost Optimization Strategies
1. Prompt Caching
Cache the system prompt and early conversation so the same prefix is not re-processed, and re-billed at full price, on every turn.
With Anthropic’s caching (cache reads billed at 0.1× the base input price):
def cost_with_caching(
    n_turns: int,
    cached_prefix_tokens: int = 5000,  # System prompt + tools
    **kwargs,
) -> dict:
    """Cost with prompt caching (90% savings on cached prefix reads)."""
    result = multi_turn_cost(n_turns, **kwargs)
    # Savings: cached tokens are read at 10% price on all but the first turn.
    # Simplification: assumes the full prefix is present in every later request
    # and ignores any cache-write surcharge or cache expiry.
    price_input = kwargs.get("price_input", 3.0)
    cache_savings = (n_turns - 1) * cached_prefix_tokens / 1e6 * price_input * 0.9
    result["total_cost_cached"] = result["total_cost"] - cache_savings
    result["cache_savings"] = cache_savings
    return result
2. Conversation Truncation
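Keep only the last $k$ turns instead of the full history. Each request then carries at most the system prompt, about $k$ turns of history, and the current message.
This caps input tokens per turn at a constant regardless of conversation length, so total cost grows linearly rather than quadratically.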
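A minimal sketch of the sliding-window idea, reusing the same token sizes as the earlier cost model. The function name `truncated_input_tokens` and the `keep_last` parameter are hypothetical, and the model assumes every turn is the same size.

```python
def truncated_input_tokens(
    n_turns: int,
    keep_last: int = 5,                 # assumed window size (k)
    tokens_per_user_msg: int = 200,
    tokens_per_ai_response: int = 800,
    system_prompt_tokens: int = 2000,
) -> list[int]:
    """Input tokens sent at each turn when only the last `keep_last` turns are kept."""
    tokens_per_turn = tokens_per_user_msg + tokens_per_ai_response
    per_turn = []
    for turn in range(1, n_turns + 1):
        # History is capped at `keep_last` turns instead of growing with `turn`.
        history_turns = min(turn - 1, keep_last)
        per_turn.append(
            system_prompt_tokens + history_turns * tokens_per_turn + tokens_per_user_msg
        )
    return per_turn


inputs = truncated_input_tokens(20, keep_last=5)
print(max(inputs))  # 7200: the per-turn ceiling
print(sum(inputs))  # 129000, versus 234,000 with the full history
```

The trade-off is that anything older than the window is simply gone, so the model can no longer refer back to it.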
3. Summarization
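Periodically replace older turns with a compact summary and keep only the recent turns verbatim.
Typical compression is 10:1 to 20:1, which shrinks the accumulated context significantly, at the cost of an extra summarization call and some lost detail.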
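The effect of summarization can be estimated with a rough model like the one below. It is a sketch under stated assumptions: the history is compressed every `summarize_every` turns at a fixed `compression_ratio`, summaries are appended rather than re-compressed, and the cost of the summarization calls themselves is left out. All names and defaults are hypothetical.

```python
def summarized_input_tokens(
    n_turns: int,
    summarize_every: int = 5,         # assumed summarization interval
    compression_ratio: float = 10.0,  # assumed 10:1 compression
    tokens_per_turn: int = 1000,
    tokens_per_user_msg: int = 200,
    system_prompt_tokens: int = 2000,
) -> int:
    """Total input tokens when older turns are periodically folded into a summary."""
    summary_tokens = 0.0  # compressed stand-in for older turns
    recent_tokens = 0     # verbatim turns since the last summary
    total_input = 0.0
    for turn in range(1, n_turns + 1):
        total_input += (
            system_prompt_tokens + summary_tokens + recent_tokens + tokens_per_user_msg
        )
        recent_tokens += tokens_per_turn
        if turn % summarize_every == 0:
            # Fold the verbatim turns into the summary at the assumed ratio.
            summary_tokens += recent_tokens / compression_ratio
            recent_tokens = 0
    return round(total_input)


print(summarized_input_tokens(20))  # 99000, versus 234,000 with the full history
```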
The “20 Questions” Cost Breakdown
Let’s trace a realistic 20-turn coding conversation:
def detailed_conversation_trace():
    """Detailed cost trace of a 20-turn coding conversation."""
    system_prompt = 3000  # tokens
    # (description, user message tokens, AI response tokens)
    turns = [
        ("User asks about a bug", 150, 500),
        ("User shares code", 800, 600),
        ("User asks for explanation", 100, 1200),
        ("User requests changes", 200, 800),
        ("User reports error", 300, 400),
        ("User shares error log", 500, 600),
        ("User asks follow-up", 100, 800),
        ("User requests refactor", 150, 1500),
        ("User asks about testing", 100, 1000),
        ("User shares test results", 400, 600),
        ("User asks for optimization", 100, 1200),
        ("User shares benchmark", 300, 500),
        ("User requests documentation", 100, 800),
        ("User asks about deployment", 150, 600),
        ("User shares config", 600, 400),
        ("User reports new issue", 200, 800),
        ("User asks for review", 100, 1500),
        ("User requests final changes", 150, 600),
        ("User asks about CI/CD", 100, 800),
        ("User says thanks", 50, 200),
    ]
    cumulative = system_prompt  # context carried into the next turn
    total_input = 0
    total_output = 0
    price_in = 3.0 / 1e6    # $ per input token
    price_out = 15.0 / 1e6  # $ per output token
    print(f"{'Turn':>4} {'Description':<30} {'Input':>8} {'Output':>8} "
          f"{'Turn $':>8} {'Running $':>10}")
    print("=" * 75)
    running_cost = 0
    for i, (desc, user_tok, ai_tok) in enumerate(turns, 1):
        input_tokens = cumulative + user_tok
        turn_cost = input_tokens * price_in + ai_tok * price_out
        running_cost += turn_cost
        total_input += input_tokens
        total_output += ai_tok
        cumulative += user_tok + ai_tok
        print(f"{i:>4} {desc:<30} {input_tokens:>8,} {ai_tok:>8} "
              f"${turn_cost:>7.4f} ${running_cost:>9.4f}")
    print("=" * 75)
    print(f"{'':>4} {'TOTAL':<30} {total_input:>8,} {total_output:>8} "
          f"{'':>8} ${running_cost:>9.4f}")
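detailed_conversation_trace()
With these token counts and prices, the total for this 20-turn conversation comes to about $1.03 (264,900 input tokens and 15,400 output tokens), and roughly three quarters of that is re-sent input. That is much more than most users expect from “just chatting.”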
Key Takeaways
Multi-turn costs grow quadratically. Turn $n$ processes all previous context, so total cost scales as $O(n^2)$.
The cost you feel ≠ the cost you pay. A 20-turn conversation feels like 20 simple queries but costs 2-3× more due to accumulation.
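Prompt caching is essential. For production systems with repeated prefixes, cached reads are billed at a fraction of the base input price (0.1× with Anthropic), which saves up to 90% on the cached portion of input costs.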
Strategic truncation saves money. Keeping only the last 5-10 turns instead of the full history caps the per-turn input cost at a constant.
Know your provider’s pricing. At the rates listed above, Sonnet costs 12× as much as Haiku, and 20-25× as much as GPT-4o-mini.