Tokens Explained: Why Your AI Counts Words Differently Than You Do
The Simple Version: What Is a Token?
Imagine you’re learning a new language. At first, you read letter by letter: C-A-T. But eventually, you learn to recognize whole words at a glance: “cat.” And then common phrases: “once upon a time.”
AI models do something similar — but instead of learning words, they learn tokens. A token is a chunk of text that the model has learned to recognize as a single unit. Sometimes a token is a whole word. Sometimes it’s part of a word. Sometimes it’s just a space or a punctuation mark.
Here’s the key insight: the AI never sees your text as words. It converts everything into tokens first, processes the tokens, then converts them back to text for you.
How Tokenization Works: BPE
The most common tokenization method is called Byte-Pair Encoding (BPE). Here’s how it works, step by step:
Step 1: Start with Individual Characters
Take the word “lowest”:
l, o, w, e, s, t
Step 2: Find the Most Common Pair
The algorithm looks at a huge training corpus and finds which pair of adjacent characters appears most often. Let’s say “es” is the most common pair.
Step 3: Merge That Pair into a New Token
Now “lowest” becomes:
l, o, w, es, t
Step 4: Repeat
Find the next most common pair. Maybe “lo” is common:
lo, w, es, t
Keep going. Eventually “low” becomes a token, then “lowest” might become a single token if it appears often enough.
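The merge loop described in the steps above can be sketched in a few lines of Python. This is a minimal illustration on a toy word-frequency table, not a production tokenizer: the corpus and the function names `most_frequent_pair`, `merge_pair`, and `train_bpe` are made up for this example, and real tokenizers typically work on bytes, pre-split the text, and are heavily optimized.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most common adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties are broken by first-seen order (dicts preserve insertion order).
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges, starting from individual characters."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Toy corpus: word -> number of occurrences (invented for illustration).
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = train_bpe(corpus, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
print(words)
```

On this toy corpus the first merge is “es”, followed by “est” and then “lo”, which mirrors the walkthrough. A different corpus would produce a different merge order, which is exactly why tokenizers trained on different data disagree.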
The Math of Vocabulary Size
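Modern tokenizers have vocabularies ranging from about 50,000 to over 250,000 tokens. The vocabulary defines how the model maps between text and numbers:
$$\text{tokenize}: \text{text} \to (t_1, \dots, t_n), \quad t_i \in \{0, 1, \dots, |V| - 1\}$$
Each token gets a unique integer ID. The sentence “Hello world” might become a list of just two IDs, one for “Hello” and one for “ world” (the exact numbers depend on the tokenizer).
That’s it. Two numbers. That’s what the AI actually processes.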
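To make the text-to-ID mapping concrete, here is a small sketch with a hypothetical four-entry vocabulary. The IDs are invented, and the greedy longest-match lookup is a simplification chosen for readability; real BPE tokenizers apply their learned merges in order instead.

```python
# Hypothetical vocabulary: real ones hold on the order of 100,000 entries,
# and these IDs are invented for illustration.
VOCAB = {"Hello": 0, " world": 1, "!": 2, " ": 3}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def encode(text: str) -> list[int]:
    """Convert text to token IDs using greedy longest-match lookup."""
    ids = []
    pos = 0
    while pos < len(text):
        candidates = [tok for tok in VOCAB if text.startswith(tok, pos)]
        if not candidates:
            raise ValueError(f"No token covers position {pos}: {text[pos:]!r}")
        best = max(candidates, key=len)
        ids.append(VOCAB[best])
        pos += len(best)
    return ids


def decode(ids: list[int]) -> str:
    """Convert token IDs back to text by concatenating the token strings."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


ids = encode("Hello world")
print(ids)          # [0, 1]
print(decode(ids))  # Hello world
```

Note that the space belongs to the second token (“ world”), which is how many tokenizers handle word boundaries.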
Token ≠ Word: The Examples
Let’s look at how different texts tokenize:
| Text | Tokens | Count |
|---|---|---|
| “Hello” | [“Hello”] | 1 |
| “hello” | [“hello”] | 1 |
| “HELLO” | [“HE”, “LLO”] | 2 |
| “Unconstitutional” | [“Un”, “const”, “itutional”] | 3–4 |
| “AI” | [“AI”] | 1 |
| “artificial intelligence” | [“artificial”, “ intelligence”] | 2 |
| “🤖” | [“🤖”] | 1–2 |
Notice something? Capitalization matters. “HELLO” costs more tokens than “Hello” because all-caps words are less common in training data, so the tokenizer breaks them into smaller pieces.
Why Code Is More Expensive Than Prose
This is a crucial insight for developers. Code is more token-dense than English prose: the same number of characters produces more tokens.
Here’s a simple Python function:
```python
def calculate_average(numbers: list[float]) -> float:
    if not numbers:
        raise ValueError("Cannot calculate average of empty list")
    return sum(numbers) / len(numbers)
```
This is only 4 lines of code, but it tokenizes into roughly 40–50 tokens. Why?
- Punctuation is expensive. Every `(`, `)`, `:`, `[`, `]`, `->` is typically its own token.
- Indentation costs tokens. Spaces and tabs are tokenized too.
- CamelCase and snake_case split. `calculate_average` might become `["calculate", "_average"]`.
- Type hints add tokens. `list[float]` is 4+ tokens.
Let’s compare token density:
```python
# Compare token density across different content types
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

english_paragraph = """
The quick brown fox jumps over the lazy dog. This sentence contains
every letter of the English alphabet. It has been used since at least
the late 19th century for testing typewriters and fonts.
"""

python_code = """
def fibonacci(n: int) -> list[int]:
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    fib = [0, 1]
    for i in range(2, n):
        fib.append(fib[i-1] + fib[i-2])
    return fib
"""

json_data = """
{
  "users": [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
    {"id": 3, "name": "Charlie", "email": "charlie@example.com"}
  ]
}
"""

# Report characters, whitespace-separated words, and tokens for each sample
for name, text in [("English", english_paragraph),
                   ("Python", python_code),
                   ("JSON", json_data)]:
    tokens = encoder.encode(text)
    words = len(text.split())
    chars = len(text)
    print(f"{name}:")
    print(f"  Characters: {chars}")
    print(f"  Words: {words}")
    print(f"  Tokens: {len(tokens)}")
    print(f"  Chars/Token: {chars/len(tokens):.1f}")
    print(f"  Tokens/Word: {len(tokens)/words:.1f}")
    print()
```
Typical results:
| Content Type | Chars/Token | Tokens/Word |
|---|---|---|
| English prose | 4.0–4.5 | 1.2–1.4 |
| Python code | 3.0–3.5 | 1.8–2.5 |
| JSON | 2.5–3.0 | 2.5–3.5 |
| XML/HTML | 2.0–2.5 | 3.0–4.0 |
JSON and XML are token-expensive. All those brackets, quotes, and colons add up. Going by the chars-per-token ranges above, a 1KB JSON file uses roughly 1.5× as many tokens as a 1KB English text file.
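The 1KB comparison can be checked with quick arithmetic. The sketch below takes the midpoint of each chars-per-token range from the table above and estimates the token count for 1,024 characters; the midpoints and the helper name `estimate_tokens_from_chars` are assumptions for illustration, and real counts vary with the tokenizer and the content.

```python
# Midpoint of each chars-per-token range from the table above.
CHARS_PER_TOKEN = {
    "English prose": (4.0 + 4.5) / 2,
    "Python code": (3.0 + 3.5) / 2,
    "JSON": (2.5 + 3.0) / 2,
    "XML/HTML": (2.0 + 2.5) / 2,
}


def estimate_tokens_from_chars(num_chars: int, content_type: str) -> int:
    """Rough token estimate for `num_chars` characters of a content type."""
    return round(num_chars / CHARS_PER_TOKEN[content_type])


for content_type in CHARS_PER_TOKEN:
    tokens = estimate_tokens_from_chars(1024, content_type)
    print(f"{content_type}: ~{tokens} tokens per 1KB")

ratio = (estimate_tokens_from_chars(1024, "JSON")
         / estimate_tokens_from_chars(1024, "English prose"))
print(f"JSON vs English: {ratio:.2f}x")
```

With these midpoints, 1KB of English comes out at about 241 tokens and 1KB of JSON at about 372, a ratio of roughly 1.5.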
The Token Economy: Why This Matters for Your Bill
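AI APIs charge per token. Here’s the pricing math:
$$\text{Cost} = \frac{N_{\text{in}} \times P_{\text{in}} + N_{\text{out}} \times P_{\text{out}}}{1{,}000{,}000}$$
Where $N_{\text{in}}$ and $N_{\text{out}}$ are the input and output token counts, and $P_{\text{in}}$ and $P_{\text{out}}$ are the per-million-token prices.
For Claude Sonnet at \$3/M input and \$15/M output:
$$P_{\text{in}} = 3, \quad P_{\text{out}} = 15$$
A request with 10,000 input tokens and 2,000 output tokens costs:
$$\text{Cost} = \frac{10{,}000 \times 3 + 2{,}000 \times 15}{1{,}000{,}000} = \$0.06$$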
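The same calculation as a small helper, for readers who want to plug in their own numbers. The function name `request_cost` is made up for this sketch, and the prices are assumed to be \$3 per million input tokens and \$15 per million output tokens; check your provider's current price list before relying on them.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars of one request, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Assumed prices: $3 per million input tokens, $15 per million output tokens.
cost = request_cost(10_000, 2_000, input_price_per_m=3.0, output_price_per_m=15.0)
print(f"${cost:.2f}")  # $0.06
```

Output tokens are five times as expensive here, so 2,000 output tokens cost as much as 10,000 input tokens.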
But here’s what trips people up: in a multi-turn conversation, input tokens accumulate. Every turn carries all previous context:
- Turn 1: 1,000 input tokens
- Turn 2: 1,000 (previous) + 1,000 (new) = 2,000 input tokens
- Turn 3: 2,000 + 1,000 = 3,000 input tokens
- Turn 10: 10,000 input tokens
Total input tokens across 10 turns:
$$\sum_{k=1}^{10} 1{,}000\,k = 1{,}000 \times \frac{10 \times 11}{2} = 55{,}000$$
That’s 55,000 tokens for what felt like 10,000 tokens of conversation.
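The accumulation is easy to simulate. This sketch assumes every turn adds the same number of new tokens (1,000, as in the list above) and that the full history is resent each time; the function names are made up for illustration, and model output being fed back as input is ignored for simplicity.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when each turn resends the whole history."""
    total = 0
    context = 0
    for _ in range(turns):
        context += tokens_per_turn  # history grows by one turn
        total += context            # the whole history is billed again
    return total


def cumulative_closed_form(turns: int, tokens_per_turn: int) -> int:
    """Same total via the arithmetic-series formula n(n + 1) / 2."""
    return tokens_per_turn * turns * (turns + 1) // 2


print(cumulative_input_tokens(10, 1_000))  # 55000
print(cumulative_input_tokens(20, 1_000))  # 210000
```

Doubling the conversation from 10 to 20 turns nearly quadruples the billed input tokens, which is the quadratic growth in action.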
Different Models, Different Tokenizers
Not all models tokenize the same way. Here’s why:
- Different training data → Different token frequencies → Different merges in BPE
- Different vocabulary sizes → More tokens = finer granularity but larger embedding tables
| Model Family | Tokenizer | Vocab Size |
|---|---|---|
| GPT-4 / GPT-4o | cl100k_base / o200k_base | 100K / 200K |
| Claude | Claude tokenizer | ~100K |
| Llama 3 | Llama tokenizer | 128K |
| Gemini | SentencePiece | ~256K |
The same text can produce different token counts across models. “Hello world” might be 2 tokens in GPT-4 but 3 in another model.
How to Count Tokens: Practical Tools
Python (for OpenAI-compatible tokenizers)
```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count the number of tokens in a text string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Example usage
text = "What is the meaning of life?"
print(f"Token count: {count_tokens(text)}")  # Output: 7
```
JavaScript (for web applications)
```javascript
// Using the js-tiktoken library
import { encodingForModel } from "js-tiktoken";

function countTokens(text, model = "gpt-4") {
  const encoding = encodingForModel(model);
  const tokens = encoding.encode(text);
  return tokens.length;
}

console.log(countTokens("What is the meaning of life?")); // 7
```
Quick Estimation (No Library Needed)
```python
def estimate_tokens(text: str) -> int:
    """Quick estimate: ~4 characters per token for English."""
    return len(text) // 4

def estimate_tokens_words(text: str) -> int:
    """Quick estimate: ~1.3 tokens per word for English."""
    return int(len(text.split()) * 1.3)
```
Special Tokens: The Hidden Costs
Beyond your text, models use special tokens that consume context space:
- <|begin_of_text|>: marks the start of input
- <|end_of_text|>: marks the end
- <|start_header_id|>: marks role transitions (system, user, assistant)
- Various tool-call markers
These special tokens are invisible to you but real in the token count. A simple “Hello” message might actually be 8–10 tokens after the framework adds role markers and formatting.
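To see where that overhead comes from, the sketch below wraps a single “Hello” message in a chat template and counts the markers. It assumes a Llama 3 style template; other model families use different markers, and the helper name `render_chat` is made up for this example.

```python
import re

# Matches markers such as <|begin_of_text|> or <|eot_id|>.
SPECIAL_TOKEN = re.compile(r"<\|[a-z_]+\|>")


def render_chat(messages: list[dict]) -> str:
    """Wrap messages in a Llama 3 style chat template (assumed format)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Open the assistant turn so the model knows it should answer next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_chat([{"role": "user", "content": "Hello"}])
special_count = len(SPECIAL_TOKEN.findall(prompt))
print(prompt)
print(f"Special tokens: {special_count}")  # Special tokens: 6
```

Six markers plus the role names and newlines already account for most of the 8–10 tokens, before the single token for “Hello” itself.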
The Tokenization Inequality
Here’s a surprising fact: tokenization is not fair across languages.
English text tokenizes efficiently because most tokenizers are trained predominantly on English text. Other languages — especially those with non-Latin scripts — use more tokens for the same meaning.
| Language | “Hello, how are you?” equivalent | Approx. Tokens |
|---|---|---|
| English | “Hello, how are you?” | 6 |
| Spanish | “Hola, ¿cómo estás?” | 7 |
| Chinese | “你好,你怎么样?” | 8–11 |
| Arabic | “مرحبا، كيف حالك؟” | 10–15 |
| Hindi | “नमस्ते, आप कैसे हैं?” | 15–20 |
This means non-English users effectively get a smaller context window and pay more per word — a significant equity issue in AI.
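One contributing factor can be seen without any tokenizer at all. Many BPE tokenizers operate on UTF-8 bytes, so before any merges are applied a string can cost as many tokens as it has bytes. The sketch below only compares character and byte counts for the greetings in the table; it is an illustration of that starting point, not a token count, and the helper name `utf8_stats` is made up for this example.

```python
GREETINGS = {
    "English": "Hello, how are you?",
    "Spanish": "Hola, ¿cómo estás?",
    "Chinese": "你好,你怎么样?",
    "Arabic": "مرحبا، كيف حالك؟",
    "Hindi": "नमस्ते, आप कैसे हैं?",
}


def utf8_stats(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


for language, greeting in GREETINGS.items():
    chars, num_bytes = utf8_stats(greeting)
    print(f"{language}: {chars} chars, {num_bytes} bytes, "
          f"{num_bytes / chars:.1f} bytes/char")
```

English uses one byte per character, while the other scripts need two or three bytes for most characters. A tokenizer has to learn many more merges to compress those bytes back down, and it learns fewer of them when the training data contains less of that language.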
Key Takeaways
- Tokens ≠ words. One word can be 1–4+ tokens depending on complexity and language.
- Code costs more. Python uses roughly 1.5–2× as many tokens per “word” as English prose. JSON and XML are even worse.
- Your bill grows quadratically in multi-turn conversations. Every turn resends the full history, so cumulative input tokens grow with the square of the number of turns.
- Different models count differently. Always use the right tokenizer for your model.
- Non-English text is more expensive. This follows from tokenizers being trained predominantly on English text, not from a pricing choice.
Understanding tokens is the foundation for understanding everything else about AI performance, cost, and context management.