AI doesn't read words — it reads tokens. Learn how tokenization works, why code costs more than prose, and how to calculate your token usage.
Imagine you’re learning a new language. At first, you read letter by letter: C-A-T. But eventually, you learn to recognize whole words at a glance: “cat.” And then common phrases: “once upon a time.”
AI models do something similar — but instead of learning words, they learn tokens. A token is a chunk of text that the model has learned to recognize as a single unit. Sometimes a token is a whole word. Sometimes it’s part of a word. Sometimes it’s just a space or a punctuation mark.
Here’s the key insight: the AI never sees your text as words. It converts everything into tokens first, processes the tokens, then converts them back to text for you.
The most common tokenization method is called Byte-Pair Encoding (BPE). Here’s how it works, step by step:
Take the word “lowest”:
`l, o, w, e, s, t`

The algorithm looks at a huge training corpus and finds which pair of adjacent characters appears most often. Let's say "es" is the most common pair.
Now “lowest” becomes:
`l, o, w, es, t`

Find the next most common pair. Maybe "lo" is common:
`lo, w, es, t`

Keep going. Eventually "low" becomes a token, then "lowest" might become a single token if it appears often enough.
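The merge loop can be sketched in a few lines of Python. This is a toy illustration of the idea only — real BPE tokenizers operate on bytes over a massive corpus and store the learned merge rules as a vocabulary:

```python
from collections import Counter

def most_frequent_pair(corpus: list[list[str]]) -> tuple[str, str]:
    """Find the most common adjacent symbol pair across the whole corpus."""
    pairs = Counter()
    for word in corpus:
        pairs.update(zip(word, word[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(word: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace each occurrence of `pair` with a single merged symbol."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

# A tiny corpus, split into characters
corpus = [list("lowest"), list("lower"), list("low")]
for _ in range(3):
    pair = most_frequent_pair(corpus)
    corpus = [merge_pair(word, pair) for word in corpus]
print(corpus[0])  # ['lowe', 's', 't'] after three merges
```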
Modern tokenizers have vocabularies of 50,000–100,000 tokens. The vocabulary size determines how the model maps between text and numbers:
Each token gets a unique integer ID. The sentence "Hello world" might become just two integer IDs — one for "Hello" and one for " world" (with its leading space).
That’s it. Two numbers. That’s what the AI actually processes.
Let’s look at how different texts tokenize:
| Text | Tokens | Count |
|---|---|---|
| "Hello" | ["Hello"] | 1 |
| "hello" | ["hello"] | 1 |
| "HELLO" | ["HE", "LLO"] | 2 |
| "Unconstitutional" | ["Un", "const", "itutional"] | 3–4 |
| "AI" | ["AI"] | 1 |
| "artificial intelligence" | ["artificial", " intelligence"] | 2 |
| "🤖" | ["🤖"] | 1–2 |
Notice something? Capitalization matters. “HELLO” costs more tokens than “Hello” because all-caps words are less common in training data, so the tokenizer breaks them into smaller pieces.
This is a crucial insight for developers. Code has higher token density than English prose.
Here’s a simple Python function:
```python
def calculate_average(numbers: list[float]) -> float:
    if not numbers:
        raise ValueError("Cannot calculate average of empty list")
    return sum(numbers) / len(numbers)
```

This is only 4 lines of code, but it tokenizes into roughly 40–50 tokens. Why?

- Punctuation marks like `(`, `)`, `:`, `[`, and `]` are typically their own tokens
- `->` is typically its own token
- `calculate_average` might become `["calculate", "_average"]`
- `list[float]` is 4+ tokens

Let's compare token density:
```python
# Count tokens per line for different content types
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

english_paragraph = """
The quick brown fox jumps over the lazy dog. This sentence contains
every letter of the English alphabet. It has been used since at least
the late 19th century for testing typewriters and fonts.
"""

python_code = """
def fibonacci(n: int) -> list[int]:
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    fib = [0, 1]
    for i in range(2, n):
        fib.append(fib[i-1] + fib[i-2])
    return fib
"""

json_data = """
{
  "users": [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
    {"id": 3, "name": "Charlie", "email": "charlie@example.com"}
  ]
}
"""

for name, text in [("English", english_paragraph),
                   ("Python", python_code),
                   ("JSON", json_data)]:
    tokens = encoder.encode(text)
    words = len(text.split())
    chars = len(text)
    print(f"{name}:")
    print(f"  Characters: {chars}")
    print(f"  Words: {words}")
    print(f"  Tokens: {len(tokens)}")
    print(f"  Chars/Token: {chars/len(tokens):.1f}")
    print(f"  Tokens/Word: {len(tokens)/words:.1f}")
    print()
```

Typical results:
| Content Type | Chars/Token | Tokens/Word |
|---|---|---|
| English prose | 4.0–4.5 | 1.2–1.4 |
| Python code | 3.0–3.5 | 1.8–2.5 |
| JSON | 2.5–3.0 | 2.5–3.5 |
| XML/HTML | 2.0–2.5 | 3.0–4.0 |
JSON and XML are token-expensive. All those brackets, quotes, and colons add up. A 1KB JSON file uses roughly 1.5× more tokens than a 1KB English text file.
AI APIs charge per token. Here's the pricing math:

```
cost = (input_tokens / 1,000,000) × price_in + (output_tokens / 1,000,000) × price_out
```

Where `price_in` and `price_out` are the per-million-token prices.

For Claude Sonnet at $3/M input and $15/M output, a request with 10,000 input tokens and 2,000 output tokens costs:

```
(10,000 / 1,000,000) × $3 + (2,000 / 1,000,000) × $15 = $0.03 + $0.03 = $0.06
```
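That arithmetic is simple enough to wrap in a helper. The default prices here are the $3/$15-per-million Claude Sonnet figures; always check the provider's current price sheet:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in: float = 3.00, price_out: float = 15.00) -> float:
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

print(f"${request_cost(10_000, 2_000):.2f}")  # $0.06
```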
But here’s what trips people up: in a multi-turn conversation, input tokens accumulate. Every turn carries all previous context:
Total input tokens across 10 turns, at roughly 1,000 new tokens per turn:

```
1,000 × (1 + 2 + 3 + … + 10) = 1,000 × 55 = 55,000
```
That’s 55,000 tokens for what felt like 10,000 tokens of conversation.
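A quick sketch of that accumulation, assuming each turn adds roughly 1,000 tokens of new content (an illustrative figure, not a measured one):

```python
def cumulative_input_tokens(new_tokens_per_turn: int, turns: int) -> int:
    """Each turn resends all prior context, so inputs grow as 1 + 2 + ... + n."""
    return new_tokens_per_turn * turns * (turns + 1) // 2

print(cumulative_input_tokens(1_000, 10))  # 55000
```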
Not all models tokenize the same way; each model family trains its own tokenizer with its own vocabulary:
| Model Family | Tokenizer | Vocab Size |
|---|---|---|
| GPT-4 / GPT-4o | cl100k_base / o200k_base | 100K / 200K |
| Claude | Claude tokenizer | ~100K |
| Llama 3 | Llama tokenizer | 128K |
| Gemini | SentencePiece | ~256K |
The same text can produce different token counts across models. “Hello world” might be 2 tokens in GPT-4 but 2 or 3 in another model.
In Python, with `tiktoken`:

```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count the number of tokens in a text string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Example usage
text = "What is the meaning of life?"
print(f"Token count: {count_tokens(text)}")  # Output: 7
```

In JavaScript:

```javascript
// Using the js-tiktoken library
import { encodingForModel } from "js-tiktoken";

function countTokens(text, model = "gpt-4") {
  const encoding = encodingForModel(model);
  const tokens = encoding.encode(text);
  return tokens.length;
}

console.log(countTokens("What is the meaning of life?")); // 7
```

When an exact count isn't worth a dependency, a rough estimate works:

```python
def estimate_tokens(text: str) -> int:
    """Quick estimate: ~4 characters per token for English."""
    return len(text) // 4

def estimate_tokens_words(text: str) -> int:
    """Quick estimate: ~1.3 tokens per word for English."""
    return int(len(text.split()) * 1.3)
```

Beyond your text, models use special tokens that consume context space:
- `<|begin_of_text|>` — marks the start of input
- `<|end_of_text|>` — marks the end
- `<|start_header_id|>` — marks role transitions (system, user, assistant)

These special tokens are invisible to you but real in the token count. A simple "Hello" message might actually be 8–10 tokens after the framework adds role markers and formatting.
Here’s a surprising fact: tokenization is not fair across languages.
English text tokenizes efficiently because most tokenizers are trained predominantly on English text. Other languages — especially those with non-Latin scripts — use more tokens for the same meaning.
| Language | "Hello, how are you?" equivalent | Approx. Tokens |
|---|---|---|
| English | "Hello, how are you?" | 6 |
| Spanish | "Hola, ¿cómo estás?" | 7 |
| Chinese | "你好,你怎么样?" | 8–11 |
| Arabic | "مرحبا، كيف حالك؟" | 10–15 |
| Hindi | "नमस्ते, आप कैसे हैं?" | 15–20 |
This means non-English users effectively get a smaller context window and pay more per word — a significant equity issue in AI.
Understanding tokens is the foundation for understanding everything else about AI performance, cost, and context management.
ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai