Tokens Explained: Why Your AI Counts Words Differently Than You Do

AI doesn't read words — it reads tokens. Learn how tokenization works, why code costs more than prose, and how to calculate your token usage.

The Simple Version: What Is a Token?

Imagine you’re learning a new language. At first, you read letter by letter: C-A-T. But eventually, you learn to recognize whole words at a glance: “cat.” And then common phrases: “once upon a time.”

AI models do something similar — but instead of learning words, they learn tokens. A token is a chunk of text that the model has learned to recognize as a single unit. Sometimes a token is a whole word. Sometimes it’s part of a word. Sometimes it’s just a space or a punctuation mark.

Here’s the key insight: the AI never sees your text as words. It converts everything into tokens first, processes the tokens, then converts them back to text for you.

How Tokenization Works: BPE

The most common tokenization method is called Byte-Pair Encoding (BPE). Here’s how it works, step by step:

Step 1: Start with Individual Characters

Take the word “lowest”:

l, o, w, e, s, t

Step 2: Find the Most Common Pair

The algorithm looks at a huge training corpus and finds which pair of adjacent characters appears most often. Let’s say “es” is the most common pair.

Step 3: Merge That Pair into a New Token

Now “lowest” becomes:

l, o, w, es, t

Step 4: Repeat

Find the next most common pair. Maybe “lo” is common:

lo, w, es, t

Keep going. Eventually “low” becomes a token, then “lowest” might become a single token if it appears often enough.
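The four steps above can be sketched as a toy merge loop (illustrative only: the corpus, frequencies, and number of merges are made up, and real BPE implementations add byte-level fallbacks and many optimizations):

```python
from collections import Counter

def most_common_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of the pair into one symbol."""
    new_words = {}
    for word, freq in words.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_words[" ".join(out)] = freq
    return new_words

# Toy corpus: space-separated symbols -> frequency in the training data
words = {"l o w": 5, "l o w e s t": 2, "n e w e s t": 6}

for _ in range(4):
    pair = most_common_pair(words)
    words = merge_pair(words, pair)
    print(f"merged {pair}: {list(words)}")
```

After a few merges, frequent fragments like "west" emerge as single symbols, exactly as described above.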

The Math of Vocabulary Size

Modern tokenizers have vocabularies of 50,000–100,000 tokens. The vocabulary size $V$ determines how the model maps between text and numbers:

$$\text{Token ID} \in \{0, 1, 2, \ldots, V-1\}$$

Each token gets a unique integer ID. The sentence “Hello world” might become:

[15496, 995]

That’s it. Two numbers. That’s what the AI actually processes.
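At lookup time, a tokenizer is just a two-way table between strings and integers. A minimal sketch (the two-entry vocabulary and the IDs here follow the article's example and are purely illustrative; a real vocabulary holds ~100K entries):

```python
# Hypothetical two-entry vocabulary; real tokenizers store ~100K such entries
vocab = {"Hello": 15496, " world": 995}
inverse = {i: t for t, i in vocab.items()}

def encode(tokens):
    """Map token strings to their integer IDs."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Map integer IDs back to text."""
    return "".join(inverse[i] for i in ids)

ids = encode(["Hello", " world"])
print(ids)          # [15496, 995]
print(decode(ids))  # Hello world
```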

Token ≠ Word: The Examples

Let’s look at how different texts tokenize:

| Text | Tokens | Count |
|---|---|---|
| "Hello" | ["Hello"] | 1 |
| "hello" | ["hello"] | 1 |
| "HELLO" | ["HE", "LLO"] | 2 |
| "Unconstitutional" | ["Un", "const", "itutional"] | 3–4 |
| "AI" | ["AI"] | 1 |
| "artificial intelligence" | ["artificial", " intelligence"] | 2 |
| "🤖" | ["🤖"] | 1–2 |

Notice something? Capitalization matters. “HELLO” costs more tokens than “Hello” because all-caps words are less common in training data, so the tokenizer breaks them into smaller pieces.

Why Code Is More Expensive Than Prose

This is a crucial insight for developers. Code has higher token density than English prose: the same number of characters turns into more tokens.

Here’s a simple Python function:

def calculate_average(numbers: list[float]) -> float:
    if not numbers:
        raise ValueError("Cannot calculate average of empty list")
    return sum(numbers) / len(numbers)

This is only 4 lines of code, but it tokenizes into roughly 40–50 tokens. Why?

  1. Punctuation is expensive. Every (, ), :, [, ], -> is typically its own token
  2. Indentation costs tokens. Spaces and tabs are tokenized too
  3. CamelCase and snake_case split. calculate_average might become ["calculate", "_average"]
  4. Type hints add tokens. list[float] is 4+ tokens

Let’s compare token density:

# Count tokens per line for different content types
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

english_paragraph = """
The quick brown fox jumps over the lazy dog. This sentence contains
every letter of the English alphabet. It has been used since at least
the late 19th century for testing typewriters and fonts.
"""

python_code = """
def fibonacci(n: int) -> list[int]:
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    fib = [0, 1]
    for i in range(2, n):
        fib.append(fib[i-1] + fib[i-2])
    return fib
"""

json_data = """
{
    "users": [
        {"id": 1, "name": "Alice", "email": "alice@example.com"},
        {"id": 2, "name": "Bob", "email": "bob@example.com"},
        {"id": 3, "name": "Charlie", "email": "charlie@example.com"}
    ]
}
"""

for name, text in [("English", english_paragraph),
                    ("Python", python_code),
                    ("JSON", json_data)]:
    tokens = encoder.encode(text)
    words = len(text.split())
    chars = len(text)
    print(f"{name}:")
    print(f"  Characters: {chars}")
    print(f"  Words: {words}")
    print(f"  Tokens: {len(tokens)}")
    print(f"  Chars/Token: {chars/len(tokens):.1f}")
    print(f"  Tokens/Word: {len(tokens)/words:.1f}")
    print()

Typical results:

| Content Type | Chars/Token | Tokens/Word |
|---|---|---|
| English prose | 4.0–4.5 | 1.2–1.4 |
| Python code | 3.0–3.5 | 1.8–2.5 |
| JSON | 2.5–3.0 | 2.5–3.5 |
| XML/HTML | 2.0–2.5 | 3.0–4.0 |

JSON and XML are token-expensive. All those brackets, quotes, and colons add up. A 1KB JSON file uses roughly 1.5× as many tokens as a 1KB English text file.

The Token Economy: Why This Matters for Your Bill

AI APIs charge per token. Here’s the pricing math:

$$\text{Cost} = \frac{\text{Input Tokens}}{1{,}000{,}000} \times P_{\text{input}} + \frac{\text{Output Tokens}}{1{,}000{,}000} \times P_{\text{output}}$$

Where $P_{\text{input}}$ and $P_{\text{output}}$ are the per-million-token prices.

For Claude Sonnet at $3/M input and $15/M output:

A request with 10,000 input tokens and 2,000 output tokens costs:

$$\text{Cost} = \frac{10{,}000}{1{,}000{,}000} \times 3 + \frac{2{,}000}{1{,}000{,}000} \times 15 = \$0.03 + \$0.03 = \$0.06$$
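The pricing formula translates directly into code (the default prices here are the article's Claude Sonnet figures, expressed per million tokens):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in: float = 3.0, price_out: float = 15.0) -> float:
    """Dollar cost of one request; prices are per million tokens."""
    return (input_tokens / 1_000_000) * price_in \
         + (output_tokens / 1_000_000) * price_out

print(f"${request_cost(10_000, 2_000):.2f}")  # $0.06
```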

But here’s what trips people up: in a multi-turn conversation, input tokens accumulate. Every turn carries all previous context:

Assuming each turn adds roughly 1,000 tokens of new text, the total input tokens across 10 turns are:

$$\text{Total} = \sum_{k=1}^{10} k \times 1{,}000 = 1{,}000 \times \frac{10 \times 11}{2} = 55{,}000$$

That’s 55,000 tokens for what felt like 10,000 tokens of conversation.
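The accumulation is easy to verify numerically (assuming each turn adds roughly 1,000 tokens of new text, as in the formula above):

```python
def conversation_input_tokens(turns: int, tokens_per_turn: int = 1_000) -> int:
    """Each request resends the full history, so turn k carries k * tokens_per_turn
    input tokens; the total is the arithmetic series 1 + 2 + ... + turns."""
    return sum(k * tokens_per_turn for k in range(1, turns + 1))

print(conversation_input_tokens(10))  # 55000
```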

Different Models, Different Tokenizers

Not all models tokenize the same way. Here’s why:

  1. Different training data → Different token frequencies → Different merges in BPE
  2. Different vocabulary sizes → More tokens = finer granularity but larger embedding tables

| Model Family | Tokenizer | Vocab Size |
|---|---|---|
| GPT-4 / GPT-4o | cl100k_base / o200k_base | 100K / 200K |
| Claude | Claude tokenizer | ~100K |
| Llama 3 | Llama tokenizer | 128K |
| Gemini | SentencePiece | ~256K |

The same text can produce different token counts across models. “Hello world” might be 2 tokens in GPT-4 but 2 or 3 in another model.
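To see how different merge tables lead to different counts, here is a toy greedy longest-match tokenizer run against two hypothetical vocabularies (real tokenizers apply learned merge rules rather than greedy matching, so treat this purely as an intuition aid):

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary,
    falling back to single characters when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

vocab_a = {"Hello", " world"}          # learned larger merges
vocab_b = {"Hel", "lo", " wor", "ld"}  # different training data, smaller merges

print(greedy_tokenize("Hello world", vocab_a))  # ['Hello', ' world']
print(greedy_tokenize("Hello world", vocab_b))  # ['Hel', 'lo', ' wor', 'ld']
```

Same text, two tokens in one model, four in the other.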

How to Count Tokens: Practical Tools

Python (for OpenAI-compatible tokenizers)

import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count the number of tokens in a text string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Example usage
text = "What is the meaning of life?"
print(f"Token count: {count_tokens(text)}")  # Output: 7

JavaScript (for web applications)

// Using the js-tiktoken library
import { encodingForModel } from "js-tiktoken";

function countTokens(text, model = "gpt-4") {
  const encoding = encodingForModel(model);
  const tokens = encoding.encode(text);
  return tokens.length;
}

console.log(countTokens("What is the meaning of life?")); // 7

Quick Estimation (No Library Needed)

def estimate_tokens(text: str) -> int:
    """Quick estimate: ~4 characters per token for English."""
    return len(text) // 4

def estimate_tokens_words(text: str) -> int:
    """Quick estimate: ~1.3 tokens per word for English."""
    return int(len(text.split()) * 1.3)

Special Tokens: The Hidden Costs

Beyond your text, models use special tokens that consume context space: end-of-sequence markers, message delimiters, and the role labels (system, user, assistant) that chat frameworks wrap around every message.

These special tokens are invisible to you but real in the token count. A simple “Hello” message might actually be 8–10 tokens after the framework adds role markers and formatting.
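You can model this overhead with a rough estimator in the spirit of OpenAI's chat-format guidance (the per-message and reply-primer constants below are illustrative assumptions, not any model's documented values):

```python
def estimate_chat_tokens(messages: list[dict], per_message_overhead: int = 4,
                         reply_primer: int = 3) -> int:
    """Rough token estimate for a chat request: each message pays a few tokens
    of role/delimiter overhead on top of its content (~4 chars per token)."""
    total = reply_primer  # tokens that prime the assistant's reply
    for msg in messages:
        total += per_message_overhead + max(1, len(msg["content"]) // 4)
    return total

print(estimate_chat_tokens([{"role": "user", "content": "Hello"}]))  # 8
```

A one-word "Hello" message lands around 8 tokens once the framework's wrapping is counted, consistent with the figure above.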

The Tokenization Inequality

Here’s a surprising fact: tokenization is not fair across languages.

English text tokenizes efficiently because most tokenizers are trained predominantly on English text. Other languages — especially those with non-Latin scripts — use more tokens for the same meaning.

| Language | "Hello, how are you?" equivalent | Approx. Tokens |
|---|---|---|
| English | "Hello, how are you?" | 6 |
| Spanish | "Hola, ¿cómo estás?" | 7 |
| Chinese | "你好,你怎么样?" | 8–11 |
| Arabic | "مرحبا، كيف حالك؟" | 10–15 |
| Hindi | "नमस्ते, आप कैसे हैं?" | 15–20 |

This means non-English users effectively get a smaller context window and pay more per word — a significant equity issue in AI.
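Part of the gap is visible before any merges even apply: byte-level BPE falls back to raw UTF-8 bytes, and non-Latin scripts need more bytes per character (the byte counts below are exact; actual token counts still depend on the tokenizer):

```python
# UTF-8 byte counts: non-Latin scripts start at a disadvantage before any merges
samples = {
    "English": "Hello, how are you?",
    "Chinese": "你好,你怎么样?",
    "Hindi": "नमस्ते, आप कैसे हैं?",
}
for lang, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{lang}: {len(text)} chars -> {n_bytes} UTF-8 bytes")
```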

Key Takeaways

  1. Tokens ≠ words. One word can be 1–4+ tokens depending on complexity and language.
  2. Code costs more. Python uses ~2× more tokens per “word” than English prose. JSON and XML are even worse.
  3. Your bill grows quadratically in multi-turn conversations because input tokens accumulate.
  4. Different models count differently. Always use the right tokenizer for your model.
  5. Non-English text is more expensive. This is a fundamental property of BPE tokenization, not a pricing choice.

Understanding tokens is the foundation for understanding everything else about AI performance, cost, and context management.


ByteBell helps engineering teams solve exactly this problem. Instead of stuffing everything into the context window, ByteBell’s Smart Context Refresh retrieves only what matters — keeping your AI sharp, fast, and accurate. Learn more at bytebell.ai