How Tokenization Shapes What an LLM Can Learn

The Invisible Foundation

Before a large language model processes a single word, it processes tokens. Tokenization — the process of splitting text into chunks a model can work with — is among the most consequential and least-discussed decisions in LLM design. The way a tokenizer carves up language shapes what arithmetic the model can perform, which languages it handles gracefully, whether it can spell, and even how it reasons about code.

Most practitioners use a pre-trained tokenizer without much thought. But if you want to understand why GPT-3 can't reliably count characters, why some models handle Japanese better than others, or why coding models often tokenize differently from chat models, you need to understand how tokenization works and what it constrains.

What a Token Is (and Isn't)

A token is not a word. It is not a character. It is a unit determined by the tokenizer's training process — typically a subword unit that appears frequently enough in the training corpus to earn its own entry in the vocabulary.

In GPT-2's vocabulary (50,257 tokens), common English words like "the", "of", and "and" are single tokens. Less common words are split into multiple tokens: "tokenization" might become ["token", "ization"]. Text in scripts the tokenizer rarely saw during training fragments further: "Φωτιά" (Greek for fire) might become 7 or 8 tokens using a tokenizer trained primarily on English text.

Vocabulary size matters enormously. A vocabulary of 32,000 tokens (like Llama 2) fits in 16-bit token IDs. A vocabulary of 100,000+ tokens (like some multilingual models) needs wider IDs and, more importantly, larger embedding and output layers, since both grow in proportion to vocabulary size. In exchange, it can represent the same content with fewer tokens. The sweet spot depends on the target language distribution.
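To make the size trade-off concrete, here is a rough sketch of the arithmetic. The hidden size of 4096 is an assumed value chosen only for illustration, and the function names are hypothetical; real models differ in how they store IDs and whether input and output embeddings are shared.

```python
def bits_per_token_id(vocab_size: int) -> int:
    """Minimum number of bits needed to give every token a distinct ID."""
    return max(1, (vocab_size - 1).bit_length())


def embedding_parameters(vocab_size: int, d_model: int) -> int:
    """Parameter count of a single vocab_size x d_model embedding table."""
    return vocab_size * d_model


D_MODEL = 4096  # assumed hidden size, for illustration only

for vocab_size in (32_000, 50_257, 100_000):
    bits = bits_per_token_id(vocab_size)
    params = embedding_parameters(vocab_size, D_MODEL)
    print(f"vocab={vocab_size:>7}  id_bits={bits}  embedding_params={params:,}")
```

Under these assumptions, the ID width barely changes, while the embedding table roughly triples between a 32,000-entry and a 100,000-entry vocabulary.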

Byte-Pair Encoding: How Vocabularies Are Built

The most widely used tokenization algorithm is Byte-Pair Encoding (BPE), originally developed for data compression and adapted for NLP in 2016. The algorithm starts with individual bytes (or characters) and iteratively merges the most frequent adjacent pair until the desired vocabulary size is reached.

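Starting vocabulary: {h, e, l, o, w, r, d}

After merging "e" + "l" into "el": {h, e, l, o, w, r, d, el}

After merging "el" + "l" into "ell": {h, e, l, o, w, r, d, el, ell}

After merging "ell" + "o" into "ello": {h, e, l, o, w, r, d, el, ell, ello}

After merging "h" + "ello" into "hello": {h, e, l, o, w, r, d, el, ell, ello, hello}

Each step joins exactly two existing entries and adds one new entry; the earlier entries stay in the vocabulary.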

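The merge rules, in the order they were learned, constitute the tokenizer. Encoding new text means splitting it into base symbols and then applying the merges in that same order, earliest-learned first. This is not the same as greedily taking the longest vocabulary match at each position, which is how WordPiece encoders typically work. The Tag1 Feste LLM engine (documented at tag1.com/how-to/, Part 1) includes a detailed Rust implementation of BPE that demonstrates exactly how these merge rules work in practice.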
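The training and encoding steps can be sketched in a few lines of Python. This is a minimal character-level illustration, not the Feste implementation and not how production tokenizers are written: the toy corpus is made up, ties between equally frequent pairs are broken by first occurrence (an arbitrary choice), and there is no pre-tokenization or byte handling.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the joined symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word is a tuple of symbols, weighted by how often it occurs.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, count in corpus.items():
            new_corpus[merge_pair(symbols, best)] += count
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Split into characters, then apply merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, chosen so the learned merges are easy to follow.
toy_corpus = ["hello"] * 3 + ["cello", "world"]
merges = train_bpe(toy_corpus, num_merges=4)
print(merges)
print(encode("hello", merges), encode("cello", merges), encode("held", merges))
```

With this corpus the learned merges are e+l, el+l, ell+o, and h+ello, so "hello" encodes to a single token while "held" only benefits from the first merge. A different corpus would produce a different merge list, which is the whole point: the vocabulary reflects the training data.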

The Fertility Problem

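Tokenizer "fertility" measures how many tokens a given amount of text requires. It is usually reported as tokens per word; the figures here use the inverse view, characters per token, where a lower number means higher fertility. For English text, a well-trained BPE tokenizer typically produces about 4 characters per token. For Spanish, it's similar. For Arabic, Hebrew, or Chinese, the ratio is often much worse, around 1.5–2.5 characters per token, because these languages are underrepresented in training corpora, so their subwords don't appear frequently enough to earn their own vocabulary entries. Characters per token is only a rough cross-script comparison, since a single Chinese character typically carries more meaning than a single Latin letter.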
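Measuring this on your own text takes only a few lines. The sketch below accepts any `encode` callable; the UTF-8 byte tokenizer used here is a stand-in with no learned merges, not a real model tokenizer, so its numbers show only the starting point that byte-level BPE has to improve on. The function names are hypothetical.

```python
def chars_per_token(text, encode):
    """Average number of characters covered by each token."""
    tokens = encode(text)
    if not tokens:
        raise ValueError("encode() returned no tokens")
    return len(text) / len(tokens)


def utf8_byte_tokenizer(text):
    """Stand-in tokenizer: one token per UTF-8 byte, no merges learned."""
    return list(text.encode("utf-8"))


english = "fire"
greek = "\u03a6\u03c9\u03c4\u03b9\u03ac"  # "Φωτιά", written with escapes for portability

print(chars_per_token(english, utf8_byte_tokenizer))  # 4 characters over 4 bytes
print(chars_per_token(greek, utf8_byte_tokenizer))    # 5 characters over 10 bytes
```

Swapping in a real tokenizer's encode function in place of `utf8_byte_tokenizer` gives the ratio for that tokenizer on your actual content.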

High fertility has three downstream consequences: higher inference cost (more tokens = more compute), shorter effective context length (a context window that holds 4,000 English words might hold only 2,000 Arabic words), and worse model performance (the model is asked to do more with less information per token).

This is a concrete way that the choice of tokenizer encodes a choice about which languages get high-quality service. Multilingual models like mBERT, XLM-R, and the Llama 3 family use expanded vocabularies and more balanced training corpora to lower fertility across languages.

Tokenization and Arithmetic

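LLMs famously struggle with arithmetic. Part of the reason is tokenization. Consider: "1234" might be tokenized as a single token. "12345" might be tokenized as ["123", "45"]. "999 + 1" might tokenize as ["999", " +", " 1"]. Because the chunk boundaries shift with a number's length and its surrounding text, the same place value (ones, tens, hundreds) can land in a different spot inside a token each time. The model has no way to reliably "see" the individual digits that arithmetic requires unless they're separate tokens.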

Models trained heavily on code (like DeepSeek Coder) often tokenize numbers digit-by-digit for exactly this reason. Language-focused models may not. The practical takeaway: if you're building a system that requires arithmetic reasoning, examine how your tokenizer handles numbers.
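The difference between the two approaches can be illustrated with a pre-tokenization step. The sketch below is a simplified assumption about how such a step might look, using regular expressions; it is not taken from any particular model's tokenizer, and the three-digit chunk width is arbitrary.

```python
import re


def split_digits(text):
    """Pre-tokenize so that every ASCII digit becomes its own piece."""
    return re.findall(r"[0-9]|[^0-9]+", text)


def split_digit_chunks(text, width=3):
    """Contrast case: group digits left to right into chunks of up to `width`."""
    return re.findall(rf"[0-9]{{1,{width}}}|[^0-9]+", text)


print(split_digits("12345"))        # every digit is separate
print(split_digit_chunks("12345"))  # chunk boundaries depend on the number's length
print(split_digits("999 + 1"))
```

With digit-level splitting, the ones digit is always the last piece of a number. With chunking, "12345" and "1234" place their ones digit in chunks of different lengths, which is the alignment problem described above.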

Tokenization and Spelling

Ask GPT-3 to count the letters in "strawberry" and it will often get it wrong. Why? Because the word reaches the model as one token or a few multi-letter chunks, so the model never "sees" the individual letters during processing. It learns statistical patterns about what follows "strawberry" in text, not what letters compose it. For the same reason, LLMs struggle to reverse strings, identify the third letter of a word, or perform any character-level operation: these tasks require character-level information that token IDs do not expose directly.
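A toy example shows what the model actually receives. The two-entry vocabulary and its ID numbers below are invented for illustration; real tokenizers split the word differently and use different IDs.

```python
# Hypothetical vocabulary: piece -> token ID.
TOY_VOCAB = {"straw": 1001, "berry": 1002}
ID_TO_PIECE = {token_id: piece for piece, token_id in TOY_VOCAB.items()}


def encode(word):
    """Greedy longest-match lookup against the toy vocabulary."""
    ids = []
    pos = 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            piece = word[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {word[pos:]!r}")
    return ids


def count_letter(ids, letter):
    """Counting letters needs the ID -> text mapping, which the model never consults."""
    return "".join(ID_TO_PIECE[i] for i in ids).count(letter)


ids = encode("strawberry")
print(ids)                     # the model's input: two opaque integers
print(count_letter(ids, "r"))  # recoverable only by decoding back to text
```

Nothing in the integers themselves says how many times "r" appears; a model has to have memorized that association from training data.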

This limitation follows directly from subword tokenization; it is not a quirk of one model. It is also why models that tokenize at the character or byte level, or that use special tokenization schemes for letter-level tasks, are much better placed to handle character-level work.

SentencePiece and WordPiece

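BPE is not the only game in town. SentencePiece (used by T5, ALBERT, and many multilingual models) implements both BPE and Unigram language model tokenization in a single framework. It operates on raw text as a sequence of Unicode characters and treats whitespace as an ordinary symbol (escaped as "▁") instead of relying on language-specific pre-tokenization rules; an optional byte-fallback mode decomposes unknown characters into UTF-8 bytes. Because it does not assume that words are separated by spaces, it is more portable across writing systems.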
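The whitespace handling is what makes SentencePiece output reversible without language-specific rules. The sketch below imitates that convention in plain Python; it does not call the SentencePiece library, the function names are hypothetical, and the piece boundaries in the example are made up.

```python
WS = "\u2581"  # "▁", the meta symbol SentencePiece uses to mark whitespace


def escape_whitespace(text):
    """Mark spaces with the meta symbol, adding a leading marker as a dummy prefix."""
    return WS + text.replace(" ", WS)


def detokenize(pieces):
    """Join pieces, turn markers back into spaces, and drop the dummy prefix."""
    text = "".join(pieces).replace(WS, " ")
    return text[1:] if text.startswith(" ") else text


escaped = escape_whitespace("Hello World.")
print(escaped)

# A hypothetical segmentation of the escaped string into subword pieces.
pieces = [WS + "Hello", WS + "Wor", "ld", "."]
print(detokenize(pieces))
```

Because spaces survive as part of the pieces, detokenization is plain string concatenation, with no need to guess where spaces belong.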

WordPiece (used in BERT and early Google models) is similar to BPE but chooses merges with a likelihood-based criterion rather than raw pair frequency. In practice, the differences in downstream performance are modest for English but can be significant for morphologically complex languages.
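The two merge criteria can be compared side by side. The WordPiece score below follows a commonly cited formulation (pair count divided by the product of the two symbols' counts); Google's original training code was not published in full, so treat it as an approximation. All counts are invented for illustration.

```python
def bpe_score(pair, pair_counts, symbol_counts):
    """BPE: prefer the most frequent adjacent pair."""
    return pair_counts[pair]


def wordpiece_score(pair, pair_counts, symbol_counts):
    """WordPiece-style: prefer pairs that co-occur more than their parts predict."""
    first, second = pair
    return pair_counts[pair] / (symbol_counts[first] * symbol_counts[second])


# Hypothetical counts: "t" and "h" are common everywhere, "q" is almost always followed by "u".
symbol_counts = {"t": 1000, "h": 800, "q": 20, "u": 300}
pair_counts = {("t", "h"): 500, ("q", "u"): 19}

best_bpe = max(pair_counts, key=lambda p: bpe_score(p, pair_counts, symbol_counts))
best_wordpiece = max(pair_counts, key=lambda p: wordpiece_score(p, pair_counts, symbol_counts))
print(best_bpe, best_wordpiece)
```

With these counts BPE merges "t" + "h" first because it is the most frequent pair, while the likelihood-style score favors "q" + "u" because "q" rarely appears without "u".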

Practical Implications for Engineers

If you're building with LLMs, the tokenizer matters in several practical ways:

  • Cost: Token count = API cost. For multilingual use cases, measure the token-to-character ratio in your actual content. It may be much higher than you expect.
  • Context: A 128K context window's effective capacity depends heavily on language. Budget accordingly.
  • Prompt design: Some prompt patterns require token efficiency. Others need specific tokens to be atomic (not split). Understanding your tokenizer lets you optimize for both.
  • Special tokens: Every model has special tokens ([CLS], [SEP], <|endoftext|>, <|im_start|>, etc.) that affect how prompts are parsed. Know your tokenizer's special tokens before building production systems.
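The cost and context points above reduce to simple arithmetic once you have measured characters per token on your own content. The sketch below uses assumed values throughout: the 4.0 and 2.0 ratios echo the rough figures quoted earlier, the price is a placeholder, and "128K" is taken as 128,000 tokens. The function names are hypothetical.

```python
import math


def estimate_tokens(num_chars, chars_per_token):
    """Rough token count for a text of `num_chars` characters."""
    return math.ceil(num_chars / chars_per_token)


def estimate_cost(num_tokens, price_per_million_tokens):
    """Cost under simple per-token pricing."""
    return num_tokens * price_per_million_tokens / 1_000_000


def context_capacity_chars(context_tokens, chars_per_token, reserved_tokens=0):
    """Characters that fit after reserving room for the prompt and the reply."""
    return int((context_tokens - reserved_tokens) * chars_per_token)


PRICE = 1.00  # placeholder price per million tokens, not a real rate

for label, ratio in (("english-like", 4.0), ("high-fertility", 2.0)):
    tokens = estimate_tokens(10_000, ratio)
    cost = estimate_cost(tokens, PRICE)
    capacity = context_capacity_chars(128_000, ratio, reserved_tokens=8_000)
    print(f"{label}: {tokens} tokens, cost {cost:.4f}, fits about {capacity:,} characters")
```

Under these assumptions, halving the characters-per-token ratio doubles the cost of the same document and halves how much of it fits in the window.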

Further Reading

For a complete implementation walkthrough, see the Feste blog series at tag1.com/how-to/ (Part 1 covers tokenizer implementation in Rust). For the original SentencePiece paper, see Kudo & Richardson (2018). For multilingual tokenization analysis, see the XTREME benchmark paper and analyses of fertility across languages.