By the end of this lecture, you will be able to: (1) implement the BPE algorithm from first principles, (2) train a vocabulary on a text corpus, (3) encode and decode text using a trained tokenizer, and (4) explain why tokenization choices affect downstream model capabilities.
1. Introduction: Why Build Your Own?
Most practitioners never implement a tokenizer — they call from transformers import AutoTokenizer and move on. But implementing one teaches you things you can't get from using a pre-built tokenizer. You'll understand exactly what a "token" is, why GPT-4 tokenizes "strawberry" differently than "apple," and why some languages cost more tokens than others. You'll also understand the Feste approach (tag1.com/how-to/ Part 1), which is exactly the approach we follow here, implemented in Rust.
This lecture uses Python for clarity, but the concepts are language-agnostic.
2. The Corpus and Character-Level Vocabulary
Start with a text corpus — for this lecture, we'll use a small sample:
corpus = [
"low low low low low",
"lower lower",
"newest newest newest",
"widest widest",
]
Step 1: Count word frequencies with end-of-word markers (BPE treats word boundaries as meaningful):
from collections import Counter
def get_word_freqs(corpus):
word_freqs = Counter()
for text in corpus:
for word in text.split():
# Add end-of-word marker
word_freqs[" ".join(list(word)) + " "] += 1
return word_freqs
word_freqs = get_word_freqs(corpus)
# {"l o w ": 5, "l o w e r ": 2, "n e w e s t ": 3, "w i d e s t ": 2}
Step 2: Build the initial vocabulary (all unique characters):
def get_vocab(word_freqs):
vocab = set()
for word in word_freqs:
for char in word.split():
vocab.add(char)
return vocab
vocab = get_vocab(word_freqs)
# {'l', 'o', 'w', '', 'e', 'r', 'n', 's', 't', 'i', 'd'}
3. The Merge Algorithm
BPE iteratively merges the most frequent adjacent pair of tokens. Here's the implementation:
def get_stats(word_freqs):
"""Count frequency of all adjacent pairs."""
pairs = Counter()
for word, freq in word_freqs.items():
symbols = word.split()
for i in range(len(symbols) - 1):
pairs[(symbols[i], symbols[i+1])] += freq
return pairs
def merge_vocab(pair, word_freqs):
"""Merge all occurrences of a pair in the vocabulary."""
new_vocab = {}
bigram = " ".join(pair)
replacement = "".join(pair)
for word, freq in word_freqs.items():
new_word = word.replace(bigram, replacement)
new_vocab[new_word] = freq
return new_vocab
# Training loop
merges = {}
num_merges = 10
for i in range(num_merges):
pairs = get_stats(word_freqs)
if not pairs:
break
# Find the most frequent pair
best_pair = max(pairs, key=pairs.get)
# Merge it
word_freqs = merge_vocab(best_pair, word_freqs)
merges[best_pair] = i
print(f"Merge {i+1}: {best_pair[0]} + {best_pair[1]} → {''.join(best_pair)}")
Running this on our corpus produces merge rules like:
Merge 1: e + s → es (frequency: 5)
Merge 2: es + t → est (frequency: 5)
Merge 3: l + o → lo (frequency: 7)
Merge 4: lo + w → low (frequency: 7)
Merge 5: n + e → ne (frequency: 3)
Merge 6: ne + w → new (frequency: 3)
Merge 7: new + est → newest (frequency: 3)
4. Encoding New Text
To tokenize new text, apply the merge rules in order:
def tokenize(word, merges):
"""Apply learned merges to tokenize a word."""
# Start as individual characters
tokens = list(word) + [""]
# Apply merges in order
for (left, right), _ in sorted(merges.items(), key=lambda x: x[1]):
i = 0
while i len(tokens) - 1:
if tokens[i] == left and tokens[i+1] == right:
tokens = tokens[:i] + ["".join([left, right])] + tokens[i+2:]
else:
i += 1
return tokens
print(tokenize("lowest", merges))
# ['low', 'es', 't', ''] — "low" is known, but "lowest" isn't a learned merge
print(tokenize("newest", merges))
# ['newest'] — fully merged!
5. Vocabulary Size and Its Consequences
In production tokenizers, you run thousands of merge steps until you reach your target vocabulary size. GPT-2 uses 50,257 tokens; Llama 3 uses 128,000. Larger vocabulary = more tokens encoded as single units = shorter sequences = cheaper inference. But larger vocabulary also means the embedding matrix is larger (vocab_size × embedding_dim).
The tradeoff:
| Vocab Size | Avg Tokens/Word | Embedding Matrix (768-dim) |
|---|---|---|
| 8,000 | ~1.8 | 6.1M params |
| 32,000 | ~1.3 | 24.6M params |
| 128,000 | ~1.1 | 98.3M params |
6. Special Tokens
Every tokenizer has special tokens with predefined IDs that the model is trained to treat specially:
<|endoftext|>(GPT): marks end of a document[CLS],[SEP](BERT): classification token and separator<|im_start|>,<|im_end|>(ChatML format): structure conversational turns<unk>: unknown token (rare in BPE which covers all bytes)
In the Feste implementation (tag1.com/how-to/ Part 1), special token handling is explicit: the tokenizer distinguishes between "merge-generated tokens" and "special tokens" at the encoding level, which matters when implementing attention masking.
7. The Rust Perspective
The Tag1 Feste implementation of this algorithm in Rust is worth examining. The merge rules file is a flat text file: one merge rule per line, in order. The encoding function applies rules greedily using a hash map for O(1) lookup. The key insight from the Rust implementation: the tokenizer is not magic. It's a sequence of string substitution rules stored in a file and applied in order. That's it.
Implement the BPE tokenizer in Python. Train it on a small corpus of your choice (try 10 sentences from Wikipedia). Run 50 merge steps. Examine which pairs got merged first and explain why.
Take the trained tokenizer from Exercise 1 and compute the "fertility" (average characters per token) for English, French, and Japanese text samples of equal character length. What do you observe? Why?
Modify the BPE implementation to support byte-level tokenization (instead of character-level). This is what GPT-2's tokenizer uses. Hint: convert text to bytes first, then run BPE over byte sequences. What changes?
Key Takeaways
- BPE is an algorithm: iteratively merge the most frequent adjacent pair until target vocabulary size is reached
- The merge rule file is the tokenizer — encoding is just applying these rules in order
- Vocabulary size is a tradeoff between sequence length, model size, and multilingual coverage
- Tokenization shapes model capabilities: arithmetic, spelling, and multilingual performance are all affected