Lecture: Building a Tokenizer From Scratch

Building a tokenizer — code and text processing

Learning Objectives

By the end of this lecture, you will be able to: (1) implement the BPE algorithm from first principles, (2) train a vocabulary on a text corpus, (3) encode and decode text using a trained tokenizer, and (4) explain why tokenization choices affect downstream model capabilities.

1. Introduction: Why Build Your Own?

Most practitioners never implement a tokenizer — they call from transformers import AutoTokenizer and move on. But implementing one teaches you things you can't get from using a pre-built tokenizer. You'll understand exactly what a "token" is, why GPT-4 tokenizes "strawberry" differently than "apple," and why some languages cost more tokens than others. You'll also understand the Feste approach (tag1.com/how-to/ Part 1), which is exactly the approach we follow here, implemented in Rust.

This lecture uses Python for clarity, but the concepts are language-agnostic.

2. The Corpus and Character-Level Vocabulary

Start with a text corpus — for this lecture, we'll use a small sample:

corpus = [
"low low low low low",
"lower lower",
"newest newest newest",
"widest widest",
]

Step 1: Count word frequencies with end-of-word markers (BPE treats word boundaries as meaningful):

from collections import Counter

def get_word_freqs(corpus):
word_freqs = Counter()
for text in corpus:
    for word in text.split():
        # Add end-of-word marker
        word_freqs[" ".join(list(word)) + " "] += 1
return word_freqs

word_freqs = get_word_freqs(corpus)
# {"l o w ": 5, "l o w e r ": 2, "n e w e s t ": 3, "w i d e s t ": 2}

Step 2: Build the initial vocabulary (all unique characters):

def get_vocab(word_freqs):
vocab = set()
for word in word_freqs:
    for char in word.split():
        vocab.add(char)
return vocab

vocab = get_vocab(word_freqs)
# {'l', 'o', 'w', '', 'e', 'r', 'n', 's', 't', 'i', 'd'}

3. The Merge Algorithm

BPE iteratively merges the most frequent adjacent pair of tokens. Here's the implementation:

def get_stats(word_freqs):
"""Count frequency of all adjacent pairs."""
pairs = Counter()
for word, freq in word_freqs.items():
    symbols = word.split()
    for i in range(len(symbols) - 1):
        pairs[(symbols[i], symbols[i+1])] += freq
return pairs

def merge_vocab(pair, word_freqs):
"""Merge all occurrences of a pair in the vocabulary."""
new_vocab = {}
bigram = " ".join(pair)
replacement = "".join(pair)
for word, freq in word_freqs.items():
    new_word = word.replace(bigram, replacement)
    new_vocab[new_word] = freq
return new_vocab

# Training loop
merges = {}
num_merges = 10

for i in range(num_merges):
pairs = get_stats(word_freqs)
if not pairs:
    break
# Find the most frequent pair
best_pair = max(pairs, key=pairs.get)
# Merge it
word_freqs = merge_vocab(best_pair, word_freqs)
merges[best_pair] = i
print(f"Merge {i+1}: {best_pair[0]} + {best_pair[1]} → {''.join(best_pair)}")

Running this on our corpus produces merge rules like:

Merge 1: e + s → es     (frequency: 5)
Merge 2: es + t → est    (frequency: 5)
Merge 3: l + o → lo      (frequency: 7)
Merge 4: lo + w → low    (frequency: 7)
Merge 5: n + e → ne      (frequency: 3)
Merge 6: ne + w → new    (frequency: 3)
Merge 7: new + est → newest (frequency: 3)

4. Encoding New Text

To tokenize new text, apply the merge rules in order:

def tokenize(word, merges):
"""Apply learned merges to tokenize a word."""
# Start as individual characters
tokens = list(word) + [""]

# Apply merges in order
for (left, right), _ in sorted(merges.items(), key=lambda x: x[1]):
    i = 0
    while i  len(tokens) - 1:
        if tokens[i] == left and tokens[i+1] == right:
            tokens = tokens[:i] + ["".join([left, right])] + tokens[i+2:]
        else:
            i += 1
return tokens

print(tokenize("lowest", merges))
# ['low', 'es', 't', '']  — "low" is known, but "lowest" isn't a learned merge

print(tokenize("newest", merges))
# ['newest']  — fully merged!

5. Vocabulary Size and Its Consequences

In production tokenizers, you run thousands of merge steps until you reach your target vocabulary size. GPT-2 uses 50,257 tokens; Llama 3 uses 128,000. Larger vocabulary = more tokens encoded as single units = shorter sequences = cheaper inference. But larger vocabulary also means the embedding matrix is larger (vocab_size × embedding_dim).

The tradeoff:

Vocab Size	Avg Tokens/Word	Embedding Matrix (768-dim)
8,000	~1.8	6.1M params
32,000	~1.3	24.6M params
128,000	~1.1	98.3M params

6. Special Tokens

Every tokenizer has special tokens with predefined IDs that the model is trained to treat specially:

<|endoftext|> (GPT): marks end of a document
[CLS], [SEP] (BERT): classification token and separator
<|im_start|>, <|im_end|> (ChatML format): structure conversational turns
<unk>: unknown token (rare in BPE which covers all bytes)

In the Feste implementation (tag1.com/how-to/ Part 1), special token handling is explicit: the tokenizer distinguishes between "merge-generated tokens" and "special tokens" at the encoding level, which matters when implementing attention masking.

7. The Rust Perspective

The Tag1 Feste implementation of this algorithm in Rust is worth examining. The merge rules file is a flat text file: one merge rule per line, in order. The encoding function applies rules greedily using a hash map for O(1) lookup. The key insight from the Rust implementation: the tokenizer is not magic. It's a sequence of string substitution rules stored in a file and applied in order. That's it.

Exercise 1

Implement the BPE tokenizer in Python. Train it on a small corpus of your choice (try 10 sentences from Wikipedia). Run 50 merge steps. Examine which pairs got merged first and explain why.

Exercise 2

Take the trained tokenizer from Exercise 1 and compute the "fertility" (average characters per token) for English, French, and Japanese text samples of equal character length. What do you observe? Why?

Exercise 3 (Advanced)

Modify the BPE implementation to support byte-level tokenization (instead of character-level). This is what GPT-2's tokenizer uses. Hint: convert text to bytes first, then run BPE over byte sequences. What changes?

Key Takeaways

BPE is an algorithm: iteratively merge the most frequent adjacent pair until target vocabulary size is reached
The merge rule file is the tokenizer — encoding is just applying these rules in order
Vocabulary size is a tradeoff between sequence length, model size, and multilingual coverage
Tokenization shapes model capabilities: arithmetic, spelling, and multilingual performance are all affected