Introduction
When we talk about building a language model "from scratch," the very first challenge is deceptively simple: computers work with numbers, not words. Before a transformer can do anything useful, every piece of text must be converted into a sequence of integers. This conversion process — tokenization — is not just a preprocessing step. It shapes the model's vocabulary, its memory requirements, and ultimately what it can learn.
This lesson covers why tokenization is necessary, what the alternatives look like, and why Byte Pair Encoding (BPE) — the strategy used by GPT-2 and Feste — strikes the right balance for a practical language model.
What We're Building: Feste
The Feste project is a GPT-2 style transformer trained entirely on Shakespeare's complete works, implemented from scratch in Rust with no external ML libraries. The name comes from Feste, the witty fool in Twelfth Night — an apt description for a small model that generates plausible-sounding but often absurd Shakespearean text.
The complete source is at github.com/jeremyandrews/feste. Throughout this course we'll trace every component from first principles: tokenizer → tensor operations → model architecture → training → experiments.
Why We Can't Feed Raw Text to a Neural Network
Neural networks operate on tensors — multidimensional arrays of floating-point numbers. "The quick brown fox" is not a tensor. To make it one, we need a mapping from text units to integers, and then from integers to embedding vectors.
The question is: what unit of text should we tokenize? The three main options are:
- Character-level: Each character is one token. Vocabulary is tiny (~100 entries for ASCII), but sequences become very long. "Shakespeare" is 11 tokens. The model must learn to combine characters into words at every layer.
- Word-level: Each word is one token. Vocabulary explodes — English has 170,000+ words, and rare words never seen in training get no representation at all. "uncharacteristically" and "uncharacteristically." are different tokens.
- Subword (BPE, WordPiece, etc.): Frequently occurring character sequences are merged into single tokens. Common words like "the" and "and" become single tokens; rare words are split into recognizable pieces. "tokenization" might become ["token", "ization"].
Why BPE?
Byte Pair Encoding was originally developed for data compression. Its core insight: find the most frequently co-occurring pair of symbols in a corpus and merge them into a new symbol. Repeat until you reach your target vocabulary size.
For language models, BPE has three key advantages:
- Open vocabulary: Unknown words at inference time can still be represented as sequences of known subword tokens. A model trained without "CUDA" in its vocabulary can still tokenize it as ["CU", "DA"] or similar.
- Compression efficiency: Common words become single tokens, so a typical sentence uses far fewer tokens than character-level encoding. This means the model attends over shorter sequences for the same amount of text.
- Morphological awareness: Related words share token prefixes. "train", "trainer", "training" all start with "train", so the model sees the relationship encoded in the tokenization itself.
GPT-2 uses a BPE vocabulary of 50,257 tokens. Feste uses a smaller vocabulary (configurable, typically 8,000–16,000) appropriate for its training corpus.
Vocabulary Size Trade-offs
Choosing a vocabulary size involves a genuine trade-off:
- A smaller vocabulary means more tokens per sentence (longer sequences), cheaper embedding tables, but the model must work harder to understand word structure.
- A larger vocabulary means fewer tokens per sentence (shorter sequences, less memory in attention), but the embedding table and output projection matrix become massive.
For Feste's Shakespeare corpus (~1MB of text), a vocabulary of 8,000–10,000 tokens provides good coverage without over-fitting the vocabulary to the training domain.
Key Takeaways
- Tokenization bridges raw text and neural network math; without it, LLMs cannot process language.
- Character-level tokenization has tiny vocabulary but very long sequences; word-level has short sequences but a vocabulary explosion and no handling of unknown words.
- BPE (Byte Pair Encoding) finds a pragmatic middle ground: subword units that compress common patterns while remaining extensible to unseen text.
- Vocabulary size is a hyperparameter with real memory and sequence-length trade-offs.
Adapted from Part 1 of "Building an LLM From Scratch in Rust" by Jeremy Andrews, published on tag1.com