LLM-402: Tokenization, Vocabulary, and Encoding Schemes

Course Description

Tokenization is the first, and often the most overlooked, design decision in building a language model. This course examines how tokenization shapes what an LLM can learn, how reliably it performs arithmetic, and how well it handles multilingual text. Students implement byte-pair encoding (BPE) from scratch, analyze tokenizer behavior across languages, and trace the direct connection between vocabulary choice and model capabilities.

Key Topics

BPE algorithm: merge rules, vocabulary construction, encoding and decoding
WordPiece and SentencePiece: differences and use cases
Tokenization in Llama, GPT-4, and multilingual models
Tokenizer fertility: average tokens per word, and how tokenizer choice affects different languages
The Feste tokenizer (Rust implementation): a walkthrough of production-quality BPE
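
To give a flavor of the first topic, here is a minimal character-level BPE sketch in Python covering merge-rule learning, encoding, and decoding. It is an illustration only, not the Feste implementation or any production tokenizer: the function names (train_bpe, merge_pair, encode, decode, fertility), the toy corpus, and the tie-breaking rule are all assumptions made for this example. Fertility is taken here to mean average tokens per word, and the sketch omits byte-level fallback, pre-tokenization, and end-of-word markers.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each distinct word starts as a tuple of characters, weighted by its frequency.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically smallest pair
        # (an arbitrary choice made here so the result is deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges):
    """Tokenize one word by applying the merge rules in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def decode(tokens):
    """Invert encode: in this character-level sketch, tokens simply concatenate."""
    return "".join(tokens)


def fertility(words, merges):
    """Average number of tokens per word (the definition assumed in this sketch)."""
    return sum(len(encode(w, merges)) for w in words) / len(words)


if __name__ == "__main__":
    # Toy corpus, invented for illustration.
    corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
    merges = train_bpe(corpus, num_merges=4)
    print(merges)                               # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
    print(encode("lowest", merges))             # ['low', 'est']
    print(decode(encode("lowest", merges)))     # lowest
    print(fertility(["low", "lowest"], merges)) # 1.5
```

Applying merges in learned order keeps encoding simple at the cost of one pass over the word per rule. Real tokenizers typically use a rank lookup and a priority queue instead, which is one of the things a production-quality walkthrough would be expected to cover.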