Course Description
Tokenization is the first, and often the most overlooked, decision in building a language model. This course examines how tokenization shapes what an LLM can learn, what arithmetic it can perform, and how well it handles multilingual text. Students implement BPE from scratch, analyze tokenizer behavior across languages, and trace the direct connection between vocabulary choice and model capabilities.
Key Topics
- BPE algorithm: merge rules, vocabulary construction, encoding and decoding
- WordPiece and SentencePiece: differences and use cases
- Tokenization in Llama, GPT-4, and multilingual models
- Tokenizer fertility: the average number of tokens per word, and how tokenizer choice changes that cost across languages
- The Feste tokenizer (Rust implementation): a walkthrough of production-quality BPE
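To give a feel for the first and fourth topics, here is a minimal sketch of BPE training, encoding, decoding, and a fertility measurement. It is illustrative only and is not the Feste tokenizer's implementation. The function names (`train_bpe`, `merge_pair`, `encode`, `decode`, `fertility`) are hypothetical, and the sketch assumes character-level symbols and whitespace-separated words, whereas production tokenizers typically operate on bytes with more careful pre-tokenization.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a tuple of characters, weighted by its frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge rule to every word.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word, merges):
    """Tokenize one word by applying the merge rules in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def decode(tokens):
    """Invert `encode`: concatenating the tokens recovers the original word."""
    return "".join(tokens)


def fertility(text, merges):
    """Average number of tokens per whitespace-separated word (0.0 for empty text)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(encode(w, merges)) for w in words) / len(words)


# Toy usage: "ab" is frequent, so ("a", "b") is learned first.
merges = train_bpe(["ab", "ab", "ab", "cd"], num_merges=2)
print(merges)
print(encode("abc", merges))
print(fertility("ab cd xyz", merges))
```

Applying merges in learned order keeps the encoder short and easy to reason about. A production encoder would usually use a priority queue or a rank lookup instead of rescanning the word once per rule. Fertility computed this way rises for text whose character sequences were rare in the training corpus, which is the effect the multilingual analysis in this course measures.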