Course Description
This foundational course covers the transformer architecture in depth, from mathematical first principles through efficient implementation. Students gain the understanding needed to read and extend modern LLM codebases, not just use them. In weekly labs, students implement key components from scratch in PyTorch.
Learning Objectives
- Derive scaled dot-product attention from information-retrieval first principles
- Implement multi-head attention, residual connections, and layer normalization
- Analyze the computational complexity of attention and its practical implications
- Compare architectural variants: encoder-only (BERT), decoder-only (GPT), encoder-decoder (T5)
- Explain how FlashAttention, grouped-query attention, and related memory-efficient techniques reduce the memory cost of attention
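
As a preview of the first objectives, scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, can be sketched in a few lines. The version below is a minimal illustration in dependency-free Python, not course material: the function names and toy inputs are assumptions made for this example, and the labs would use PyTorch tensors in place of nested lists.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Hypothetical helper: Q is n x d_k, K is m x d_k, V is m x d_v (lists of lists)."""
    d_k = len(Q[0])
    d_v = len(V[0])
    scale = 1.0 / math.sqrt(d_k)
    output, weights = [], []
    for q in Q:
        # Score this query against every key (the "retrieval" step).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # Normalize scores into weights that sum to 1.
        w = softmax(scores)
        weights.append(w)
        # Return the weighted average of the value rows.
        output.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(d_v)])
    return output, weights


# Toy inputs (illustrative only): 2 queries, 3 keys/values, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Every query is scored against every key, so the number of scores is n times m. In self-attention, where queries and keys come from the same sequence, that cost grows quadratically with sequence length, which is the starting point for the complexity analysis and the memory-efficient variants listed above.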