LLM-401: Transformer Architecture and Attention Mechanisms

Course Description

This foundational course covers the transformer architecture in depth, from mathematical first principles through efficient implementation. Students gain the understanding needed to read and extend modern LLM codebases, not just use them. Weekly labs implement key components from scratch in PyTorch.

Learning Objectives

Derive scaled dot-product attention from information-retrieval first principles
Implement multi-head attention, residual connections, and layer normalization
Analyze the computational complexity of attention and its practical implications
Compare architectural variants: encoder-only (BERT), decoder-only (GPT), encoder-decoder (T5)
Explain FlashAttention, grouped-query attention, and other memory-efficient attention implementations
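
As a rough illustration of the first and third objectives, the sketch below computes scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, using only the Python standard library. It is not course lab code: the labs use PyTorch, and the function names here (softmax, scaled_dot_product_attention) are chosen for this example. It assumes plain lists of lists as matrices, with no masking, batching, or multiple heads.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: n x d_k queries, K: m x d_k keys, V: m x d_v values (lists of lists).

    Returns an n x d_v list of lists. Illustrative only: no masking or batching.
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(V[0])
    output = []
    for q in Q:
        # One score per key: n * m dot products in total, which is the
        # source of the quadratic cost when n == m (self-attention).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        )
    return output


if __name__ == "__main__":
    Q = [[1.0, 0.0]]
    K = [[1.0, 0.0], [1.0, 0.0]]  # identical keys, so the weights are equal
    V = [[1.0, 2.0], [3.0, 4.0]]
    print(scaled_dot_product_attention(Q, K, V))
```

With two identical keys the attention weights are equal, so the output is the plain average of the two value vectors. The nested loop over queries and keys makes the n x m score computation explicit, which is the term that dominates the cost for long sequences.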