Course Description
Alignment aims to turn a capable language model into one that is helpful, harmless, and honest. This course examines the technical mechanisms behind modern alignment: reinforcement learning from human feedback (RLHF), its practical challenges, and newer approaches such as direct preference optimization (DPO), which aim for similar results more efficiently by training directly on preference data without a separate reward model. Students implement a full RLHF pipeline on a small model and analyze alignment failures through case studies.