Course Description
Technical methods for aligning language models with human values. The RLHF (reinforcement learning from human feedback) pipeline: supervised fine-tuning (SFT), reward model training, and policy optimization with PPO. Direct Preference Optimization (DPO): derivation from the RLHF objective, advantages, and limitations. Constitutional AI: Anthropic's approach to scalable oversight via reinforcement learning from AI feedback (RLAIF). Reward hacking: examples and detection methods. Scalable oversight: debate and iterated amplification. Students run a DPO fine-tuning experiment and analyze alignment failures in public models.
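
As a rough illustration of the objective behind the DPO experiment, the per-example loss can be sketched in plain Python. This is a minimal sketch, not the course's reference implementation: the function names, argument names, and the default beta = 0.1 are assumptions made for illustration, and the inputs are assumed to be summed sequence log-probabilities of the chosen and rejected responses under the trained policy and a frozen reference model.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    All names and the default beta are illustrative assumptions.
    """
    # Log-ratio of policy to reference for each response
    # (the implicit reward, up to the factor beta).
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin by which the policy prefers the chosen response
    # more strongly than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy matches the reference model, the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability toward the chosen response relative to the reference.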