Program Overview
The Certificate in AI Safety & Alignment provides rigorous technical training in one of the most important challenges of our era: ensuring that increasingly capable AI systems behave as intended and remain beneficial as they scale. The program covers both near-term technical alignment work (RLHF, Constitutional AI, interpretability) and longer-horizon considerations.
Curriculum Highlights
- Alignment Fundamentals: The alignment problem, goal misgeneralization, deceptive alignment, reward hacking
- Technical Approaches: RLHF, direct preference optimization (DPO), Constitutional AI (Anthropic), reinforcement learning from AI feedback (RLAIF), debate, iterated amplification
- Interpretability: Mechanistic interpretability, probing classifiers, activation patching, circuit analysis
- Evaluation & Red-Teaming: Safety benchmarks, automated red-teaming, adversarial prompting, model evaluation frameworks
- Governance Interface: How technical alignment connects to policy, standards, and deployment decisions
Sample Courses
- SAF-101: Introduction to AI Safety and Alignment
- SAF-201: Technical Alignment: RLHF, DPO, and Constitutional AI
- SAF-210: Interpretability Methods and Tools
- SAF-220: Red-Teaming and Safety Evaluation
- SAF-290: Capstone: Alignment Research Project