Certificate in AI Safety & Alignment

Program Overview

The Certificate in AI Safety & Alignment provides rigorous technical training in ensuring that increasingly capable AI systems behave as intended and remain beneficial as they scale. The program covers both near-term technical alignment work (reinforcement learning from human feedback (RLHF), Constitutional AI, interpretability) and longer-horizon considerations.

Curriculum Highlights

Alignment Fundamentals: The alignment problem, goal misgeneralization, deceptive alignment, reward hacking
Technical Approaches: RLHF, direct preference optimization (DPO), Constitutional AI (Anthropic), reinforcement learning from AI feedback (RLAIF), debate, iterated amplification
Interpretability: Mechanistic interpretability, probing classifiers, activation patching, circuit analysis
Evaluation & Red-Teaming: Safety benchmarks, automated red-teaming, adversarial prompting, model evaluation frameworks
Governance Interface: How technical alignment connects to policy, standards, and deployment decisions

Sample Courses

SAF-101: Introduction to AI Safety and Alignment
SAF-201: Technical Alignment: RLHF, DPO, and Constitutional AI
SAF-210: Interpretability Methods and Tools
SAF-220: Red-Teaming and Safety Evaluation
SAF-290: Capstone: Alignment Research Project