Certificate in AI Safety & Alignment

Program Overview

The Certificate in AI Safety & Alignment provides rigorous technical training in one of the most important challenges of our era: ensuring that increasingly capable AI systems behave as intended and remain beneficial as they scale. The program covers near-term technical alignment work (reinforcement learning from human feedback, Constitutional AI, interpretability) as well as longer-horizon considerations.

Curriculum Highlights

  • Alignment Fundamentals: The alignment problem, goal misgeneralization, deceptive alignment, reward hacking
  • Technical Approaches: RLHF, direct preference optimization (DPO), Constitutional AI (Anthropic), reinforcement learning from AI feedback (RLAIF), debate, iterated amplification
  • Interpretability: Mechanistic interpretability, probing classifiers, activation patching, circuit analysis
  • Evaluation & Red-Teaming: Safety benchmarks, automated red-teaming, adversarial prompting, model evaluation frameworks
  • Governance Interface: How technical alignment connects to policy, standards, and deployment decisions

Sample Courses

  • SAF-101: Introduction to AI Safety and Alignment
  • SAF-201: Technical Alignment: RLHF, DPO, and Constitutional AI
  • SAF-210: Interpretability Methods and Tools
  • SAF-220: Red-Teaming and Safety Evaluation
  • SAF-290: Capstone: Alignment Research Project