RLHF vs DPO: Aligning Language Models with Human Preferences

The Alignment Problem

A pretrained language model is a powerful but wild thing: it will complete any text, whether the completion is helpful, harmful, or nonsensical. To make it useful, you have to align it — train it to be helpful and to avoid harmful outputs. The two dominant methods for doing this are Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). Understanding both — when each works, why they differ, and what they share — is essential for anyone building or evaluating production AI systems.

RLHF: The Three-Stage Pipeline

RLHF consists of three stages:

  1. Supervised Fine-Tuning (SFT): Train the model on a curated set of (prompt, high-quality response) pairs to establish a baseline of good behavior.
  2. Reward Model Training: Collect human preference data by showing labelers two model outputs for the same prompt and asking which is better. Then train a separate reward model that assigns a scalar score to each (prompt, response) pair, so that preferred responses score higher.
  3. RL Training: Use the reward model as a reward signal to fine-tune the SFT model via reinforcement learning (typically PPO). The policy (language model) learns to generate outputs that the reward model scores highly, while a KL divergence penalty prevents it from drifting too far from the SFT model.
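Stage 2 can be made concrete with a small sketch. The text above does not name a loss, so this assumes the common Bradley-Terry formulation, in which the probability that one response beats another is the sigmoid of the difference in their reward scores. The function names are illustrative, and plain floats stand in for the outputs of a real reward model.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return math.exp(-softplus(reward_rejected - reward_chosen))


def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human label under the Bradley-Terry model.

    Equals -log(sigmoid(reward_chosen - reward_rejected)).
    """
    return softplus(reward_rejected - reward_chosen)
```

When both responses receive the same score the loss is log 2 (the model is undecided); it shrinks as the reward model scores the human-preferred response higher than the rejected one.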

RLHF's appeal is that it trains directly on human preferences, not on human demonstrations. This allows the model to discover outputs better than anything in the training data — a model trained with RLHF can theoretically learn to write better than its best training examples if the reward model's judgments are accurate.

RLHF's Problems

Despite its successes (ChatGPT, Claude), RLHF has significant practical challenges:

  • Reward hacking: The policy learns to game the reward model rather than truly satisfy human preferences. Sycophancy, meaning agreeing with the user regardless of accuracy, is a classic example: if the reward model has learned to favor confident, agreeable responses, the policy learns to produce them.
  • Instability: PPO is finicky. Hyperparameter sensitivity is high; the KL coefficient must be tuned carefully; training can collapse or diverge.
  • Cost: Three separate training stages, plus the overhead of running the reward model during PPO training, make RLHF expensive compared to supervised learning.
  • Reward model distribution shift: As the policy drifts during training, the reward model is increasingly asked to score outputs unlike those it was trained on, so its scores become less reliable in the very region the policy is moving into.
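To see why the KL coefficient matters for both reward hacking and stability, here is a minimal sketch of the KL-shaped reward used in PPO-style RLHF. It assumes a sequence-level formulation in which the penalty is the summed per-token log-probability gap between the policy and the SFT reference; the function name and the numbers in the usage note are hypothetical.

```python
from typing import Sequence


def kl_shaped_reward(
    reward_score: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    kl_coef: float,
) -> float:
    """Reward-model score minus a KL penalty toward the reference model.

    policy_logprobs / ref_logprobs hold per-token log-probabilities of the
    same sampled response under the policy and the reference model.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-probability sequences must have the same length")
    # Single-sample estimate of KL(policy || reference) for this response.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward_score - kl_coef * kl_estimate
```

For a response the reward model scores at 2.0 but that the policy finds far more likely than the reference does (a summed log-probability gap of about 3.0), a coefficient of 0.1 leaves a shaped reward of about 1.7, while a coefficient of 1.0 drives it to about -1.0. Too small a coefficient lets the policy chase reward-model quirks; too large a coefficient keeps it pinned to the SFT model.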

DPO: Solving the Preference Problem Directly

Direct Preference Optimization (Rafailov et al., 2023) aims to match the alignment quality of RLHF without the RL stage. The key insight: under the KL-regularized RLHF objective, there is a closed-form relationship between the reward function and the optimal policy. Given that relationship, you can optimize the policy directly on the preference data, with no need to train an explicit reward model or run PPO.
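The closed-form relationship can be written out in a few lines. This is a sketch of the standard argument, assuming a Bradley-Terry preference model; here β is the KL coefficient, π_ref is the reference (SFT) policy, y_w and y_l are the preferred and dispreferred responses to prompt x, and Z(x) is a normalizing constant that depends only on the prompt.

```latex
% KL-regularized RLHF objective
\max_{\pi}\; \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Its optimum
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big)

% Solving for the reward
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)

% Bradley-Terry preference probability: Z(x) cancels in the difference
P(y_w \succ y_l \mid x) \;=\; \sigma\big( r(x, y_w) - r(x, y_l) \big)
  \;=\; \sigma\!\Big( \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                    \;-\; \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big)
```

Because the intractable Z(x) drops out, the preference probability depends only on policy and reference log-probabilities, so maximizing its log-likelihood over the preference data trains the policy directly.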

The DPO loss for a preference pair (chosen, rejected) is:

L_DPO = -log sigmoid( β [ log(π_θ(y_w|x) / π_ref(y_w|x)) - log(π_θ(y_l|x) / π_ref(y_l|x)) ] )

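Here π_θ is the policy being trained, π_ref is the frozen reference model (usually the SFT model), y_w is the preferred ("won") response, y_l is the dispreferred ("lost") response, and β controls how strongly the policy is kept close to the reference, playing the role of the KL penalty coefficient in RLHF. The loss is averaged over the preference dataset. In form, this is a binary classification loss on preference pairs, with no sampling or RL loop required.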
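As a minimal sketch, the loss for a single preference pair can be computed from four sequence-level log-probabilities. Plain floats stand in for model outputs here; in practice each value would be the summed token log-probabilities of a response under the policy or the reference model. The function names and the default β of 0.1 are assumptions made for illustration.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is log pi(y|x) for a whole response under the policy
    (pi_theta) or the frozen reference model (pi_ref).
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy equals the reference, both log-ratios are zero and the loss is log 2. It falls as the policy raises the likelihood of the chosen response relative to the reference, lowers that of the rejected one, or both.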

DPO's Tradeoffs

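DPO is simpler and typically more stable than PPO-based RLHF. It trains in one stage (assuming SFT is already done) and needs only two models at training time: the policy being trained and a frozen reference model, whose log-probabilities can be precomputed. PPO-based RLHF typically also holds a reward model and a value model. DPO has fewer hyperparameters to tune, with β the main one, and it has become a common default for open-source model fine-tuning.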

Where DPO falls short: in its standard offline form it never samples from the policy during training, so it gets no feedback on novel generations. RLHF's reward model can score whatever the policy produces; DPO learns only from the fixed preference pairs you provide. This makes DPO better suited to improving an already-reasonable model than to transforming a poor one into a great one.

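Several DPO variants address specific limitations: IPO (Identity Preference Optimization) to reduce overfitting to the preference data; KTO (Kahneman-Tversky Optimization) for unpaired feedback, where each response is labeled only as desirable or undesirable; SimPO (Simple Preference Optimization) for reference-free training with a length-normalized reward. The alignment field is moving fast; expect new variants regularly.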
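To show how two of these variants change the objective, here is a hedged sketch of the per-pair IPO and SimPO losses as they are commonly stated. The parameter names (tau, beta, gamma) and default values are assumptions for illustration, and plain floats again stand in for sequence log-probabilities. KTO is omitted because its loss does not operate on pairs.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))


def ipo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    tau: float = 0.1,
) -> float:
    """IPO: squared error pulling the log-ratio gap toward 1 / (2 * tau).

    Unlike the DPO loss, this has a finite target, so the gap is not
    pushed to grow without bound.
    """
    gap = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected
    )
    return (gap - 1.0 / (2.0 * tau)) ** 2


def simpo_loss(
    policy_logp_chosen: float,
    len_chosen: int,
    policy_logp_rejected: float,
    len_rejected: int,
    beta: float = 2.0,
    gamma: float = 0.5,
) -> float:
    """SimPO: no reference model; the reward is the average per-token
    log-probability, and gamma is a target margin between the two rewards."""
    reward_chosen = beta * policy_logp_chosen / len_chosen
    reward_rejected = beta * policy_logp_rejected / len_rejected
    return -log_sigmoid(reward_chosen - reward_rejected - gamma)
```

The contrast with DPO is visible in the signatures: IPO keeps the reference model but swaps the log-sigmoid for a squared loss with a fixed target, while SimPO drops the reference log-probabilities entirely and needs response lengths instead.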

Constitutional AI and RLAIF

Anthropic's Constitutional AI (CAI) approach replaces human preference labels with AI feedback; in the original paper this is done for harmlessness judgments. A "constitution", a set of written principles in the spirit of "be helpful, harmless, and honest", is used to generate AI preference judgments at scale, which are then used in RLAIF (Reinforcement Learning from AI Feedback). This makes preference data far cheaper to collect at scale and avoids labeler fatigue and drift in how labelers apply the guidelines over time, at the cost of inheriting any biases in the AI judge.
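The labeling step can be sketched as a loop that asks a judge to compare two responses under each principle and keeps the majority verdict. Everything here is hypothetical scaffolding: the judge signature, the majority-vote rule, and the keyword-based toy_judge (a stand-in for a real language model judge) are assumptions for illustration, not the procedure from the CAI paper.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical judge signature:
# (principle, prompt, response_a, response_b) -> "A" or "B"
Judge = Callable[[str, str, str, str], str]


def label_with_ai_feedback(
    constitution: List[str],
    samples: List[Tuple[str, str, str]],
    judge: Judge,
) -> List[Dict[str, str]]:
    """Turn (prompt, response_a, response_b) samples into preference pairs."""
    pairs = []
    for prompt, response_a, response_b in samples:
        votes = Counter(
            judge(principle, prompt, response_a, response_b)
            for principle in constitution
        )
        if votes["A"] == votes["B"]:
            continue  # no majority: drop the sample rather than guess
        if votes["A"] > votes["B"]:
            chosen, rejected = response_a, response_b
        else:
            chosen, rejected = response_b, response_a
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


def toy_judge(principle: str, prompt: str, response_a: str, response_b: str) -> str:
    """Stand-in for an LLM judge: prefers B only when A alone has a flagged phrase."""
    flagged = "here's how to hack"
    a_flagged = flagged in response_a.lower()
    b_flagged = flagged in response_b.lower()
    return "B" if a_flagged and not b_flagged else "A"
```

The output has the same (prompt, chosen, rejected) shape as human preference data, so it can feed either a reward model or a DPO-style loss. Any bias in the judge flows straight into those labels.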

Choosing in Practice

For most practitioners: start with DPO. It's simpler, more stable, and produces good results for most alignment tasks. Use RLHF (or its modern successors) when you need the model to discover outputs better than your training data, or when you have the resources to run the full pipeline and benefit from online preference collection.