Policy Gradients: The Math Behind Modern RL

The Optimization Problem

Reinforcement learning seeks to maximize expected cumulative reward. Formally: find a policy π(a|s) — a probability distribution over actions given states — that maximizes:

J(π) = E[Σ_t γ^t r_t | π]

where γ ∈ [0, 1] is the discount factor, r_t is the reward at time t, and the expectation is over trajectories generated by following π. Policy gradient methods parameterize the policy as π_θ and optimize J directly: they compute the gradient of J with respect to the parameters θ and take gradient ascent steps along it.
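As a small illustration of the quantity inside the expectation, the sketch below computes the discounted return of a single finite trajectory. The function name `discounted_return` and the reward values are made up for this example; J(π) is the average of this quantity over many trajectories sampled from the policy.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite trajectory."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Hypothetical three-step trajectory: 1.0 + 0.9 * 0.0 + 0.9**2 * 2.0
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.9))  # approximately 2.62
```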

The Policy Gradient Theorem

The key result (Sutton et al., 1999) shows how to compute ∇J(π) without knowing the environment dynamics:

∇J(π_θ) = E[Q^π(s,a) ∇ log π_θ(a|s)]

where Q^π(s,a) is the action-value function under π, and the expectation is taken over the states and actions encountered when following π_θ. In words: the gradient of expected return equals the expected gradient of the log-policy, weighted by the action-value. Intuitively, the update increases the probability of actions that lead to high value and decreases the probability of actions that lead to low value.

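The theorem rests on the "likelihood ratio" (or score function) identity ∇π_θ = π_θ ∇ log π_θ, which turns the gradient of an expectation into an expectation that can be estimated from samples drawn from the policy. Because the environment dynamics do not depend on θ, they drop out of the gradient, so we never need a model of how the environment generates states and rewards.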
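The likelihood ratio estimator can be checked numerically on a one-step problem where the exact gradient is known. The sketch below assumes a softmax policy over three actions with logits `theta` and a fixed value `q[a]` per action; these names and numbers are illustrative only. The estimator only uses sampled actions and their values, yet it should approach the analytic gradient as the sample count grows.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax for a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def exact_gradient(theta, q):
    """Analytic gradient of J(theta) = sum_a pi(a) * q[a] w.r.t. the logits."""
    pi = softmax(theta)
    return pi * (q - pi @ q)

def score_function_estimate(theta, q, n_samples, rng):
    """Monte Carlo estimate of E[q(a) * grad log pi(a)] from sampled actions."""
    pi = softmax(theta)
    actions = rng.choice(len(pi), size=n_samples, p=pi)
    one_hot = np.eye(len(pi))[actions]
    grad_log_pi = one_hot - pi  # gradient of log softmax w.r.t. the logits
    return (q[actions][:, None] * grad_log_pi).mean(axis=0)

rng = np.random.default_rng(0)
theta = np.array([0.1, -0.3, 0.5])  # hypothetical policy logits
q = np.array([1.0, 2.0, 0.5])       # hypothetical action values
est = score_function_estimate(theta, q, 200_000, rng)
exact = exact_gradient(theta, q)
print("estimate:", est)
print("exact:   ", exact)
```

The two printed vectors should agree to roughly two decimal places; the remaining gap is sampling noise.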

REINFORCE and Its Problems

REINFORCE (Williams, 1992) estimates Q^π with Monte Carlo returns: run a complete episode, compute the discounted return that follows each state-action pair, and use it as the estimate of Q^π(s,a). The estimator is simple and unbiased, but its variance is very high: each return depends on every later action in the episode and on any randomness in the environment, so individual returns are noisy. High variance makes learning slow and unstable.
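A minimal sketch of a single-episode REINFORCE gradient estimate follows, assuming a tabular softmax policy whose logits are stored in an array `theta` of shape (n_states, n_actions). The function names, the toy episode, and the step size 0.1 are assumptions made for illustration. Like many practical implementations, this version omits the extra γ^t factor on each term.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rewards_to_go(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + ... for every step t of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_gradient(theta, states, actions, rewards, gamma):
    """Single-episode REINFORCE gradient estimate for a tabular softmax policy."""
    returns = rewards_to_go(rewards, gamma)
    grad = np.zeros_like(theta)
    for s, a, g in zip(states, actions, returns):
        grad_log_pi = -softmax(theta[s])
        grad_log_pi[a] += 1.0        # one_hot(a) - pi(.|s)
        grad[s] += g * grad_log_pi   # weight the score by the Monte Carlo return
    return grad

# Hypothetical two-step episode in a two-state, two-action problem.
theta = np.zeros((2, 2))
grad = reinforce_gradient(theta, states=[0, 1], actions=[0, 1],
                          rewards=[1.0, 1.0], gamma=1.0)
theta = theta + 0.1 * grad  # one gradient ascent step
print(grad)
```

Because the weights are raw returns from one episode, repeating this with different sampled episodes gives noticeably different gradients, which is the variance problem described above.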

Actor-Critic Methods

Actor-critic methods reduce variance by using a learned value function (the "critic") to estimate Q^π rather than Monte Carlo returns. The critic is trained to minimize TD error; the actor uses the critic's estimates to compute policy gradients. This significantly reduces variance at the cost of introducing bias from the critic's approximation errors.
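The critic update can be sketched for the simplest case: a tabular state-value function trained with one-step TD learning. The function name `td0_update`, the learning rate, and the transition below are illustrative assumptions. The returned TD error is commonly used as a one-sample estimate of how much better the action was than expected, which the actor can use in place of a Monte Carlo return.

```python
import numpy as np

def td0_update(V, s, r, s_next, done, gamma, lr):
    """One TD(0) update of a tabular value function V (modified in place).

    Returns the TD error: r + gamma * V[s_next] - V[s].
    """
    # At a terminal transition there is no next-state value to bootstrap from.
    target = r if done else r + gamma * V[s_next]
    td_error = target - V[s]
    V[s] += lr * td_error
    return td_error

# Hypothetical transition: state 0 -> state 1 with reward 1.0.
V = np.zeros(3)
delta = td0_update(V, s=0, r=1.0, s_next=1, done=False, gamma=0.9, lr=0.5)
print(delta, V)
```

The bias mentioned above enters through `V[s_next]`: while the critic is inaccurate, the bootstrapped target is inaccurate too.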

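A3C (Asynchronous Advantage Actor-Critic) and its synchronous variant A2C use the advantage A^π(s,a) = Q^π(s,a) - V^π(s) as the policy gradient weight. The advantage measures how much better an action is than the policy's average behavior in that state, and subtracting the baseline V^π(s) reduces variance further without changing the expected gradient. PPO (Proximal Policy Optimization) clips the probability ratio between the new and old policies in its surrogate objective, which removes the incentive to move the policy far from the old one in a single update and improves stability in practice. PPO is a common default choice for many RL applications, including RLHF.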
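The clipped surrogate objective from the PPO paper can be written in a few lines. The sketch below is a NumPy illustration only, with no automatic differentiation; the function name `ppo_clip_objective`, the default `clip_eps=0.2`, and the sample inputs are assumptions for the example.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over samples."""
    ratio = np.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum makes the objective a pessimistic bound, so pushing
    # the ratio outside the clip range earns no extra credit.
    return np.mean(np.minimum(unclipped, clipped))

# Hypothetical batch: ratio 1.5 with positive advantage, ratio 0.5 with negative.
logp_old = np.zeros(2)
logp_new = np.log(np.array([1.5, 0.5]))
advantages = np.array([1.0, -1.0])
print(ppo_clip_objective(logp_new, logp_old, advantages))
```

In the first sample the ratio is capped at 1.2, and in the second the clipped term (0.8 times the negative advantage) is the smaller one, so both contributions are limited by the clip range.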

SAC: Entropy Maximization

SAC (Soft Actor-Critic) adds an entropy term to the objective: the policy should not only maximize expected return but also maximize entropy, that is, stay as random as possible while still achieving high reward. This encourages exploration, discourages premature convergence to suboptimal deterministic policies, and tends to improve robustness. SAC is one of the most widely used algorithms for continuous control tasks in robotics and simulation.
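The effect of the entropy term can be seen in a deliberately simplified setting: a single state with a few discrete actions. This is not SAC itself, which targets continuous actions and learns its critics and policy with neural networks; it is only a sketch of the entropy-regularized objective. The weight `alpha` (often called the temperature), the function names, and the action values are assumptions for illustration.

```python
import numpy as np

def entropy(pi):
    """Shannon entropy of a discrete distribution, in nats."""
    pi = np.asarray(pi, dtype=float)
    nonzero = pi[pi > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(nonzero * np.log(nonzero)))

def soft_objective(pi, q, alpha):
    """Expected value plus alpha times policy entropy, for one state."""
    return float(np.dot(pi, q) + alpha * entropy(pi))

def boltzmann_policy(q, alpha):
    """pi(a) proportional to exp(q[a] / alpha); maximizes soft_objective."""
    z = (q - np.max(q)) / alpha
    e = np.exp(z)
    return e / e.sum()

q = np.array([1.0, 0.0])       # hypothetical action values
greedy = np.array([1.0, 0.0])  # deterministic policy on the best action
soft = boltzmann_policy(q, alpha=1.0)
print("greedy objective:", soft_objective(greedy, q, alpha=1.0))
print("soft objective:  ", soft_objective(soft, q, alpha=1.0))
print("soft policy:     ", soft)
```

With the entropy bonus, the stochastic policy scores higher than the greedy one even though it sometimes picks the worse action. Raising `alpha` pushes the maximizing policy toward uniform, and lowering it pushes the policy toward greedy.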