The Mountain in the Dark
Optimization in machine learning is like trying to find the lowest point in a mountain range while blindfolded. You can feel the slope under your feet (the gradient), but you can't see the landscape. Gradient descent is the simplest strategy: always step in the direction that slopes downward. This simple idea, combined with careful tuning and clever variants, is how virtually all neural networks are trained.
Vanilla Gradient Descent
The basic update rule:
θ ← θ - η ∇_θ L(θ)
Here η is the learning rate and ∇_θ L is the gradient of the loss with respect to the parameters θ. Each update takes a small step in the direction of steepest descent. The learning rate η is usually the most critical hyperparameter: too large and training diverges; too small and training is impractically slow.
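As a minimal sketch of the update rule, the snippet below runs gradient descent on a toy loss L(θ) = ½θ², whose gradient is simply θ. The function name `gradient_descent`, the toy loss, and the three learning rates are illustrative assumptions, chosen only to show the "too small / about right / too large" behaviour.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr, steps):
    """Repeat theta <- theta - lr * grad_fn(theta) for a fixed number of steps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

def toy_grad(theta):
    # Gradient of the toy loss L(theta) = 0.5 * theta^2 (minimum at theta = 0)
    return theta

# Same starting point and step budget, three different learning rates
well_tuned = gradient_descent(toy_grad, [1.0], lr=0.1, steps=100)   # shrinks toward 0
too_small = gradient_descent(toy_grad, [1.0], lr=1e-4, steps=100)   # barely moves
too_large = gradient_descent(toy_grad, [1.0], lr=2.5, steps=100)    # |1 - lr| > 1: blows up

print(well_tuned, too_small, too_large)
```

On this toy problem each step multiplies θ by (1 - η), so any η above 2 makes the iterate grow instead of shrink.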
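For a quadratic loss (the simplest case), gradient descent converges geometrically: along each eigenvector of the Hessian, the error is multiplied by (1 - ηλᵢ) at each step, where λᵢ is the corresponding eigenvalue. For small η the slowest direction is the one with the smallest eigenvalue λ_min, while the largest eigenvalue λ_max limits how big η can be (the iteration diverges once η > 2/λ_max). The best worst-case rate is obtained at η = 2/(λ_min + λ_max), which shrinks the error by a factor (κ - 1)/(κ + 1) per step, where κ = λ_max/λ_min is the condition number. Choosing this η requires knowing the Hessian's spectrum, which is not practical for neural networks.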
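The quadratic case can be checked numerically. The sketch below assumes a toy diagonal Hessian with eigenvalues 1 and 10 (an arbitrary choice) and computes the worst-case per-step error multiplier max |1 - ηλ| for a few learning rates; `contraction_factor` is a hypothetical helper name.

```python
import numpy as np

# Assumed Hessian eigenvalues of a toy quadratic L = 0.5 * sum(lam_i * theta_i^2)
eigenvalues = np.array([1.0, 10.0])

def contraction_factor(lr, eigenvalues):
    """Worst-case per-step error multiplier: max over i of |1 - lr * lam_i|."""
    return float(np.max(np.abs(1.0 - lr * eigenvalues)))

lr_opt = 2.0 / (eigenvalues.min() + eigenvalues.max())
kappa = eigenvalues.max() / eigenvalues.min()  # condition number

rate_small = contraction_factor(0.05, eigenvalues)   # limited by the smallest eigenvalue
rate_opt = contraction_factor(lr_opt, eigenvalues)   # both directions contract equally
rate_large = contraction_factor(0.21, eigenvalues)   # above 1: diverges along lam_max

# Compare the predicted rate with an actual run of gradient descent
theta = np.array([1.0, 1.0])
for _ in range(20):
    theta = theta - lr_opt * eigenvalues * theta  # gradient of the quadratic is lam * theta
measured = np.linalg.norm(theta) / np.sqrt(2.0)   # error norm relative to the start

print(rate_small, rate_opt, rate_large, measured)
```

At η = 2/(λ_min + λ_max) the two extreme directions contract at the same rate, which is why no other fixed step size does better in the worst case.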
Stochastic Gradient Descent
Computing the exact gradient requires evaluating the loss over the full dataset — expensive for millions of examples. SGD (Stochastic Gradient Descent) approximates the gradient using a random mini-batch:
∇L(θ) ≈ ∇̃L(θ) = (1/B) Σ_{i∈batch} ∇L(xᵢ, yᵢ; θ)
Here B is the batch size. The estimate is noisy, but each step is much cheaper than a full-batch gradient step. Surprisingly, SGD's noise is often beneficial: it acts as an implicit regularizer, helps the optimizer escape sharp minima, and is thought to bias optimization toward flatter minima that generalize better (the "flat minima" hypothesis).
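To make the mini-batch estimate concrete, here is a small sketch on synthetic linear-regression data. The dataset, the squared-error loss, the batch size of 64, and the learning rate are all assumptions made for illustration; `mse_gradient` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (illustrative assumption)
N, D, B = 10_000, 5, 64
X = rng.normal(size=(N, D))
true_theta = rng.normal(size=D)
y = X @ true_theta + 0.1 * rng.normal(size=N)

def mse_gradient(theta, X, y):
    """Gradient of the mean of 0.5 * (x . theta - y)^2 over the given rows."""
    residual = X @ theta - y
    return X.T @ residual / len(y)

theta = np.zeros(D)

# Full-batch gradient versus a single mini-batch estimate at the same point
full_grad = mse_gradient(theta, X, y)
batch_idx = rng.choice(N, size=B, replace=False)
batch_grad = mse_gradient(theta, X[batch_idx], y[batch_idx])

# SGD loop: each step touches only B examples instead of all N
lr = 0.05
for step in range(500):
    idx = rng.choice(N, size=B, replace=False)
    theta = theta - lr * mse_gradient(theta, X[idx], y[idx])

print(np.linalg.norm(theta - true_theta))
```

The mini-batch gradient differs from the full gradient on any single step, yet the noisy steps still carry θ close to the generating parameters.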
Momentum
SGD oscillates in directions of high curvature. Momentum smooths this by accumulating a velocity vector that decays over time:
v ← μv - η∇L
θ ← θ + v
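With momentum μ = 0.9, the velocity is an exponentially weighted sum of past gradients with an effective horizon of about 1/(1 - μ) = 10 steps. Components that flip sign from step to step cancel out, which dampens oscillations, while components that point consistently in one direction add up, which speeds progress along flat directions. Nesterov momentum evaluates the gradient slightly ahead of the current position (at θ + μv), which gives better theoretical convergence guarantees on convex problems.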
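The effect of momentum is easiest to see on an ill-conditioned quadratic. The sketch below assumes a toy loss with curvatures 1 and 100 and a deterministic gradient (no mini-batch noise), and compares the update above with μ = 0 (plain gradient descent) and μ = 0.9 at the same learning rate. The function name and all constants are illustrative.

```python
import numpy as np

curvatures = np.array([1.0, 100.0])  # assumed toy quadratic: L = 0.5 * sum(c_i * theta_i^2)

def quad_grad(theta):
    return curvatures * theta

def momentum_descent(grad_fn, theta0, lr, mu, steps):
    """v <- mu * v - lr * grad;  theta <- theta + v.  mu = 0 gives plain gradient descent."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = mu * v - lr * grad_fn(theta)
        theta = theta + v
    return theta

start = [1.0, 1.0]
plain = momentum_descent(quad_grad, start, lr=0.018, mu=0.0, steps=100)
with_momentum = momentum_descent(quad_grad, start, lr=0.018, mu=0.9, steps=100)

err_plain = np.linalg.norm(plain)            # distance from the minimum at 0
err_momentum = np.linalg.norm(with_momentum)
print(err_plain, err_momentum)
```

The learning rate is capped by the steep direction (it must stay below 2/100 for plain gradient descent), so the flat direction crawls; momentum makes up the difference there.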
Adam and AdamW
Adam adapts the learning rate per parameter based on running estimates of the first and second moments of the gradient g = ∇_θ L (all operations are elementwise, and t counts steps starting from 1):
m ← β₁m + (1-β₁)g # first moment (mean)
v ← β₂v + (1-β₂)g² # second moment (uncentered variance)
m̂ = m/(1-β₁ᵗ) # bias correction
v̂ = v/(1-β₂ᵗ) # bias correction
θ ← θ - η m̂/(√v̂ + ε)
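Parameters with consistently large gradients get smaller effective learning rates (v̂ grows large), and parameters with consistently small gradients get larger ones. When a gradient is noisy, with its sign flipping from step to step, m̂ shrinks relative to √v̂ and the update shrinks with it. This automatic per-parameter scaling is why Adam often works well with little tuning, whereas SGD typically needs a carefully chosen learning rate and schedule.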
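The update equations translate almost line for line into NumPy. This is a minimal sketch, not a reference implementation: `adam_step` is a hypothetical function name, and the default hyperparameters (η = 1e-3, β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are the commonly used values, assumed here rather than taken from the text above.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count; m and v are running moments."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)                # bias correction
    v_hat = v / (1 - beta2**t)                # bias correction
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Two parameters whose gradients differ by four orders of magnitude
theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
constant_grad = np.array([100.0, 0.01])

for t in range(1, 11):
    theta, m, v = adam_step(theta, constant_grad, m, v, t)

print(theta)  # both coordinates move by about lr per step despite the scale gap
```

With a constant gradient, m̂/√v̂ is close to ±1 for every coordinate, so each parameter moves by roughly η per step regardless of gradient scale.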
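AdamW (Loshchilov & Hutter, 2019) fixes a subtle problem with L2 regularization in Adam. When the regularization term λθ is added to the gradient, Adam divides it by √v̂ along with everything else, so weights with large gradient histories are regularized less than intended. AdamW instead applies weight decay directly to the parameters, outside the adaptive scaling. This produces more predictable regularization, and AdamW is a common default optimizer for training large models.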
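The difference between the two schemes shows up even in a single step. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the decay term is scaled by the learning rate (one common convention), and the data gradient is set to zero so that only the regularization acts on the weights.

```python
import numpy as np

def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam direction m_hat / (sqrt(v_hat) + eps), plus updated moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return m_hat / (np.sqrt(v_hat) + eps), m, v

def adam_l2_step(theta, grad, m, v, t, lr, wd):
    # L2 regularization: the decay term enters the gradient and gets rescaled by Adam
    direction, m, v = adam_direction(grad + wd * theta, m, v, t)
    return theta - lr * direction, m, v

def adamw_step(theta, grad, m, v, t, lr, wd):
    # Decoupled weight decay: the decay term bypasses the adaptive scaling
    direction, m, v = adam_direction(grad, m, v, t)
    return theta - lr * (direction + wd * theta), m, v

theta = np.array([1.0, 0.01])            # one large weight, one small weight
zero_grad = np.zeros(2)                  # no data gradient: only regularization acts
state = (np.zeros(2), np.zeros(2))       # initial moments (m, v)

w_l2, _, _ = adam_l2_step(theta, zero_grad, *state, t=1, lr=0.01, wd=0.1)
w_adamw, _, _ = adamw_step(theta, zero_grad, *state, t=1, lr=0.01, wd=0.1)

print(w_l2)     # both weights move by about lr, whatever their size
print(w_adamw)  # both weights shrink by the same factor (1 - lr * wd)
```

With decoupled decay every weight shrinks proportionally, as weight decay is meant to; with the L2 term routed through Adam, the normalization turns it into a fixed-size step that depends on neither the weight's magnitude nor the decay coefficient.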