Course Description
Optimization theory for deep learning. Convex optimization: definitions, gradient inequalities, convergence rates. First-order methods: gradient descent, projected gradient, proximal gradient. Stochastic gradient descent: convergence, learning rate schedules, noise as regularization. Adaptive methods: derivation and analysis of AdaGrad, RMSprop, Adam, and AdamW. Non-convex optimization: saddle points, local minima, loss landscape visualization. Second-order methods and why they're mostly impractical at scale.