Probability Theory for Deep Learning

Why Probability Is Non-Negotiable

Deep learning is fundamentally probabilistic: models output probability distributions, loss functions come from probability theory, regularization has Bayesian interpretations, and uncertainty quantification requires probabilistic reasoning. You can write code without understanding probability, but you can't debug, evaluate, or improve systems without it.

Basic Probability

A probability distribution assigns probabilities to events such that all probabilities are non-negative and sum to 1 (for continuous variables, the density is non-negative and integrates to 1). Key distributions for ML:

Gaussian (Normal): Most common. The central limit theorem says that suitably normalized sums of many independent random variables with finite variance converge to a Gaussian. Shows up in weight initialization, noise models, VAE priors.
Bernoulli/Categorical: Binary and multi-class classification outputs. Categorical cross-entropy is the natural loss for categorical distributions.
Uniform: Used for random initialization, some regularization.
Dirichlet: Distribution over probability vectors; used in topic models and Bayesian neural networks.
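
To make the link between categorical outputs and cross-entropy concrete, here is a minimal sketch using only the Python standard library. The logits are made-up values for a hypothetical 3-class model, and the function names are illustrative rather than taken from any framework.

```python
import math

def softmax(logits):
    """Turn raw scores into a categorical distribution (non-negative, sums to 1)."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-probability the model assigns to the true class."""
    return -math.log(probs[target_index])

logits = [2.0, 1.0, 0.1]  # hypothetical model outputs for 3 classes
probs = softmax(logits)
loss = cross_entropy(probs, target_index=0)
print(probs, sum(probs), loss)
```

Minimizing this loss over a dataset is the same as maximizing the likelihood of the observed labels under the categorical distribution the model outputs.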

Bayes' Theorem

Bayes' theorem connects prior beliefs to updated beliefs after observing data:

p(θ|D) = p(D|θ) p(θ) / p(D)

Posterior ∝ Likelihood × Prior. Maximum likelihood estimation (what most deep learning does) maximizes p(D|θ), ignoring the prior. Maximum a posteriori (MAP) estimation maximizes p(D|θ)p(θ), which is maximum likelihood with a regularization term: taking logs gives log p(D|θ) + log p(θ), and the log-prior acts as the penalty. L2 regularization corresponds to a zero-mean Gaussian prior on weights; L1 regularization corresponds to a Laplace prior.
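
The Gaussian-prior/L2 correspondence can be checked on a toy problem. The sketch below assumes a single parameter θ (the mean of a Gaussian with known variance sigma2) and a zero-mean Gaussian prior with variance tau2. The data values, function names, and step size are illustrative choices. The closed-form MAP estimate is compared against gradient descent on a squared-error loss with an L2 penalty of strength lam = sigma2 / tau2.

```python
def mle_mean(xs):
    """Maximum likelihood estimate of the mean: the sample average."""
    return sum(xs) / len(xs)

def map_mean(xs, sigma2, tau2):
    """MAP estimate with likelihood N(theta, sigma2) and prior N(0, tau2).

    Negative log-posterior, up to a constant:
        sum((x - theta)^2) / (2*sigma2) + theta^2 / (2*tau2)
    Setting its derivative to zero gives the closed form below.
    """
    return sum(xs) / (len(xs) + sigma2 / tau2)

def ridge_mean(xs, lam, steps=2000, lr=0.01):
    """Minimize 0.5*sum((x - theta)^2) + 0.5*lam*theta^2 by gradient descent."""
    theta = 0.0
    for _ in range(steps):
        grad = sum(theta - x for x in xs) + lam * theta
        theta -= lr * grad
    return theta

xs = [1.2, 0.8, 1.5, 1.1, 0.9]  # made-up observations
sigma2, tau2 = 1.0, 0.5
mle_est = mle_mean(xs)
map_est = map_mean(xs, sigma2, tau2)
ridge_est = ridge_mean(xs, lam=sigma2 / tau2)
print(mle_est, map_est, ridge_est)
```

The MAP and L2-penalized estimates agree, and both are pulled toward zero relative to the maximum likelihood estimate; a tighter prior (smaller tau2) means a stronger penalty.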

Expectation, Variance, and Moments

Expected value E[X] = Σ x p(x) is the probability-weighted average (for a continuous variable, an integral over the density replaces the sum). Variance Var[X] = E[(X-E[X])²] = E[X²] - (E[X])² measures spread. Standard deviation is the square root of variance.
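
As a quick worked example of these definitions, the sketch below computes the mean and variance of a fair six-sided die with exact fractions and checks that both variance formulas agree. The helper names are illustrative.

```python
from fractions import Fraction

def expectation(pmf, f=lambda x: x):
    """E[f(X)] for a discrete distribution given as {value: probability}."""
    return sum(f(x) * p for x, p in pmf.items())

def variance(pmf):
    """Var[X] from the definition E[(X - E[X])^2]."""
    mean = expectation(pmf)
    return expectation(pmf, lambda x: (x - mean) ** 2)

die = {face: Fraction(1, 6) for face in range(1, 7)}
mean = expectation(die)
var_definition = variance(die)
var_shortcut = expectation(die, lambda x: x ** 2) - mean ** 2
print(mean, var_definition, var_shortcut)  # 7/2 35/12 35/12
```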

In training: batch statistics (used in batch normalization) are estimates of population means and variances. Gradient variance is a key optimization concept: high variance means unreliable gradient estimates and unstable training. Adam divides each parameter's update by a running estimate of the root mean square of its gradients, which damps the steps taken for parameters whose gradients are large or noisy.
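
The effect of batch size on gradient variance can be seen in a toy setting. The sketch below assumes a one-parameter loss 0.5*(w - x)² on synthetic standard-normal data, so the per-example gradient is simply w - x. The dataset, seed, and function names are illustrative assumptions, not part of any real training setup.

```python
import random
import statistics

def minibatch_gradient(w, data, batch_size, rng):
    """Gradient of the batch-averaged loss 0.5*(w - x)^2 on a random mini-batch."""
    batch = rng.sample(data, batch_size)
    return sum(w - x for x in batch) / batch_size

def gradient_variance(w, data, batch_size, trials, rng):
    """Empirical variance of the mini-batch gradient across repeated draws."""
    grads = [minibatch_gradient(w, data, batch_size, rng) for _ in range(trials)]
    return statistics.pvariance(grads)

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # synthetic dataset
small = gradient_variance(0.5, data, batch_size=4, trials=2000, rng=rng)
large = gradient_variance(0.5, data, batch_size=64, trials=2000, rng=rng)
print(small, large)
```

Because the mini-batch gradient is an average of per-example gradients, its variance falls roughly in proportion to 1/batch_size, so the batch of 64 should give a variance about 16 times smaller than the batch of 4.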

Monte Carlo Methods

Many integrals and expectations in ML can't be computed analytically. Monte Carlo methods approximate them by sampling: E[f(X)] ≈ (1/N) Σ f(xᵢ) where xᵢ ~ p(X). The law of large numbers guarantees convergence as N grows, and the standard error of the estimate shrinks as 1/√N. Applications: SGD (the mini-batch gradient is a Monte Carlo estimate of the full-data gradient), REINFORCE (Monte Carlo estimate of the policy gradient), variational inference (Monte Carlo estimation of the ELBO).
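
A minimal sketch of the Monte Carlo estimator, using only the standard library: it estimates E[X²] for X ~ N(0, 1), whose exact value is 1 (the variance of a standard normal). The function name, seed, and sample size are illustrative choices.

```python
import random

def monte_carlo_expectation(f, sampler, n):
    """Approximate E[f(X)] by averaging f over n independent samples."""
    return sum(f(sampler()) for _ in range(n)) / n

rng = random.Random(0)  # fixed seed so the run is repeatable
estimate = monte_carlo_expectation(
    f=lambda x: x * x,
    sampler=lambda: rng.gauss(0.0, 1.0),
    n=100_000,
)
print(estimate)  # close to the exact value 1.0
```

Rerunning with a larger n tightens the estimate, but slowly: halving the error takes about four times as many samples.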
