Information Theory for Machine Learning Practitioners

Why Information Theory Matters for ML

Shannon's information theory, published in 1948 to solve problems in communications engineering, is deeply connected to machine learning. Cross-entropy, the standard loss function for classification, is an information-theoretic quantity. So is the KL divergence that appears in VAEs, RL, and alignment training, and so is the mutual information used to analyze what neural networks learn. Knowing these connections explains why standard ML methods work: what a loss function is actually measuring, and what a KL term is actually constraining.

Entropy: Measuring Uncertainty

Shannon entropy measures the average information content of a random variable:

H(X) = -Σ_x p(x) log p(x)

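High entropy = high uncertainty. With a base-2 logarithm, entropy is measured in bits; the natural logarithm, common in ML code, gives nats. A fair coin has entropy 1 bit. A two-headed coin has entropy 0 (no uncertainty). A uniform distribution over 256 outcomes has entropy 8 bits.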
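The three values above can be checked with a few lines of Python. This is a minimal sketch: the function name entropy_bits is ours, and it takes base-2 logarithms so the result is in bits.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution given as a list of probabilities."""
    # Outcomes with p == 0 are skipped: p * log(1/p) tends to 0 as p -> 0.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

fair_coin = entropy_bits([0.5, 0.5])          # maximum uncertainty over two outcomes
two_headed = entropy_bits([1.0, 0.0])         # outcome is certain
uniform_256 = entropy_bits([1 / 256] * 256)   # one of 256 equally likely outcomes

print(fair_coin, two_headed, uniform_256)
```

The uniform case shows the general pattern: a uniform distribution over n outcomes has entropy log2(n) bits, which is the number of bits needed to index one outcome.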

In ML: the entropy of a model's output distribution is a common measure of how uncertain the model is about a prediction. Soft actor-critic (SAC), a reinforcement learning algorithm, adds an entropy term to its objective, which explicitly rewards high-entropy policies. The entropy of the data also limits what learning can achieve: the conditional entropy of the labels given the inputs is a lower bound on the expected cross-entropy loss of any model.
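As an illustration of output entropy as an uncertainty measure, the sketch below converts logits to probabilities with a softmax and computes the entropy of the result. The function names and the example logits are hypothetical, chosen only to contrast a confident prediction with an uncertain one.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow in exp."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predictive_entropy_bits(logits):
    """Entropy, in bits, of the softmax distribution over classes."""
    probs = softmax(logits)
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Hypothetical logits for a 3-class classifier.
confident = predictive_entropy_bits([8.0, 0.0, 0.0])   # one class dominates
uncertain = predictive_entropy_bits([0.1, 0.0, -0.1])  # classes nearly tied
max_possible = math.log2(3)                            # entropy of a uniform 3-class output

print(confident, uncertain, max_possible)
```

The entropy is bounded above by log2 of the number of classes, so dividing by that bound gives a score between 0 and 1 that is comparable across models with different numbers of classes.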

Cross-Entropy: The Standard Loss Function

Cross-entropy H(p, q) is the expected log-loss, or equivalently the average code length, when a code optimized for distribution q is used to encode events drawn from distribution p:

H(p, q) = -Σ_x p(x) log q(x)

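In classification: p is the true distribution (one-hot label), q is the model's predicted distribution. With a one-hot p, the sum collapses to a single term, -log q(true class), which is the negative log-likelihood of the label. Minimizing cross-entropy summed over the training set is therefore exactly maximum likelihood estimation: you're finding model parameters that make the training data most likely. Cross-entropy = entropy + KL divergence: H(p, q) = H(p) + KL(p||q). Since H(p) does not depend on the model, minimizing cross-entropy also minimizes KL(p||q), and for one-hot labels H(p) = 0, so the two are equal.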
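Both claims can be checked numerically. The sketch below uses natural logarithms (nats); the function names and the example distributions are ours, and the code assumes q(x) > 0 wherever p(x) > 0.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats. Assumes q(x) > 0 wherever p(x) > 0."""
    return sum(-px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def entropy(p):
    """H(p) is the cross-entropy of p with itself."""
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """KL(p||q) = sum_x p(x) log(p(x) / q(x)), in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical 3-class example: the true class is index 1.
label = [0.0, 1.0, 0.0]
predicted = [0.2, 0.7, 0.1]

# With a one-hot label, cross-entropy is the negative log-probability of the true class.
loss = cross_entropy(label, predicted)
nll = -math.log(predicted[1])

# With a soft (non-one-hot) target, H(p) > 0 and the decomposition has two nonzero parts.
soft = [0.1, 0.8, 0.1]
lhs = cross_entropy(soft, predicted)
rhs = entropy(soft) + kl_divergence(soft, predicted)

print(loss, nll)
print(lhs, rhs)
```

Each printed pair should agree up to floating-point rounding.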

KL Divergence: Measuring Distribution Difference

KL divergence (Kullback-Leibler divergence) measures how much one distribution differs from another:

KL(p||q) = Σ_x p(x) log(p(x)/q(x))

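It's not symmetric (in general KL(p||q) ≠ KL(q||p)), it is never negative, and it is zero only when p = q. It can be interpreted as the extra bits needed to encode samples from p when using an optimal code for q. KL appears everywhere in ML: RLHF with PPO typically penalizes the KL between the policy being trained and a reference policy, which keeps the trained policy from drifting too far; VAEs include a KL term between the approximate posterior and the prior in their loss; information bottleneck theory formalizes compression with mutual information, which is itself a KL divergence.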
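The asymmetry is easy to see on a small example. In this sketch the two distributions are arbitrary illustrative choices, and base-2 logarithms are used so the values read as extra bits per sample.

```python
import math

def kl_divergence_bits(p, q):
    """KL(p||q) in bits. Assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.9, 0.1]   # a biased coin
q = [0.5, 0.5]   # a fair coin

forward = kl_divergence_bits(p, q)   # cost of coding biased-coin data with a fair-coin code
reverse = kl_divergence_bits(q, p)   # cost of coding fair-coin data with a biased-coin code
same = kl_divergence_bits(p, p)      # no cost when the code matches the source

print(forward, reverse, same)
```

Because the two directions differ, it matters which distribution goes first: KL(p||q) averages over samples from p, so it heavily penalizes q for assigning low probability to outcomes that p produces often.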

Mutual Information: What's Shared Between Variables

Mutual information measures how much information X and Y share:

I(X;Y) = KL(p(x,y) || p(x)p(y)) = H(X) - H(X|Y)

High mutual information = knowing Y tells you a lot about X. Research on "what neural networks learn" often uses mutual information between representations and input/output: good representations should have high I(representation; label) and low I(representation; spurious features). The information bottleneck principle proposes that good representations compress away irrelevant information while retaining task-relevant information.
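For small discrete variables, both forms of the definition, the KL form and H(X) - H(X|Y), can be computed directly from a joint probability table. The sketch below does this in bits; the function names and the three example tables are ours. Estimating mutual information for high-dimensional neural network representations is a much harder problem and is not what this code does.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def mutual_information_bits(joint):
    """I(X;Y) = KL(p(x,y) || p(x)p(y)) in bits, where joint[x][y] = p(x, y)."""
    px = [sum(row) for row in joint]          # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y (column sums)
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

def conditional_entropy_bits(joint):
    """H(X|Y) = sum_y p(y) * H(X | Y = y)."""
    total = 0.0
    for col in zip(*joint):
        py = sum(col)
        if py > 0:
            total += py * entropy_bits([pxy / py for pxy in col])
    return total

independent = [[0.25, 0.25], [0.25, 0.25]]   # Y says nothing about X
copy = [[0.5, 0.0], [0.0, 0.5]]              # Y determines X exactly
noisy = [[0.4, 0.1], [0.1, 0.4]]             # Y agrees with X 80% of the time

for name, joint in [("independent", independent), ("copy", copy), ("noisy", noisy)]:
    px = [sum(row) for row in joint]
    via_kl = mutual_information_bits(joint)
    via_entropy = entropy_bits(px) - conditional_entropy_bits(joint)
    print(name, via_kl, via_entropy)
```

The independent table gives zero shared information, the copy table gives the full 1 bit of H(X), and the noisy table falls in between.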