Lecture: Diffusion Models from First Principles

Learning Objectives

Derive the diffusion forward and reverse processes from first principles. Implement DDPM training and sampling in PyTorch. Understand classifier-free guidance. Connect DDPM to latent diffusion and flow matching.

1. The Problem: Learning to Generate

We want to learn to sample from an unknown data distribution p_data(x), for example the distribution of all natural images. If we had p_data in closed form, we could sample from it directly. Instead, we only have a finite set of samples from p_data (the training images) and must learn the distribution from them.

Diffusion models take an indirect approach: instead of learning p_data directly (intractable), learn to reverse a known destruction process that converts p_data to pure noise.

2. The Forward Process: Destroying Data

Define a T-step process that gradually adds Gaussian noise to data. At each step t:

q(x_t | x_{t-1}) = N(x_t; sqrt(1 - β_t) * x_{t-1}, β_t * I)

Where β_t ∈ (0, 1) is the noise variance at step t, set by a variance schedule that increases from β_1 ≈ 0.0001 to β_T ≈ 0.02.
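
As a minimal sketch of such a schedule, the snippet below spaces the variances linearly between the two endpoints given above. The linear spacing, the value T = 1000, and the helper name linear_beta_schedule are assumptions for illustration; the text only fixes the endpoints.

```python
import torch

def linear_beta_schedule(timesteps: int,
                         beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> torch.Tensor:
    """Hypothetical helper: linearly spaced variances beta_1 .. beta_T."""
    return torch.linspace(beta_start, beta_end, timesteps)

T = 1000  # assumed number of diffusion steps
betas = linear_beta_schedule(T)  # betas[0] is beta_1, betas[-1] is beta_T
```

Other spacings (for example cosine-shaped schedules) fit the same interface, since everything downstream only consumes the betas tensor.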

Key property (obtained by composing the Gaussian steps, whose scale factors telescope into a product):

q(x_t | x_0) = N(x_t; sqrt(ᾱ_t) * x_0, (1 - ᾱ_t) * I)

Where ᾱ_t = ∏_{s=1}^t (1 - β_s)
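
The step behind this property can be sketched as follows, using the shorthand α_t = 1 - β_t (a notational assumption introduced here). Two independent Gaussian noise terms merge into one whose variance is the sum of the two variances, and repeating the substitution down to x_0 yields the product ᾱ_t.

```latex
\begin{align*}
x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t,
       \qquad \alpha_t := 1-\beta_t,\quad \epsilon_t \sim \mathcal{N}(0, I) \\
    &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2}
       + \sqrt{\alpha_t(1-\alpha_{t-1})}\,\epsilon_{t-1}
       + \sqrt{1-\alpha_t}\,\epsilon_t \\
    &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2}
       + \sqrt{1-\alpha_t\alpha_{t-1}}\,\bar{\epsilon},
       \qquad \bar{\epsilon} \sim \mathcal{N}(0, I) \\
    &\;\;\vdots \\
    &= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
       \qquad \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s
\end{align*}
```

The third line uses α_t(1 - α_{t-1}) + (1 - α_t) = 1 - α_t α_{t-1}.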

This lets us sample x_t directly from x_0 without running t steps:
x_t = sqrt(ᾱ_t) * x_0 + sqrt(1 - ᾱ_t) * ε   where ε ~ N(0, I)
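
The closed-form sampling rule above can be sketched in PyTorch as shown below. The helper names extract and q_sample, the linear schedule, and the random stand-in batch are assumptions for illustration. Note that the tensors are 0-indexed, so index 0 corresponds to step t = 1.

```python
import torch

def extract(values: torch.Tensor, t: torch.Tensor, x_shape: torch.Size) -> torch.Tensor:
    """Pick values[t] per batch element and reshape to (B, 1, 1, ...) for broadcasting."""
    out = values.gather(0, t)
    return out.reshape(t.shape[0], *((1,) * (len(x_shape) - 1)))

def q_sample(x_start, t, alphas_cumprod, noise=None):
    """Draw x_t ~ q(x_t | x_0) in one shot using the closed form."""
    if noise is None:
        noise = torch.randn_like(x_start)
    sqrt_ac = extract(alphas_cumprod.sqrt(), t, x_start.shape)
    sqrt_om = extract((1.0 - alphas_cumprod).sqrt(), t, x_start.shape)
    return sqrt_ac * x_start + sqrt_om * noise

# Assumed schedule: linear betas over 1000 steps
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # ᾱ_t for every step

x0 = torch.randn(4, 1, 28, 28)        # stand-in for a batch of images
t = torch.tensor([0, 249, 499, 999])  # a different timestep per sample
x_t = q_sample(x0, t, alphas_cumprod)
```

Reshaping the per-sample coefficients matters: indexing the schedule with a batch of timesteps gives a tensor of shape (B,), which does not broadcast against (B, C, H, W) on its own.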

3. Training the Denoising Network

The model ε_θ(x_t, t) learns to predict the noise ε that was added:

import torch
import torch.nn as nn
import torch.nn.functional as F

def p_losses(denoise_model, x_start, t, noise=None):
    """
    Simple DDPM training objective: predict the added noise.

    x_start: clean batch x_0 of shape (B, C, H, W); t: long tensor of shape (B,).
    """
    if noise is None:
        noise = torch.randn_like(x_start)

    # Get noisy image at timestep t
    alphas_cumprod = ...  # precomputed schedule
    sqrt_alphas_cumprod = alphas_cumprod.sqrt()
    sqrt_one_minus_alphas_cumprod = (1 - alphas_cumprod).sqrt()

    # Reshape the per-sample coefficients to (B, 1, 1, 1) so they broadcast
    # over the channel and spatial dimensions
    sqrt_ac_t = sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
    sqrt_om_t = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
    x_noisy = sqrt_ac_t * x_start + sqrt_om_t * noise

    # Predict the noise
    predicted_noise = denoise_model(x_noisy, t)

    # MSE loss between predicted and actual noise
    loss = F.mse_loss(predicted_noise, noise)
    return loss
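
To show how this objective is used in a training loop, here is a minimal self-contained sketch of one optimization step. TinyDenoiser is a toy stand-in for a real U-Net, and training_step, the linear schedule, the learning rate, and the random batch are all assumptions for illustration. The timestep is drawn uniformly at random for each sample.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000  # assumed number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)  # assumed linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class TinyDenoiser(nn.Module):
    """Toy stand-in for a U-Net: feeds t / T to the network as an extra input channel."""
    def __init__(self, channels=1, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, x, t):
        t_map = (t.float() / T).view(-1, 1, 1, 1).expand(-1, 1, *x.shape[2:])
        return self.net(torch.cat([x, t_map], dim=1))

def training_step(model, optimizer, x_start):
    """One gradient step on the noise-prediction loss."""
    b = x_start.shape[0]
    t = torch.randint(0, T, (b,), device=x_start.device)  # uniform timesteps
    noise = torch.randn_like(x_start)
    a_bar = alphas_cumprod.to(x_start.device)[t].view(b, 1, 1, 1)
    x_noisy = a_bar.sqrt() * x_start + (1.0 - a_bar).sqrt() * noise
    loss = F.mse_loss(model(x_noisy, t), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = TinyDenoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x_batch = torch.randn(8, 1, 28, 28)  # stand-in for a batch of training images
loss_value = training_step(model, optimizer, x_batch)
```

In practice the toy network would be replaced by a U-Net with a proper timestep embedding, and the step would run over batches from a data loader.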

4. DDPM Sampling: The Reverse Process

@torch.no_grad()
def p_sample(model, x, t, t_index, betas, posterior_variance):
    """Single denoising step from x_t to x_{t-1}."""

    # Predict noise
    predicted_noise = model(x, t)

    # Compute the mean of p_θ(x_{t-1} | x_t) from the predicted noise
    alpha = 1 - betas[t_index]
    alpha_cumprod = ...  # ᾱ_t

    mean = (1 / alpha.sqrt()) * (
        x - betas[t_index] / (1 - alpha_cumprod).sqrt() * predicted_noise
    )

    if t_index == 0:
        # Final step: return the mean without adding noise
        return mean
    else:
        # Add noise scaled by the posterior variance
        noise = torch.randn_like(x)
        return mean + posterior_variance[t_index].sqrt() * noise

@torch.no_grad()
def sample(model, image_size, betas, posterior_variance, batch_size=16, channels=3):
    """Full DDPM sampling: start from noise, denoise T times."""

    T = betas.shape[0]  # number of diffusion steps
    device = betas.device  # assumes the model lives on the same device
    shape = (batch_size, channels, image_size, image_size)
    x = torch.randn(shape, device=device)  # Start from pure noise

    for i in reversed(range(T)):
        t = torch.full((batch_size,), i, dtype=torch.long, device=device)
        x = p_sample(model, x, t, i, betas, posterior_variance)

    return x
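
The sampler takes posterior_variance as an input without showing where it comes from. A minimal sketch, assuming the standard DDPM posterior variance β̃_t = β_t (1 - ᾱ_{t-1}) / (1 - ᾱ_t) with ᾱ_0 = 1, and assuming a linear schedule with T = 1000:

```python
import torch

T = 1000  # assumed number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)  # assumed linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # ᾱ_t

# ᾱ_{t-1}, shifted by one step with ᾱ_0 = 1 prepended
alphas_cumprod_prev = torch.cat([torch.ones(1), alphas_cumprod[:-1]])

# Variance of q(x_{t-1} | x_t, x_0); entry 0 is exactly zero
posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)
```

Because the first entry is zero, the final denoising step adds no noise even without the explicit t_index == 0 branch. Using β_t itself as the variance is another common choice.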

5. Classifier-Free Guidance

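To condition generation on a text prompt (or class label), classifier-free guidance (CFG) trains a single model on both the conditional and the unconditional objective by randomly dropping the condition during training. At inference, the model is evaluated with and without the condition, and the two noise predictions are linearly combined: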

# At training time: randomly drop the condition with probability p_uncond
# At inference time:
guided_noise = (1 + w) * conditional_noise - w * unconditional_noise

# Where w is the guidance scale:
# w=0: no guidance (plain conditional sampling; w=-1 would give unconditional sampling)
# w≈7.5: typical for image generation (strong guidance); many implementations
#        instead report s = 1 + w, with guided_noise = uncond + s * (cond - uncond)
# w=15+: over-guidance produces oversaturated, less natural images
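
Both halves of CFG can be sketched as below for class-conditional generation. The three-argument model signature model(x, t, labels), the use of an extra "null" class index as the dropped condition, the default p_uncond = 0.1, and the helper names are assumptions for illustration, not a fixed API.

```python
import torch

def drop_condition(labels, null_label, p_uncond=0.1):
    """Training time: replace each label by the null label with probability p_uncond."""
    mask = torch.rand(labels.shape[0], device=labels.device) < p_uncond
    return torch.where(mask, torch.full_like(labels, null_label), labels)

def cfg_noise(model, x, t, labels, null_label, w):
    """Inference time: combine conditional and unconditional noise predictions."""
    eps_cond = model(x, t, labels)
    eps_uncond = model(x, t, torch.full_like(labels, null_label))
    return (1.0 + w) * eps_cond - w * eps_uncond
```

For MNIST, for instance, the labels 0 to 9 could be the conditions and index 10 the null label, so the embedding table would need 11 entries. Each guided sampling step costs two forward passes, which are often batched together.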

Exercise

Train a small DDPM on the MNIST dataset (28×28 grayscale images). Monitor training loss and sample images at epochs 10, 50, and 100. At what point do generated digits become recognizable? How does the guidance scale affect sample quality for digit-conditioned generation?