Diffusion Models Explained: From Noise to Art

The Counterintuitive Process

Diffusion models generate images by learning to reverse a process of gradual noise addition. The insight seems backwards: to learn to create clean images, you first learn to destroy them. But this counterintuitive approach has produced the most capable image generation systems in history, from Stable Diffusion to DALL-E 3 to FLUX.

The Forward Process: Adding Noise

Training starts with real images. A diffusion model's forward process takes a clean image x₀ and adds Gaussian noise over T steps, producing increasingly noisy images x₁, x₂, ..., x_T, until at step T the image is practically indistinguishable from pure Gaussian noise. The process is mathematically defined as a Markov chain:

q(xₜ|xₜ₋₁) = N(xₜ; √(1-βₜ)xₜ₋₁, βₜI)

Here βₜ comes from a noise schedule that controls how much noise is added at step t. After T steps (typically T=1000), the original image is unrecognizable. The key property of this process is that, for any t, we can sample xₜ directly from x₀ without running all t steps. Writing αₜ = 1-βₜ and ᾱₜ = α₁α₂⋯αₜ:

q(xₜ|x₀) = N(xₜ; √ᾱₜ x₀, (1-ᾱₜ)I)
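
The closed-form jump can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the linear schedule and its endpoints (1e-4 to 0.02) are assumptions, since the text does not fix a schedule, and the helper names and the tiny random "image" are made up for the example.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule. The endpoint values are an assumption, not from the text."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot. t is 1-indexed, as in the formulas."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)   # alpha_bar[t-1] = product of (1 - beta_s) for s <= t

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # stand-in "image" scaled to [-1, 1]

x_mid, _ = q_sample(x0, 500, alpha_bar, rng)    # partly noised
x_T, _ = q_sample(x0, 1000, alpha_bar, rng)     # almost pure noise
signal_left = np.sqrt(alpha_bar[-1])            # weight still carried by x0 at step T
```

Under this assumed schedule, the weight on x₀ at the final step is well below 1%, which is why x_T can be treated as pure noise.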

The Reverse Process: Denoising

The reverse process learns to undo the noise addition step by step, starting from pure noise x_T and progressively cleaning it toward a coherent image x₀. A neural network (typically a UNet for DDPM, or a transformer-based architecture for newer models) is trained to predict the noise that was mixed into x₀ to produce xₜ, given the noisy image xₜ and the timestep t.
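
One way to picture the training signal is as a mean-squared error between the true noise and the predicted noise at a randomly chosen timestep. The sketch below assumes that simple MSE form and a linear noise schedule. `zero_predictor` is a hypothetical placeholder standing in for the network, so the code runs without any deep-learning framework.

```python
import numpy as np

def training_loss(predict_noise, x0, alpha_bar, rng):
    """One Monte Carlo estimate of the noise-prediction (MSE) objective."""
    T = len(alpha_bar)
    t = int(rng.integers(1, T + 1))          # uniform timestep in 1..T
    eps = rng.standard_normal(x0.shape)      # the noise the model should recover
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    eps_hat = predict_noise(x_t, t)
    return float(np.mean((eps - eps_hat) ** 2))

def zero_predictor(x_t, t):
    """Hypothetical placeholder for the network: always predicts zero noise."""
    return np.zeros_like(x_t)

# Assumed linear schedule; the text does not specify one.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
loss = training_loss(zero_predictor, x0, alpha_bar, rng)
```

A predictor that ignores its input scores a loss near 1 (the variance of the noise); training a real network amounts to driving this number down by gradient descent over many images and timesteps.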

During inference, we start with random Gaussian noise and repeatedly apply the learned denoising step, refining the image at each iteration. After T denoising steps, we have a sample from the model's approximation of the distribution of real images: ideally a novel image rather than a copy of anything in the training data.
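
The sampling loop can be sketched as follows, using the DDPM update with the variance choice σₜ² = βₜ. The schedule is again an assumed linear one. Instead of a trained network, the example uses a hypothetical "oracle" predictor for a toy dataset that contains a single constant image, a case where the ideal noise prediction has a closed form, so the loop can be checked end to end.

```python
import numpy as np

def ddpm_sample(predict_noise, shape, betas, rng):
    """Ancestral sampling: start from noise and apply T denoising steps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                    # x_T
    for t in range(len(betas), 0, -1):                # t = T, T-1, ..., 1
        i = t - 1                                     # 0-based index for step t
        eps_hat = predict_noise(x, t)
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps_hat) / np.sqrt(alphas[i])
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(betas[i]) * z              # no fresh noise on the last step
    return x

betas = np.linspace(1e-4, 0.02, 1000)                 # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
target = np.full((1, 4, 4), 0.5)                      # toy dataset: one constant image

def oracle_predict_noise(x_t, t):
    """Exact noise for the one-image toy dataset; stands in for a trained network."""
    a = alpha_bar[t - 1]
    return (x_t - np.sqrt(a) * target) / np.sqrt(1.0 - a)

sample = ddpm_sample(oracle_predict_noise, target.shape, betas, np.random.default_rng(0))
```

With this oracle, the loop lands on the single training image, which is the expected behavior when the data distribution has only one point. A real model replaces `oracle_predict_noise` with the trained network.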

DDIM: Faster Sampling

DDPM's 1000-step denoising is slow. DDIM (Denoising Diffusion Implicit Models) reformulates the process as a non-Markovian chain with the same per-step noise levels, so the same trained network can be sampled on a short subsequence of timesteps, typically 10-50 steps, with minimal quality loss. Most practical diffusion systems use DDIM or its variants rather than the original DDPM sampling schedule.
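
A deterministic DDIM sampler (the η = 0 case) can be sketched as below. Each step first estimates x₀ from the current noise prediction, then re-noises that estimate to the next, lower noise level. The evenly spaced timestep subsequence, the linear schedule, and the one-image oracle predictor are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def ddim_sample(predict_noise, shape, alpha_bar, num_steps, rng):
    """Deterministic DDIM sampling over a subsequence of the T training timesteps."""
    T = len(alpha_bar)
    timesteps = np.linspace(T, 1, num_steps).round().astype(int)   # decreasing, 1-indexed
    x = rng.standard_normal(shape)
    for k, t in enumerate(timesteps):
        a_t = alpha_bar[t - 1]
        # Noise level to land on next; 1.0 means "fully clean" after the last step.
        a_prev = alpha_bar[timesteps[k + 1] - 1] if k + 1 < num_steps else 1.0
        eps_hat = predict_noise(x, t)
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)   # current guess at x_0
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return x

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))   # assumed linear schedule
target = np.full((1, 4, 4), 0.5)                              # toy dataset: one constant image

def oracle_predict_noise(x_t, t):
    """Exact noise for the one-image toy dataset; stands in for a trained network."""
    a = alpha_bar[t - 1]
    return (x_t - np.sqrt(a) * target) / np.sqrt(1.0 - a)

sample = ddim_sample(oracle_predict_noise, target.shape, alpha_bar, 20, np.random.default_rng(0))
```

Here 20 network evaluations replace the 1000 of the full chain. With a real network the result is approximate rather than exact, and the quality trade-off depends on the step count.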

Latent Diffusion: The Key to Scalability

Applying diffusion directly in pixel space is expensive: a 512×512 RGB image has 262,144 pixels, or 786,432 values across its three color channels. Latent Diffusion Models (LDM), the basis of Stable Diffusion, first encode the image into a lower-dimensional latent representation using a pretrained variational autoencoder (VAE), then apply diffusion in this latent space. In Stable Diffusion the VAE downsamples each spatial dimension by 8×, so a 512×512×3 image becomes a 64×64×4 latent with 64× fewer spatial positions and about 48× fewer values; diffusion in this space is dramatically faster.
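
The size difference is easy to check with a few lines of arithmetic. The latent shape used here (4 channels at 64×64 for a 512×512 input) is the Stable Diffusion v1-style configuration and is an assumption about which VAE is in use; other models use different channel counts.

```python
def tensor_size(channels, height, width):
    """Number of scalar values in a channels x height x width tensor."""
    return channels * height * width

pixel_values = tensor_size(3, 512, 512)    # RGB image
latent_values = tensor_size(4, 64, 64)     # assumed SD v1-style latent

spatial_ratio = (512 * 512) // (64 * 64)   # reduction in spatial positions
value_ratio = pixel_values / latent_values # reduction in total values

print(pixel_values, latent_values, spatial_ratio, value_ratio)
# prints: 786432 16384 64 48.0
```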

Text Conditioning

Text-to-image generation requires conditioning the denoising process on a text description. Most systems use CLIP or a T5-family text encoder to convert the text prompt into a sequence of embeddings, which are then injected into the denoising UNet via cross-attention layers. The UNet attends to text features at each denoising step, guiding the image toward the described content.
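
The cross-attention mechanism itself is compact: image features provide the queries, and text embeddings provide the keys and values. The sketch below is a single-head version in NumPy with random weights and arbitrary toy dimensions (all assumptions for illustration). Production models use multiple heads, learned projections, and much larger sizes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_tokens, W_q, W_k, W_v):
    """Image tokens form the queries; text embeddings supply keys and values."""
    Q = image_tokens @ W_q                              # (n_img, d_attn)
    K = text_tokens @ W_k                               # (n_txt, d_attn)
    V = text_tokens @ W_v                               # (n_txt, d_attn)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (n_img, n_txt)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_img, d_img = 16, 32      # toy: 16 spatial positions with 32 features each
n_txt, d_txt = 8, 24       # toy: 8 prompt tokens with 24 features each
d_attn = 16

image_tokens = rng.standard_normal((n_img, d_img))
text_tokens = rng.standard_normal((n_txt, d_txt))
W_q = rng.standard_normal((d_img, d_attn)) / np.sqrt(d_img)
W_k = rng.standard_normal((d_txt, d_attn)) / np.sqrt(d_txt)
W_v = rng.standard_normal((d_txt, d_attn)) / np.sqrt(d_txt)

out, weights = cross_attention(image_tokens, text_tokens, W_q, W_k, W_v)
```

Each row of `weights` is a probability distribution over prompt tokens, so every image position receives its own mixture of text features.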

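Classifier-free guidance (CFG) scales the influence of the text condition. At each step the denoiser is evaluated twice, once with the prompt and once without, and the final prediction extrapolates from the unconditional output toward the conditional one by the CFG scale. With scale 1 the result is simply the conditional prediction, which follows the prompt only loosely; with a scale around 7, the image adheres strongly to the prompt. Higher scales produce more "prompt-accurate" but sometimes less photorealistic images.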
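
In the commonly used convention, the guided prediction is the unconditional prediction plus the scale times the difference between conditional and unconditional predictions. The sketch below assumes that convention; some papers parameterize the scale differently (offset by one). The two-element vectors are made-up values chosen to make the effect visible.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])   # toy prediction without the prompt
eps_cond = np.array([1.0, 1.0])     # toy prediction with the prompt

at_0 = cfg_combine(eps_uncond, eps_cond, 0.0)   # ignores the prompt entirely
at_1 = cfg_combine(eps_uncond, eps_cond, 1.0)   # plain conditional prediction
at_7 = cfg_combine(eps_uncond, eps_cond, 7.0)   # pushed well past the conditional prediction
```

Only the component where the two predictions disagree is amplified, which is why large scales exaggerate prompt-related features and can cost realism.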

FLUX and Flow Matching

FLUX (Black Forest Labs, 2024) is trained with flow matching rather than the noise-prediction diffusion objective. Instead of learning to remove Gaussian noise step by step, the model learns a velocity field that transports samples along straight-line trajectories between the noise distribution and the data distribution. The formulation is mathematically simpler, and straighter trajectories can reach good sample quality with fewer function evaluations. FLUX's architecture also builds on the "multimodal diffusion transformer" (MMDiT) design, which processes text and image tokens jointly and is credited with better text rendering in generated images than earlier architectures.
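
The straight-line idea can be sketched with a linear interpolation between a data sample and a noise sample, a constant velocity target along that line, and Euler integration at sampling time. The time convention (t = 0 is data, t = 1 is noise) is an assumption, since conventions differ between papers, and the one-image oracle velocity field is a hypothetical stand-in for a trained model. This is not FLUX's actual code.

```python
import numpy as np

def interpolate(data, noise, t):
    """Point on the straight line from data (t=0) to noise (t=1), and its velocity."""
    x_t = (1.0 - t) * data + t * noise
    velocity = noise - data                  # regression target, constant along the line
    return x_t, velocity

def euler_sample(velocity_fn, shape, num_steps, rng):
    """Integrate the learned velocity field from t=1 (noise) down to t=0 (data)."""
    x = rng.standard_normal(shape)
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t) * velocity_fn(x, t)   # t_next < t, so this steps toward data
    return x

target = np.full((1, 4, 4), 0.5)             # toy dataset: one constant image

def oracle_velocity(x_t, t):
    """Exact velocity for the one-image toy dataset; stands in for a trained network."""
    return (x_t - target) / t

x_half, v = interpolate(target, np.ones_like(target), 0.5)
sample = euler_sample(oracle_velocity, target.shape, 4, np.random.default_rng(0))
```

Because the path in this toy case is exactly straight, four Euler steps already recover the target. Real learned fields are only approximately straight, so more steps are typically needed.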