SAM and the Democratization of Image Segmentation

The Segmentation Problem

Image segmentation, the task of identifying which pixels belong to which object, has historically required large amounts of labeled training data specific to each application domain. A model trained on medical images generally transfers poorly to satellite imagery, and one trained on indoor scenes tends to fail on microscopy slides. This data requirement has been a major barrier to deploying segmentation in new applications.

The Segment Anything Model (SAM), released by Meta AI Research in 2023, addressed this barrier by building a foundation model for segmentation: one designed to generalize to new domains and objects without retraining.

SAM's Architecture

SAM consists of three components: an image encoder (a heavyweight Vision Transformer, or ViT, that runs once per image), a prompt encoder (lightweight, for interactive prompts), and a mask decoder (a lightweight transformer that outputs segmentation masks). The image encoder produces a feature representation of the entire image. The prompt encoder handles point, box, or mask prompts that indicate what to segment; the paper also explores free-form text prompts as a proof of concept, but the released model does not accept text. The mask decoder combines the image features with the encoded prompt to generate the actual segmentation mask.
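
To make the prompt side of this concrete, here is a minimal sketch of how point and box prompts might be turned into a small set of "sparse tokens" before decoding. It follows the paper's description that a box is represented by its two corners, but everything else is an assumption made for illustration: the class names, the `to_sparse_tokens` function, and the string tags stand in for the learned embeddings and positional encodings that the real prompt encoder uses.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class PointPrompt:
    """A click in pixel coordinates (hypothetical type, for illustration)."""
    x: float
    y: float
    foreground: bool = True  # False marks a region that should be excluded


@dataclass(frozen=True)
class BoxPrompt:
    """A bounding box given by its top-left and bottom-right corners."""
    x0: float
    y0: float
    x1: float
    y1: float


Prompt = Union[PointPrompt, BoxPrompt]
# (x in [0, 1], y in [0, 1], tag). The tag stands in for a learned embedding.
Token = Tuple[float, float, str]


def to_sparse_tokens(prompts: List[Prompt], width: int, height: int) -> List[Token]:
    """Convert prompts to resolution-independent tokens (toy version)."""
    tokens: List[Token] = []
    for p in prompts:
        if isinstance(p, PointPrompt):
            tag = "fg_point" if p.foreground else "bg_point"
            tokens.append((p.x / width, p.y / height, tag))
        elif isinstance(p, BoxPrompt):
            # A box contributes two tokens, one per corner.
            tokens.append((p.x0 / width, p.y0 / height, "box_top_left"))
            tokens.append((p.x1 / width, p.y1 / height, "box_bottom_right"))
        else:
            raise TypeError(f"unsupported prompt type: {type(p).__name__}")
    return tokens


tokens = to_sparse_tokens(
    [PointPrompt(50, 25), BoxPrompt(0, 0, 100, 50)], width=100, height=50
)
```

Normalizing coordinates by image size keeps the tokens independent of resolution, which is one plausible reason to separate prompt handling from the image encoder.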

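The separation of heavy (image encoding) from light (prompt encoding and mask decoding) is key for interactive use: the expensive image embedding is computed once, and each subsequent prompt only needs the lightweight components. The paper reports that, given a precomputed embedding, the prompt encoder and mask decoder run in roughly 50 ms in a web browser on CPU.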
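
The encode-once, prompt-many pattern can be sketched with a toy class. This is not SAM: the "encoder" below just copies the image and the "decoder" thresholds intensity around the clicked pixel. The class name `ToyPromptableSegmenter`, its methods, and the `tolerance` parameter are hypothetical and exist only to show where the expensive step is cached.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]   # grayscale intensities, indexed as image[y][x]
Mask = List[List[bool]]


class ToyPromptableSegmenter:
    """Toy stand-in that separates a heavy per-image step from light per-prompt steps."""

    def __init__(self, tolerance: int = 10) -> None:
        self.tolerance = tolerance
        self.encoder_calls = 0            # counts how often the heavy step runs
        self._features: Optional[Image] = None

    def set_image(self, image: Image) -> None:
        """Heavy step (stand-in): run once per image and cache the result."""
        self.encoder_calls += 1
        self._features = [row[:] for row in image]

    def predict(self, point: Tuple[int, int]) -> Mask:
        """Light step: reuse the cached features to answer one point prompt."""
        if self._features is None:
            raise RuntimeError("call set_image() before predict()")
        x, y = point
        reference = self._features[y][x]
        return [
            [abs(value - reference) <= self.tolerance for value in row]
            for row in self._features
        ]


segmenter = ToyPromptableSegmenter()
segmenter.set_image([[0, 0, 200],
                     [0, 200, 200]])
dark_mask = segmenter.predict((0, 0))    # click on the dark region
bright_mask = segmenter.predict((2, 0))  # click on the bright region
```

Two prompts are answered here while `encoder_calls` stays at 1, which is the property that makes interactive use practical when the encoder is far more expensive than the decoder.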

The SA-1B Dataset

SAM's capabilities come largely from scale: it was trained on SA-1B, a dataset of more than 1 billion segmentation masks across 11 million images. The dataset was assembled through a "data engine" in which progressively more capable versions of SAM helped annotate images, moving from human annotators assisted by the model to fully automatic mask generation. The paper reports that SA-1B contains roughly 400 times more masks than any previously available segmentation dataset.
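
One step of a model-in-the-loop annotation pipeline can be sketched as a routing decision: masks the model is confident about are accepted automatically, and the rest go to human annotators. This is a generic illustration and not Meta's actual pipeline; the `Proposal` type, the `route_proposals` function, and the 0.9 threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Proposal:
    """A mask proposed by the current model (hypothetical record)."""
    image_id: str
    mask_id: int
    confidence: float  # model's own quality estimate in [0, 1]


def route_proposals(
    proposals: Iterable[Proposal], auto_accept_at: float = 0.9
) -> Tuple[List[Proposal], List[Proposal]]:
    """Split proposals into (auto-accepted, needs human review)."""
    accepted: List[Proposal] = []
    needs_review: List[Proposal] = []
    for proposal in proposals:
        if proposal.confidence >= auto_accept_at:
            accepted.append(proposal)
        else:
            needs_review.append(proposal)
    return accepted, needs_review


batch = [
    Proposal("img_001", 0, 0.97),
    Proposal("img_001", 1, 0.62),
    Proposal("img_002", 0, 0.91),
]
accepted, needs_review = route_proposals(batch)
```

As the model improves between rounds, a larger share of proposals would clear the threshold, so the human effort per image shrinks while the dataset keeps growing.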

Zero-Shot Generalization

SAM was tested on 23 diverse segmentation datasets it had never seen during training, ranging from underwater footage to microscopy to satellite imagery. When prompted with a single point and without any fine-tuning, it outperformed RITM, a strong interactive segmentation baseline, on 16 of the 23 datasets. This zero-shot performance suggests genuine generalization rather than task-specific optimization.
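
Comparisons like this are usually scored with intersection over union (IoU) between a predicted mask and the ground-truth mask, averaged over a dataset (the SAM paper reports mean IoU). The following is a minimal sketch of that metric on boolean masks stored as nested lists. The function names are illustrative, and treating two empty masks as a perfect match is a convention chosen here, not something taken from the paper.

```python
from typing import List, Sequence, Tuple

Mask = List[List[bool]]


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two same-shaped boolean masks."""
    if len(pred) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, truth)
    ):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention (assumption): two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, ground truth) pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(mask_iou(pred, truth) for pred, truth in pairs) / len(pairs)


pred = [[True, True], [False, False]]
truth = [[True, False], [False, False]]
empty = [[False, False], [False, False]]
score = mask_iou(pred, truth)                            # 1 shared pixel / 2 in union
average = mean_iou([(pred, truth), (empty, empty)])
```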

SAM 2: Video Segmentation

SAM 2 (2024) extended SAM to video: given a prompt in any frame, SAM 2 propagates the segmentation through the entire video, handling occlusion and reappearance. This is significantly harder than per-frame segmentation because objects change appearance and get occluded, so the model must carry a memory of the object across frames. At release, SAM 2 reported state-of-the-art results on video object segmentation benchmarks while keeping the interactive, promptable interface of the original SAM.
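
The role of memory can be illustrated with a deliberately simple tracker. It is not how SAM 2 works internally: here the "memory" is a single number (the object's last observed mean intensity), and segmentation is a threshold around it. The function names and the `tolerance` parameter are hypothetical. The sketch still shows the two behaviors described above: the memory is refreshed as the object's appearance drifts, and it is retained through frames where the object is hidden so the object can be found again.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]   # grayscale intensities, indexed as frame[y][x]
Mask = List[List[bool]]


def segment_frame(frame: Frame, reference: float, tolerance: float) -> Mask:
    """Mark pixels whose intensity is close to the remembered appearance."""
    return [[abs(v - reference) <= tolerance for v in row] for row in frame]


def mean_under_mask(frame: Frame, mask: Mask) -> Optional[float]:
    """Mean intensity of masked pixels, or None if the mask is empty."""
    values = [
        v
        for f_row, m_row in zip(frame, mask)
        for v, selected in zip(f_row, m_row)
        if selected
    ]
    return sum(values) / len(values) if values else None


def propagate(
    frames: List[Frame], prompt_frame: int, point: Tuple[int, int],
    tolerance: float = 20,
) -> List[Mask]:
    """Propagate a point prompt forward and backward from prompt_frame."""
    x, y = point
    initial_memory = float(frames[prompt_frame][y][x])
    masks: List[Optional[Mask]] = [None] * len(frames)

    def run(indices: range) -> None:
        memory = initial_memory
        for i in indices:
            mask = segment_frame(frames[i], memory, tolerance)
            masks[i] = mask
            seen = mean_under_mask(frames[i], mask)
            if seen is not None:
                memory = seen  # object visible: refresh its appearance
            # Otherwise the object is occluded; keep the old memory.

    run(range(prompt_frame, len(frames)))   # forward in time
    run(range(prompt_frame - 1, -1, -1))    # backward in time
    return [m for m in masks if m is not None]


# One-row video: the object moves right, brightens, vanishes, then reappears.
video = [[[100, 0, 0]], [[0, 110, 0]], [[0, 0, 0]], [[0, 0, 115]]]
masks = propagate(video, prompt_frame=0, point=(0, 0))
```

In this toy video the third frame yields an empty mask, yet the object is picked up again in the fourth frame because the remembered intensity survives the gap.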