The Power of Unlabeled Image-Text Pairs
CLIP (Contrastive Language-Image Pre-Training), released by OpenAI in 2021, changed how we think about visual representation learning. Instead of training classifiers on datasets labeled with a fixed set of classes, CLIP learns from 400 million (image, text) pairs scraped from the internet, in which the text describes or discusses the image. The result is a model that represents images in terms of natural language, enabling zero-shot classification over arbitrary categories and cross-modal retrieval.
The Architecture
CLIP has two encoders: a vision encoder (ViT or ResNet) that maps images to embedding vectors, and a text encoder (a Transformer) that maps text to embedding vectors. Both kinds of embedding live in a single shared vector space, trained so that matching image-text pairs have similar embeddings and non-matching pairs have dissimilar ones.
Training uses a contrastive loss: for a batch of N image-text pairs, the model computes the similarity of every image to every text, then learns to maximize the similarity of the N matching pairs and minimize the similarity of the N² - N non-matching pairs. The result is an embedding space in which proximity indicates semantic relevance across modalities.
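A minimal sketch of such a contrastive objective, in plain Python with no dependencies, may make the N versus N² - N bookkeeping concrete. It uses the symmetric cross-entropy formulation given in the CLIP paper's pseudocode. The function names are illustrative, and the fixed temperature of 0.07 is an assumption made for simplicity (CLIP learns this value during training). A real implementation would use an array library and batched matrix operations.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of a softmax over `logits` against the index `target`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of N paired embeddings.

    image_embs[i] and text_embs[i] are assumed to be a matching pair.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # N x N matrix of scaled cosine similarities; the diagonal holds the
    # N matching pairs, the off-diagonal holds the N^2 - N non-matching pairs.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    # Image-to-text: each row should pick out its own caption.
    loss_i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should pick out its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch of two pairs: correctly paired embeddings versus swapped captions.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is near zero when each image is closest to its own caption and grows large when the captions are swapped, which is the signal that pulls matching pairs together and pushes the rest apart.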
Zero-Shot Classification
CLIP's most striking capability is zero-shot image classification. To classify an image into K categories without any category-specific training:
- For each category, create a text prompt: "A photo of a [category]."
- Embed all text prompts and the query image
- Find the text embedding most similar to the image embedding
- The corresponding category is the classification
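The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CLIP's actual API: `embed_text` stands in for a real text encoder, and the toy three-dimensional vectors are hand-written placeholders for what the text and vision encoders would produce.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, categories, embed_text,
                       template="A photo of a {}."):
    """Return the best-matching category and the per-category similarities."""
    # Step 1: build one text prompt per category.
    prompts = [template.format(c) for c in categories]
    # Steps 2-3: embed each prompt and compare it with the image embedding.
    scores = [cosine(image_emb, embed_text(p)) for p in prompts]
    # Step 4: the most similar prompt determines the predicted category.
    best = max(range(len(categories)), key=scores.__getitem__)
    return categories[best], scores

# Hypothetical stand-in for a text encoder: a lookup of hand-written vectors.
TOY_TEXT_EMBEDDINGS = {
    "A photo of a cat.": [0.9, 0.1, 0.0],
    "A photo of a dog.": [0.1, 0.9, 0.0],
    "A photo of a car.": [0.0, 0.1, 0.9],
}

def toy_embed_text(prompt):
    return TOY_TEXT_EMBEDDINGS[prompt]

# Pretend output of the vision encoder for a photo of a cat.
toy_image_emb = [0.8, 0.3, 0.1]
label, scores = zero_shot_classify(toy_image_emb, ["cat", "dog", "car"],
                                   toy_embed_text)
print(label)
```

Because the categories enter only as text, changing the label set means changing the list of strings; no weights are updated.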
On ImageNet, zero-shot CLIP achieves 76.2% top-1 accuracy, matching the original supervised ResNet-50 without training on any of ImageNet's labeled examples. In principle the approach extends to any category that can be described in text, including categories absent from conventional labeled datasets, though accuracy depends on how well the web-scale pre-training data covers the concept.
Applications and Derivatives
CLIP embeddings are widely used for image retrieval (finding images that match a text query), filtering training data (selecting images that match specific criteria), open-vocabulary object detection (grounding arbitrary textual descriptions to image regions), and conditioning text-to-image models (Stable Diffusion v1 conditions on a frozen CLIP text encoder; DALL-E 2 generates images from CLIP image embeddings).
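As a rough illustration of the first two uses, retrieval and data filtering both reduce to scoring precomputed image embeddings against a text embedding. The sketch below assumes the embeddings already exist; the file names, vectors, and the 0.8 threshold are hypothetical values chosen for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_emb, image_index, k=2):
    """Return the k (image_id, score) pairs most similar to the query."""
    scored = [(image_id, cosine(query_emb, emb))
              for image_id, emb in image_index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

def filter_by_similarity(query_emb, image_index, threshold):
    """Keep the ids of images whose similarity to the query meets the threshold."""
    return [image_id for image_id, emb in image_index.items()
            if cosine(query_emb, emb) >= threshold]

# Hypothetical precomputed image embeddings, keyed by file name.
toy_index = {
    "beach.jpg": [0.9, 0.1, 0.1],
    "forest.jpg": [0.1, 0.9, 0.2],
    "harbor.jpg": [0.7, 0.2, 0.4],
}
# Pretend text embedding for a query such as "a sunny coastline".
toy_query = [1.0, 0.0, 0.2]

top = retrieve(toy_query, toy_index, k=2)
kept = filter_by_similarity(toy_query, toy_index, threshold=0.8)
print([image_id for image_id, _ in top], kept)
```

At scale the linear scan would normally be replaced by an approximate nearest-neighbor index, but the scoring rule stays the same.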
Notable related and derivative models include SigLIP (Google, 2023: replaces the softmax-based contrastive loss with a pairwise sigmoid loss for more efficient training), ALIGN (Google, 2021: a contemporaneous model trained on a larger, noisier dataset), OpenCLIP (an open-source reimplementation with multiple model sizes), and EVA-CLIP (an improved ViT backbone). CLIP-family models are foundational infrastructure in modern multimodal AI.