CLIP: Connecting Vision and Language

The Power of Unlabeled Image-Text Pairs

CLIP (Contrastive Language-Image Pre-Training), from OpenAI in 2021, changed how we think about visual representation learning. Instead of training classifiers on labeled datasets, CLIP learns from 400 million (image, text) pairs scraped from the internet — pairs where a caption describes or discusses an image. The result: a model that understands images in terms of natural language, enabling zero-shot classification of arbitrary categories and cross-modal retrieval.

The Architecture

CLIP has two encoders: a vision encoder (a ViT or ResNet) that maps images to embedding vectors, and a text encoder (a transformer) that maps text to embedding vectors. Both sets of embeddings are projected into a single shared vector space and trained so that matching image-text pairs have high cosine similarity and non-matching pairs have low cosine similarity.
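
To make the shared space concrete, here is a minimal sketch in plain Python. The feature vectors and projection matrices are made-up toy values standing in for real encoder outputs and learned weights, and the names (project, W_image, W_text) are illustrative rather than taken from any CLIP codebase. It shows how features of different widths can be mapped into one space and compared with cosine similarity.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(features, weights):
    """Linear projection: features (length d_in) times weights (d_in x d_out)."""
    d_out = len(weights[0])
    return [sum(f * row[j] for f, row in zip(features, weights)) for j in range(d_out)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Hypothetical encoder outputs of different widths (toy stand-ins for
# vision-encoder and text-encoder features).
image_features = [0.5, 1.0, -0.5]      # width 3
text_features = [1.0, 0.0, 2.0, 1.0]   # width 4

# Hypothetical projection matrices into a shared 2-dimensional space.
W_image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_text = [[0.5, 0.0], [0.0, 1.0], [0.0, 0.5], [0.0, 0.0]]

image_embedding = project(image_features, W_image)
text_embedding = project(text_features, W_text)

# Both embeddings now have the same width, so they can be compared directly.
similarity = cosine_similarity(image_embedding, text_embedding)
print(image_embedding, text_embedding, round(similarity, 4))
```

In a real model the encoders and projections are learned jointly; only the final comparison step resembles this sketch.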

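Training uses a symmetric contrastive loss. For a batch of N image-text pairs, the model computes the cosine similarity of all N × N image-text combinations, then learns to maximize the similarity of the N matching pairs and minimize the similarity of the N² - N non-matching pairs. Concretely, it applies a cross-entropy loss over the temperature-scaled similarities in both the image-to-text and text-to-image directions. The result is an embedding space in which proximity reflects semantic relevance across modalities.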
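
The batch objective can be sketched in a few lines of plain Python. This is an illustrative toy, not CLIP's training code: it assumes the embeddings are already unit-length, that pair i in the image list matches pair i in the text list, and it uses a fixed temperature of 0.07 where CLIP learns the temperature during training. The function name clip_style_loss is hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the N x N similarity matrix.

    Assumes unit-length embeddings and that image i matches text i.
    The temperature is fixed here for simplicity (an assumption of this sketch).
    """
    n = len(image_embs)
    # logits[i][j] = scaled similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text: each row should put its probability mass on the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: same idea over the columns.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch of N = 2 pairs with 2-dimensional unit embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # text i matches image i
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]   # text i matches the wrong image

loss_aligned = clip_style_loss(images, aligned_texts)
loss_swapped = clip_style_loss(images, swapped_texts)
print(loss_aligned, loss_swapped)
```

The loss is near zero when matching pairs sit on the diagonal of the similarity matrix and large when they do not, which is the signal that pulls matching pairs together and pushes the N² - N non-matching pairs apart.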

Zero-Shot Classification

CLIP's most striking capability is zero-shot image classification. To classify an image into K categories without any category-specific training:

1. For each category, create a text prompt: "A photo of a [category]."
2. Embed all text prompts and the query image.
3. Find the text embedding most similar to the image embedding.
4. The corresponding category is the classification.
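
The steps above can be sketched as follows. The embeddings are hand-written toy vectors and toy_embed_text is a hypothetical lookup standing in for a real text encoder, so the example only illustrates the control flow (prompt construction, embedding, nearest-prompt selection), not real model behavior.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_embedding, categories, embed_text):
    """Return the category whose prompt embedding is closest to the image.

    embed_text is any callable mapping a prompt string to a vector; in
    practice it would be the text encoder of a CLIP-style model.
    """
    # Step 1: build one prompt per category.
    prompts = [f"A photo of a {category}." for category in categories]
    # Step 2: embed the prompts and normalize everything to unit length.
    text_embeddings = [l2_normalize(embed_text(p)) for p in prompts]
    image_embedding = l2_normalize(image_embedding)
    # Step 3: cosine similarity between the image and each prompt.
    scores = [
        sum(a * b for a, b in zip(image_embedding, text_embedding))
        for text_embedding in text_embeddings
    ]
    # Step 4: the best-scoring prompt gives the predicted category.
    best = max(range(len(categories)), key=scores.__getitem__)
    return categories[best], scores

# Toy stand-in for a text encoder (hypothetical values, for illustration only).
TOY_TEXT_EMBEDDINGS = {
    "A photo of a cat.": [0.9, 0.1, 0.0],
    "A photo of a dog.": [0.1, 0.9, 0.0],
    "A photo of a car.": [0.0, 0.1, 0.9],
}

def toy_embed_text(prompt):
    return TOY_TEXT_EMBEDDINGS[prompt]

# Toy image embedding that lies closest to the "dog" prompt.
toy_image_embedding = [0.2, 0.8, 0.1]

label, scores = zero_shot_classify(
    toy_image_embedding, ["cat", "dog", "car"], toy_embed_text
)
print(label, [round(s, 3) for s in scores])
```

Because the category list is just a list of strings, changing the label set requires no retraining, only new prompts.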

On ImageNet, the best zero-shot CLIP model achieves 76.2% top-1 accuracy, comparable to a supervised ResNet-50, without training on a single labeled ImageNet example. The same procedure applies to any category that can be described in text, including categories absent from standard labeled datasets, although accuracy varies by domain and tends to be weaker on fine-grained or highly specialized tasks.

Applications and Derivatives

CLIP embeddings are widely used for image retrieval (finding images that match a text query), training-data filtering (selecting images that match specific criteria), and open-vocabulary object detection (grounding arbitrary textual descriptions to image regions). CLIP components also appear inside text-to-image models: Stable Diffusion uses a CLIP text encoder to condition generation, and DALL-E 2 generates images from CLIP embeddings.
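
Retrieval and data filtering both reduce to ranking or thresholding cosine similarities. The sketch below uses a toy gallery of hand-written 2-dimensional embeddings; the file names, the query vector, and the 0.9 threshold are all hypothetical, and a real system would use embeddings from a CLIP-style model and typically an approximate nearest-neighbor index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_embedding, gallery, k=2):
    """Return the k (image_id, score) pairs most similar to the query."""
    scored = [
        (image_id, cosine(query_embedding, embedding))
        for image_id, embedding in gallery.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def filter_by_similarity(query_embedding, gallery, threshold):
    """Keep only the images whose similarity to the query meets the threshold."""
    return [
        image_id
        for image_id, embedding in gallery.items()
        if cosine(query_embedding, embedding) >= threshold
    ]

# Hypothetical precomputed image embeddings.
toy_gallery = {
    "beach.jpg": [0.9, 0.1],
    "forest.jpg": [0.1, 0.9],
    "coast.jpg": [0.8, 0.3],
}
# Hypothetical embedding of a text query such as "a sunny beach".
toy_query = [1.0, 0.0]

top_matches = retrieve(toy_query, toy_gallery, k=2)
kept = filter_by_similarity(toy_query, toy_gallery, threshold=0.9)
print(top_matches, kept)
```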

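Notable related and derivative models include SigLIP (Google, 2023), which replaces the softmax contrastive loss with a pairwise sigmoid loss to improve training efficiency; ALIGN (Google, 2021), a contemporaneous dual-encoder model trained on a larger, noisier dataset; OpenCLIP, an open-source reimplementation with multiple model sizes; and EVA-CLIP, which uses an improved ViT backbone. CLIP-family models are foundational infrastructure in modern multimodal AI.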