CV-420: Vision Transformers and Foundation Models

Course Description

Vision transformers and the foundation models built on them. ViT, DeiT, and Swin Transformer: patch embedding, 2D positional encoding, hierarchical designs. CLIP: contrastive pretraining, zero-shot classification, image-text alignment. DINO and DINOv2: self-supervised learning with self-distillation. Segment Anything Model (SAM): promptable segmentation, architecture, zero-shot transfer. Florence, InternViT, and other vision foundation models.
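
As a taste of the first topic, patch embedding can be sketched in a few lines of NumPy. This is an illustrative sketch, not course material: the function name `patchify`, the 224x224 input, the patch size of 16, and the embedding width of 192 are all assumptions chosen for the example, and the random projection and positional table stand in for parameters that a real model would learn.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))    # stand-in for a normalized RGB image
patches = patchify(image, 16)                 # 14 * 14 = 196 patches of 16 * 16 * 3 = 768 values

embed_dim = 192                               # hypothetical embedding width
projection = rng.standard_normal((768, embed_dim)) * 0.02   # learned in practice
pos_embed = rng.standard_normal((196, embed_dim)) * 0.02    # one vector per patch position

# Token sequence fed to the transformer: linear projection plus positional embedding.
tokens = patches @ projection + pos_embed
print(patches.shape, tokens.shape)
```

Each row of `tokens` corresponds to one image patch in row-major order. The positional table is what lets the model recover the 2D layout that flattening discards.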