Course Description
Multimodal systems that process both images and text. Contrastive vision-language models (VLMs) such as CLIP and ALIGN. Instruction-following VLMs such as LLaVA and InstructBLIP. Commercial multimodal APIs such as GPT-4V and Claude 3. Architecture patterns: early vs. late fusion, cross-attention, projection layers. Applications: visual question answering (VQA), document understanding, image captioning, chart analysis. Evaluating multimodal models: benchmarks and failure modes.