The Object Detection Problem
Object detection is the task of locating every object of interest in an image, drawing a bounding box around each one, and assigning it a class label. It is one of the most practically important computer vision tasks: it underlies autonomous driving, security systems, medical imaging, inventory management, and many other applications. The field has undergone several architectural revolutions since 2014.
Two-Stage Detectors: Precision First
Two-stage detectors (Faster R-CNN, Mask R-CNN) first propose candidate object regions, then classify and refine them. Stage 1: a Region Proposal Network (RPN) slides over the backbone's feature map and, at each location, scores a fixed set of reference boxes (anchors) of different sizes and aspect ratios to produce candidate bounding boxes. Stage 2: an RoI pooling operation (RoIAlign in Mask R-CNN) extracts a fixed-size feature for each proposal, which is then classified and box-refined. This approach is highly accurate but slower, because the second stage runs once per proposal.
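The sliding-window proposal step can be illustrated with a short sketch of anchor generation. This is not the RPN itself, which is a learned network that scores and adjusts these boxes; it only shows how the reference boxes are laid out. The function name generate_anchors and its parameters are hypothetical; the default scales and aspect ratios follow the nine-anchor setup described in the Faster R-CNN paper, and the stride of 16 is an assumed backbone downsampling factor.

```python
from itertools import product

def generate_anchors(feat_h, feat_w, stride,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) tuples in image pixels.

    One anchor per (scale, ratio) pair is centred on every feature-map cell.
    """
    anchors = []
    for row, col in product(range(feat_h), range(feat_w)):
        # Centre of this feature-map cell, projected back onto the image.
        cx = (col + 0.5) * stride
        cy = (row + 0.5) * stride
        for scale, ratio in product(scales, ratios):
            # ratio is height / width; the box area stays at scale ** 2.
            w = scale / ratio ** 0.5
            h = scale * ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# A tiny 2x3 feature map with stride 16 (assumed values for illustration).
anchors = generate_anchors(feat_h=2, feat_w=3, stride=16)
print(len(anchors))  # 2 * 3 cells * 9 anchors per cell = 54
```

Even this toy feature map yields 54 candidates; a real feature map yields thousands, which is why the RPN keeps only the highest-scoring proposals for stage 2.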
YOLO: Real-Time Detection
YOLO (You Only Look Once) prioritizes speed by treating detection as a single regression problem. The image is divided into an S×S grid; each grid cell predicts B bounding boxes, a confidence score for each box, and class probabilities. This single-pass approach achieves real-time detection speeds (30-100+ FPS) that two-stage detectors have generally been unable to match. The tradeoff: historically lower accuracy, especially on small objects.
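The grid formulation can be made concrete with a small sketch. It assumes the original YOLOv1 layout, where each box carries five numbers (x, y, w, h, confidence) and the class probabilities are shared per cell; later YOLO versions use different output heads. The helper names are hypothetical.

```python
def yolo_output_shape(S, B, C):
    """Shape of the prediction tensor for an S x S grid (YOLOv1-style layout)."""
    # Each of the B boxes has 5 values: x, y, w, h, confidence.
    # The C class probabilities are predicted once per cell.
    return (S, S, B * 5 + C)

def responsible_cell(cx, cy, img_w, img_h, S):
    """Grid cell (row, col) containing the box centre (cx, cy), given in pixels."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centres on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centres on the bottom edge
    return row, col

# YOLOv1 settings on PASCAL VOC: S=7, B=2, C=20.
print(yolo_output_shape(S=7, B=2, C=20))  # (7, 7, 30)
print(responsible_cell(cx=300, cy=100, img_w=448, img_h=448, S=7))  # (1, 4)
```

Because only the cell containing an object's centre is responsible for predicting it, several small objects whose centres fall in the same cell compete for the same B boxes, which is one reason for the weaker small-object accuracy noted above.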
YOLOv8 (Ultralytics, 2023) and its successors YOLOv9, YOLOv10, and YOLO11 have dramatically narrowed the accuracy gap with two-stage detectors while maintaining excellent speed. On COCO, the largest YOLOv9 variant (YOLOv9-E) reports 55.6% AP, with a reported 53.6 FPS on an A100, which is competitive with much slower approaches.
DETR: Detection as Set Prediction
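DETR (DEtection TRansformer, 2020) reformulated object detection as a set prediction problem solved with a transformer encoder-decoder. There are no anchor boxes and no non-maximum suppression: the model directly outputs a fixed-size set of predictions, each of which is either an object or "no object". During training, predictions are matched one-to-one with ground-truth boxes by bipartite matching, so duplicate detections are penalized rather than filtered out afterwards. The simplicity is appealing, but training is expensive and slow: the original model needed hundreds of epochs on COCO to converge.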
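The one-to-one assignment behind set prediction can be sketched as follows. This is a simplification under stated assumptions: the cost here is only 1 - IoU, whereas DETR's matching cost also includes class probability and an L1 box term, and the search is brute force over permutations as a stand-in for the Hungarian algorithm (in practice, scipy.optimize.linear_sum_assignment solves the same problem efficiently). The function names are hypothetical.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match(preds, targets):
    """Map each target index to a distinct prediction index, minimising
    the total (1 - IoU) cost. Assumes len(preds) >= len(targets)."""
    best_cost, best = float("inf"), ()
    # Brute force: try every ordered choice of len(targets) predictions.
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(1.0 - iou(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return {t: p for t, p in enumerate(best)}

preds = [(0, 0, 10, 10), (50, 50, 60, 60), (20, 20, 30, 30)]
targets = [(21, 21, 31, 31), (1, 1, 11, 11)]
assignment = match(preds, targets)
print(assignment)  # {0: 2, 1: 0}
```

Prediction 1 is left unmatched, so in a DETR-style loss it would be trained toward the "no object" class. Because each ground-truth box is claimed by exactly one prediction, the model has no incentive to emit duplicates.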
DETR variants (Deformable DETR, DINO-DETR, RT-DETR) have addressed the training speed problem and significantly improved performance. RT-DETR (Real-Time DETR) with a ResNet-50 backbone reports 53.1% AP on COCO at 108 FPS on a T4 GPU, showing that transformer-based detection can be both accurate and fast.
Open-Vocabulary and Grounding Models
The frontier in 2026 is open-vocabulary detection: detecting objects described in natural language rather than drawn from a fixed category list. Grounding DINO, Grounding DINO 1.5, and similar models can detect "the person wearing a red hat" or "the broken circuit board" without any category-specific training. This enables new applications such as instruction-following visual agents and zero-shot quality control systems.
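One common way to think about open-vocabulary scoring is to compare an embedding of each candidate region with an embedding of the text query. The sketch below is a toy illustration of that idea only: the embeddings are made-up three-dimensional vectors, the function names and the threshold are hypothetical, and models such as Grounding DINO fuse image and text features inside the network rather than comparing precomputed vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def detect_by_text(regions, text_embedding, threshold=0.8):
    """Return (box, score) pairs for regions whose embedding is close to the
    text embedding, best match first."""
    hits = []
    for box, embedding in regions:
        score = cosine(embedding, text_embedding)
        if score >= threshold:
            hits.append((box, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Made-up region embeddings; a real system would get these from a vision model.
regions = [
    ((10, 10, 50, 90), [0.9, 0.1, 0.0]),
    ((60, 20, 120, 80), [0.1, 0.9, 0.2]),
]
# Made-up text embedding standing in for "the person wearing a red hat".
query = [1.0, 0.0, 0.0]
hits = detect_by_text(regions, query)
print([box for box, _ in hits])  # [(10, 10, 50, 90)]
```

Changing the query vector changes which regions are returned without retraining anything, which is the property that makes zero-shot use cases possible.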