Research Paper Summary

You Only Look Once: Unified, Real-Time Object Detection

by Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi

Abstract

YOLO frames object detection as a single regression problem, predicting bounding boxes and class probabilities directly from full images in one network evaluation. This unified approach runs in real time at 45 frames per second. Compared with other state-of-the-art detectors, YOLO makes more localization errors but is less likely to predict false positives on background.

Key Highlights

Real-time object detection at 45 FPS (base model) and 155 FPS (Fast YOLO).
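Unified architecture: the whole detection pipeline is a single network, so it can be optimized end to end directly on detection performance.
Outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains such as artwork.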
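
The frame rates above translate directly into a per-frame time budget, which is often the more useful number when sizing a real-time pipeline. The following is a small arithmetic sketch; the helper name `frame_time_ms` is hypothetical, and the result is simply the reciprocal of throughput, so it ignores any capture, transfer, or batching overhead.

```python
def frame_time_ms(fps):
    """Average time available per frame, in milliseconds, at a given throughput."""
    return 1000.0 / fps


# Frame rates as reported in the summary above.
for name, fps in [("YOLO", 45), ("Fast YOLO", 155)]:
    print(f"{name}: {frame_time_ms(fps):.1f} ms per frame")
# YOLO: 22.2 ms per frame
# Fast YOLO: 6.5 ms per frame
```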

Methodology

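The YOLO model divides the input image into an S x S grid. Each grid cell predicts B bounding boxes, a confidence score for each box, and C conditional class probabilities, so the network output is an S x S x (B*5 + C) tensor. A single convolutional neural network produces all of these predictions, which allows the model to be trained and evaluated end to end.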
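
The grid formulation can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The values S = 7, B = 2, C = 20 and the 448 x 448 input size are the settings the paper uses for PASCAL VOC. The function names (`output_size`, `decode_box`, `class_scores`) and the sample prediction are hypothetical. The sketch assumes the paper's box parameterization: (x, y) is the box center as an offset within its grid cell, and (w, h) are fractions of the full image.

```python
# Minimal sketch of YOLO's grid encoding (illustrative, not the reference code).
# S, B, C follow the paper's PASCAL VOC settings; function names are hypothetical.

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes


def output_size(s, b, c):
    """Values predicted per image: each of the s*s cells emits
    b boxes (x, y, w, h, confidence) plus c class probabilities."""
    return s * s * (b * 5 + c)


def decode_box(row, col, x, y, w, h, img_w, img_h, s=S):
    """Convert one cell-relative prediction to pixel corner coordinates.

    (x, y): box center as an offset in [0, 1] within cell (row, col).
    (w, h): box size as a fraction of the whole image.
    Returns (x_min, y_min, x_max, y_max) in pixels.
    """
    cx = (col + x) / s * img_w
    cy = (row + y) / s * img_h
    bw = w * img_w
    bh = h * img_h
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)


def class_scores(box_confidence, class_probs):
    """Class-specific confidence: conditional class probability
    multiplied by the box confidence."""
    return [p * box_confidence for p in class_probs]


print(output_size(S, B, C))  # 7 * 7 * 30 = 1470
# A box centered in the middle cell of a 448 x 448 image (sample values):
print(decode_box(3, 3, 0.5, 0.5, 0.25, 0.5, 448, 448))
```

Predicting the center relative to the cell and the size relative to the image keeps every regression target in the range 0 to 1, regardless of where the object sits in the image.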

Results and Key Findings

Mean Average Precision (mAP) of 63.4% on the PASCAL VOC 2007 dataset.
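Less than half as many background errors (false positives on background regions) as Fast R-CNN, at the cost of more localization errors.
Generalizes well from natural images to artwork, as evaluated on the Picasso and People-Art datasets.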
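
Detection metrics such as the mAP figure above rest on intersection over union (IoU) between a predicted box and a ground-truth box. PASCAL VOC counts a detection as correct when its IoU with a ground-truth box exceeds 0.5. The following is a minimal sketch, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples; the function name `iou` is hypothetical, and the sketch covers only the overlap measure, not the full mAP computation.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max). Returns a value in [0, 1].
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped to zero if disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


# Two 10 x 10 boxes offset by 5 pixels: overlap area 50, union area 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```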

Applications and Impacts

YOLO's speed and simplicity make it ideal for real-time applications such as autonomous driving, video surveillance, and robotic vision systems.

Conclusion

YOLO unifies the separate stages of object detection into a single, fast network that is trained end to end. Its ability to generalize to new domains and its real-time speed have made it a cornerstone of computer vision research.