Learning to Detect Objects with Minimal Supervision

Many classes of objects can now be detected reliably with statistical machine learning techniques. Faces, cars, and pedestrians have all been detected with low error rates by learning their appearance in a highly generic manner from extensive training sets. These advances have enabled the use of reliable object detection components in real systems, such as the automatic face-focusing functions of digital cameras. One key drawback of these methods, and the issue addressed here, is their prohibitive requirement for training sets containing thousands of manually annotated examples. We present three methods that make headway toward reducing labeling requirements and, in turn, toward a tractable solution to the general detection problem.

First, we propose a new learning strategy for object detection. The proposed scheme forgoes the need to train a collection of detectors dedicated to homogeneous families of poses, and instead learns a single classifier with the inherent ability to deform based on the signal of interest. We train a detector with a standard AdaBoost procedure using combinations of pose-indexed features and pose estimators. This allows the learning process to select and combine various estimates of the pose with features able to compensate for variations in pose, without the need to label pose data for training or to explore the pose space at test time. We validate our framework on three types of data: hand video sequences, aerial images of cars, and face images. We compare our method to a standard Boosting framework with access to the same ground truth, and show a reduction in the false alarm rate of up to an order of magnitude. Where possible, we compare our method to the state of the art, which requires pose annotations of the training data, and demonstrate comparable performance.

Second, we propose a new learning method that exploits temporal consistency to learn a complex appearance model from a sparsely labeled training video.
Our approach consists of iteratively improving an appearance-based model built with a Boosting procedure, and reconstructing the trajectories corresponding to the motion of multiple targets. We demonstrate the efficiency of our procedure by learning a pedestrian detector from videos and a cell detector from microscopy image sequences. In both cases, our method reduces the labeling requirement by one to two orders of magnitude. We show that in some instances, our method trained with sparse labels on a video sequence outperforms a standard learning procedure trained with the fully labeled sequence.

Third, we propose a new active learning procedure that exploits the spatial structure of image data and queries entire scenes or frames of a video rather than individual examples. We extend the Query by Committee approach, allowing it to characterize the most informative scenes to be selected for labeling. We show that an aggressive procedure with zero tolerance for target localization error performs as well as more sophisticated strategies that take into account the trade-off between missed detections and localization error. Finally, we combine this method with our two approaches above and demonstrate that the resulting algorithm can perform car detection from a small set of annotated images, as well as pedestrian detection from a handful of labeled video frames.
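The first contribution, boosting over combinations of pose estimators and pose-indexed features, can be illustrated on toy data. The following is a minimal sketch under simplifying assumptions (1-D signals standing in for images, decision-stump weak learners, and hypothetical `argmax_pose` / `smoothed_pose` estimators), not the thesis's actual implementation; note that the latent pose is never annotated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": 1-D signals; positives contain a bump at a latent,
# never-annotated pose (an offset along the signal).
def make_sample(positive):
    x = rng.normal(0.0, 0.3, size=32)
    if positive:
        pose = rng.integers(4, 28)       # latent pose, unknown to the learner
        x[pose - 2:pose + 3] += 2.0
    return x

y = np.where(rng.random(200) < 0.5, 1, -1)
X = np.array([make_sample(label == 1) for label in y])

# Hypothetical pose estimators: each guesses the target location.
def argmax_pose(x):
    return int(np.argmax(x))

def smoothed_pose(x):
    return int(np.argmax(np.convolve(x, np.ones(5) / 5, mode="same")))

POSE_ESTIMATORS = [argmax_pose, smoothed_pose]

def pose_indexed_feature(x, estimator, offset):
    # Read the signal at an offset RELATIVE to the estimated pose.
    i = min(max(estimator(x) + offset, 0), len(x) - 1)
    return x[i]

def train_adaboost(X, y, rounds=10):
    # Each weak learner is a stump over one (estimator, offset) pair, so
    # boosting jointly selects pose estimates and compensating features.
    n = len(X)
    w = np.ones(n) / n
    strong = []
    for _ in range(rounds):
        best = None
        for est in POSE_ESTIMATORS:
            for off in range(-3, 4):
                f = np.array([pose_indexed_feature(x, est, off) for x in X])
                for thr in np.linspace(f.min(), f.max(), 10):
                    for pol in (1, -1):
                        pred = np.where(pol * (f - thr) > 0, 1, -1)
                        err = w[pred != y].sum()
                        if best is None or err < best[0]:
                            best = (err, est, off, thr, pol, pred)
        err, est, off, thr, pol, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        strong.append((alpha, est, off, thr, pol))
    return strong

def predict(strong, x):
    s = sum(a * (1 if pol * (pose_indexed_feature(x, est, off) - thr) > 0 else -1)
            for a, est, off, thr, pol in strong)
    return 1 if s > 0 else -1
```

The key design point is that the pose estimate is recomputed per example inside the feature, so a single strong classifier deforms with the signal instead of requiring one detector per pose family.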
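The second contribution alternates between fitting an appearance model on the current labels and reconstructing trajectories that supply pseudo-labels on unlabeled frames. A minimal sketch on a synthetic 1-D "video" with a single target; the averaged template and the nearest-neighbor consistency check are illustrative stand-ins for the Boosting-based model and multi-target trajectory reconstruction:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy video: a target moves smoothly along a 1-D axis; each frame is a signal.
T, W = 40, 64
track = np.clip(np.cumsum(rng.normal(0, 1.0, T)).astype(int) + W // 2, 3, W - 4)
video = rng.normal(0, 0.5, (T, W))
for t in range(T):
    video[t, track[t] - 1:track[t] + 2] += 1.5   # target appearance

sparse_labels = {0: track[0], T - 1: track[T - 1]}   # only two labeled frames

def train_template(labels):
    # "Appearance model": average patch around the (pseudo-)labeled positions.
    patches = [video[t, p - 1:p + 2] for t, p in labels.items()]
    return np.mean(patches, axis=0)

def detect(template, frame):
    # Correlate the template with the frame; return the best location.
    scores = np.array([frame[i - 1:i + 2] @ template for i in range(1, W - 1)])
    return int(np.argmax(scores)) + 1

labels = dict(sparse_labels)
for _ in range(3):                 # alternate model fitting / trajectory steps
    template = train_template(labels)
    dets = [detect(template, video[t]) for t in range(T)]
    # Trajectory reconstruction (crude): keep only detections consistent
    # with their temporal neighbors, and promote them to pseudo-labels.
    labels = dict(sparse_labels)
    for t in range(1, T - 1):
        if abs(dets[t] - dets[t - 1]) <= 4 and abs(dets[t] - dets[t + 1]) <= 4:
            labels[t] = dets[t]
```

Starting from just two labeled frames, each iteration grows the pseudo-label set, which is how temporal consistency substitutes for dense manual annotation.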
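The third contribution extends Query by Committee from individual examples to whole scenes or frames: a committee is trained on the frames labeled so far, and the next frame queried is the one whose candidate windows the committee disagrees on most. A sketch with hypothetical helpers (`make_frame`, least-squares committee members, vote-entropy disagreement), chosen only to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: each "frame" holds candidate windows, each a feature vector,
# with labels revealed only once the whole frame is queried.
def make_frame(n_windows=20, d=5):
    Xw = rng.normal(size=(n_windows, d))
    yw = (Xw[:, 0] + 0.3 * rng.normal(size=n_windows) > 0).astype(int)
    return Xw, yw

frames = [make_frame() for _ in range(30)]
labeled = [0, 1]                       # indices of frames labeled so far
unlabeled = list(range(2, 30))

def fit_linear(X, y):
    # Least-squares linear scorer: a stand-in committee member.
    w, *_ = np.linalg.lstsq(X, 2 * y - 1, rcond=None)
    return w

def committee(labeled, k=7):
    X = np.vstack([frames[i][0] for i in labeled])
    y = np.concatenate([frames[i][1] for i in labeled])
    members = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample
        members.append(fit_linear(X[idx], y[idx]))
    return members

def frame_disagreement(members, Xw):
    votes = np.array([(Xw @ w > 0).astype(int) for w in members])
    p = votes.mean(axis=0)
    # Vote entropy per window; the frame's score is its maximum over windows.
    ent = -(p * np.log(p + 1e-12) + (1 - p) * np.log(1 - p + 1e-12))
    return float(ent.max())

def query_next(labeled, unlabeled):
    members = committee(labeled)
    scores = {i: frame_disagreement(members, frames[i][0]) for i in unlabeled}
    return max(scores, key=scores.get)

nxt = query_next(labeled, unlabeled)
```

Scoring whole frames rather than single windows matches the annotation cost model of the thesis: an annotator labels an entire scene or video frame at a time.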

Fua, Pascal
Fleuret, François
Lausanne, EPFL
Other identifiers:
urn: urn:nbn:ch:bel-epfl-thesis5310-4


 Record created 2012-02-02, last modified 2020-04-20

