Object Classification and Detection in High Dimensional Feature Space

Object classification and detection aim at recognizing and localizing objects in real-world images. They are fundamental computer vision problems and a prerequisite for full scene understanding. Their difficulty lies in the large number of possible object positions and the appearance variations of object classes. This thesis improves upon several classical machine learning algorithms, enabling large computational gains in high dimensional feature space.
A common trend in machine learning and computer vision research is to go large scale. In particular, the advent of huge datasets mined from the Internet and the combination of multiple feature sources have considerably broadened the applications of computer vision. Tasks that were thought impossible a few years ago, such as human action recognition, pose estimation, or automatic outdoor navigation, now seem within reach.
This dissertation is divided into two parts. The first deals with the efficient training of a classifier or detector based on a large number of feature extractors that are outside the control of the learning algorithm and therefore of unknown suitability to the task at hand. More precisely, this part presents two kinds of strategies to accelerate the training of Boosting algorithms in such a context: (a) a method to better deal with the increasingly common case where features come from multiple sources (e.g. color, shape, texture, etc., in the case of images) and can therefore be partitioned into meaningful subsets; (b) new algorithms that, at every Boosting iteration, balance the number of weak learners and the number of training examples examined so as to maximize the expected loss reduction. Experiments in image classification and object recognition on four standard computer vision datasets show that the adaptive techniques we propose outperform both basic sampling and state-of-the-art bandit methods.
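
The abstract does not spell out the proposed algorithms, so the following is only a rough sketch of the "basic sampling" baseline it mentions, under the assumption that this means drawing a uniform random subset of features at each Boosting round and fitting the weak learner on that subset alone. It is plain AdaBoost with decision stumps, not the adaptive method of the thesis, and all names (`best_stump`, `adaboost_sampled`, `features_per_round`) are hypothetical.

```python
import math
import random


def stump_predict(row, feature, threshold, polarity):
    """Decision stump: +polarity if the feature exceeds the threshold, else -polarity."""
    return polarity if row[feature] > threshold else -polarity


def best_stump(X, y, w, feature_ids):
    """Exhaustively pick the stump with the lowest weighted error among feature_ids."""
    best = (float("inf"), None, None, None)
    for f in feature_ids:
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, f, thr, pol) != yi)
                if err < best[0]:
                    best = (err, f, thr, pol)
    return best


def adaboost_sampled(X, y, rounds=10, features_per_round=2, seed=0):
    """AdaBoost where each round only inspects a random subset of the features.

    Assumption: uniform sampling without replacement stands in for the
    "basic sampling" baseline; labels in y must be +1 or -1.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [1.0 / n] * n  # uniform example weights to start
    ensemble = []
    for _ in range(rounds):
        candidates = rng.sample(range(d), min(features_per_round, d))
        err, f, thr, pol = best_stump(X, y, w, candidates)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, thr, pol))
        # Raise the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, f, thr, pol))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Sign of the weighted vote of all stumps."""
    score = sum(alpha * stump_predict(row, f, thr, pol)
                for alpha, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1
```

The cost of a round grows with `features_per_round` times the number of training examples, which is the trade-off between weak learners and examples that the adaptive strategies described above set out to tune at every iteration.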
The second part deals with linear object detectors, currently the most popular class of detection systems, encompassing template matching, deformable part models, poselets, convolutional neural networks (which internally use linear filters), etc. The main bottleneck of many of those systems is the computational cost of the convolutions between the linear filters and the multiple rescalings of the image to process. We make use of properties of the Fourier transform and clever implementation strategies to obtain a speedup factor proportional to the filter size, both during training and at test time. We also introduce a few modifications to the original Deformable Part Model (DPM) of Felzenszwalb et al. that improve its detection accuracy. The gains in performance are demonstrated on the well-known Pascal VOC benchmark, where we establish an increase of one order of magnitude in the speed of these convolutions and an average improvement of 15% in the accuracy of the detector.
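
The Fourier-transform property most likely relied upon here is the convolution theorem: a sliding dot product in the signal domain becomes an element-wise product in the frequency domain. The sketch below illustrates this on a 1D toy signal and is not the thesis implementation, which operates on 2D feature maps with many filters and image scales. It is deliberately dependency-free, so the helper names (`fft`, `correlate_direct`, `correlate_fft`) are hypothetical; a real system would use an optimized FFT library.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def correlate_direct(signal, filt):
    """Sliding dot product over fully overlapping positions: O(N * M) operations."""
    n, m = len(signal), len(filt)
    return [sum(signal[i + j] * filt[j] for j in range(m))
            for i in range(n - m + 1)]


def correlate_fft(signal, filt):
    """Same result via the frequency domain: O(N log N), independent of M."""
    n, m = len(signal), len(filt)
    size = 1
    while size < n + m - 1:  # zero-pad to a power of two to avoid wrap-around
        size *= 2
    spec_signal = fft([complex(x) for x in signal] + [0j] * (size - n))
    spec_filter = fft([complex(x) for x in filt] + [0j] * (size - m))
    # Cross-correlation with a real filter: multiply by the conjugate spectrum.
    product = [a * b.conjugate() for a, b in zip(spec_signal, spec_filter)]
    result = fft(product, inverse=True)
    return [result[i].real / size for i in range(n - m + 1)]
```

For a signal of length N and a filter of length M, the direct method costs on the order of N·M multiplications, whereas the frequency-domain product costs N per filter once the transforms are available. When the image transform is shared across many filters, the saving therefore scales with M, which is consistent with the speedup proportional to the filter size claimed above.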
