Saliency-based Representations and Multi-component Classifiers for Visual Scene Recognition
Visual scene recognition deals with the problem of automatically recognizing the high-level semantic concept describing a given image as a whole, such as the environment in which the scene is occurring (e.g., a mountain), or the event that is taking place (e.g., a rock climbing event). Scene categories, especially those related to man-made places and events, present high intra-class variability and inter-class similarity, which in turn require robust and discriminative recognition systems. An additional requirement for potential applications, such as vision-based spatial reasoning for mobile robots, is efficiency of the classification procedure. The objective of this thesis is to address these challenges by proposing suitable image representations and classification algorithms. The first part of the thesis focuses on the representation task. We propose a bottom-up image descriptor capturing perceptually coherent structures independently of their position. In particular, our method separately pools features extracted from two perceptually different image regions: the most salient region and the remaining non-salient one. By complementing this Saliency-driven Perceptual Pooling (SPP) with an ad hoc spatial pooling operation, we obtain compact and robust image representations, particularly suited for indoor and sports scenes. The second part of the thesis is concerned with the classification step. We propose an efficient multi-component classification algorithm, named Multiclass Latent Locally Linear SVM (ML3), able to automatically learn a set of sub-categorical linear models for each class, in a principled latent SVM framework. By linearly combining the sub-categorical models with sample- and class-specific weights, ML3 is able to efficiently learn smooth non-linear decision boundaries, competitive with those obtained by Gaussian kernel SVMs.
ML3 also shows very competitive trade-offs between training time and performance, while ensuring high efficiency of the prediction phase. In the last part of the thesis, we use the ML3 algorithm to improve the efficiency and performance of a recently proposed image classification algorithm, named NBNN, designed to cope with classes with large diversity. Specifically, we show how, with a modification of the NBNN scoring function, it is possible to use ML3 to learn a discriminative and compact set of prototypical local features for each class, and thus avoid the extensive Nearest Neighbor search used by NBNN. The resulting algorithm, named NBNL, greatly reduces the memory requirements and testing complexity of NBNN, while significantly improving its performance. The approaches proposed in this thesis effectively exploit the spatial, salient and task-driven structures present in the images, producing compact representations and relatively efficient classification procedures. The SPP representations provide competitive scene recognition performance when coupled with non-linear kernels, while the ML3 algorithm can be used to partially fill the gap between linear and non-linear kernels. Although the performance of NBNN-based methods on scene recognition tasks is still below that obtained by traditional SVM-based approaches, the proposed NBNL algorithm reduces the performance gap, while significantly speeding up the testing phase. Experiments on three publicly available scene recognition datasets (MIT-Indoor-67, 15-Scenes and UIUC-Sports) show the value of the proposed approaches.
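To make the cost of the Nearest Neighbor search concrete, the following is a minimal sketch of the baseline NBNN decision rule as it is commonly described: each local descriptor of the test image is matched to its nearest descriptor in every class, and the class with the smallest summed squared distance is predicted. This is not the NBNL algorithm proposed in the thesis, and it is not taken from the thesis itself; the function names (`sq_dist`, `nbnn_scores`, `nbnn_predict`) and the toy 2-D descriptors are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nbnn_scores(query: List[Vector],
                class_descriptors: Dict[str, List[Vector]]) -> Dict[str, float]:
    """Sum, over the query's local descriptors, of the squared distance
    to the nearest stored descriptor of each class (lower is better)."""
    scores = {}
    for label, pool in class_descriptors.items():
        # Brute-force nearest neighbor: every query descriptor is compared
        # against every stored descriptor of the class.
        scores[label] = sum(min(sq_dist(d, p) for p in pool) for d in query)
    return scores


def nbnn_predict(query: List[Vector],
                 class_descriptors: Dict[str, List[Vector]]) -> str:
    """Predict the class with the smallest image-to-class distance."""
    scores = nbnn_scores(query, class_descriptors)
    return min(scores, key=scores.get)


# Toy data (hypothetical): two classes with two stored descriptors each.
train = {
    "indoor": [(0.0, 0.0), (0.0, 1.0)],
    "sports": [(5.0, 5.0), (6.0, 5.0)],
}
query_image = [(0.1, 0.2), (5.0, 5.1), (0.0, 0.9)]
predicted = nbnn_predict(query_image, train)
```

In this sketch every class must keep all of its training descriptors, and prediction cost grows with the total number of stored descriptors. That is the memory and testing cost that a compact set of learned prototypes per class, as described above for NBNL, is meant to avoid.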