Indoor Scene Recognition using Task and Saliency-driven Feature Pooling
Indoor scenes are characterized by high intra-class variability, mainly due to the intrinsic variety of the objects they contain and to the drastic image variations caused by even small viewpoint changes. One of the main trends in the literature has been to employ representations that couple statistical characterizations of the image with a description of their spatial distribution. This is usually done by combining multiple representations of different image regions, most often using a fixed 4x4 grid or a pyramidal image-partitioning scheme. While these encodings are able to capture the spatial regularities of the problem, they are ill-suited to handling its spatial variability. In this work we propose to complement a traditional spatial-encoding scheme with a bottom-up approach designed to discover visual structures regardless of their exact position in the scene. To this end we use saliency maps to segment each image into two regions: the most salient 50% and the least salient 50%. This segmentation provides a description of the images that is loosely related to the relative semantics of the discovered regions, complementing the canonical spatial description. We evaluated the proposed technique on three public scene recognition datasets. Our results show this approach to be effective in the indoor scenario, while also being meaningful for other scene categorization tasks.
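The saliency-driven split described in the abstract can be illustrated with a minimal sketch. The abstract does not give the implementation, so everything here is an assumption made for illustration: the function name `saliency_split_histograms` is hypothetical, local features are assumed to be already quantized into visual-word indices, each feature is assumed to carry one saliency value read from the saliency map at its location, and pooling is assumed to be a plain L1-normalized bag-of-words histogram per region.

```python
from typing import List, Sequence


def saliency_split_histograms(
    word_ids: Sequence[int],
    saliency: Sequence[float],
    vocab_size: int,
) -> List[float]:
    """Pool visual words separately over the most and least salient halves.

    Hypothetical helper: word_ids[i] is the visual-word index of local
    feature i, and saliency[i] is the saliency value at its location.
    Returns the two L1-normalized histograms concatenated
    (most salient half first, least salient half second).
    """
    if len(word_ids) != len(saliency):
        raise ValueError("word_ids and saliency must have the same length")

    # Rank features by decreasing saliency. sorted() is stable, so ties
    # keep their original order and the split is deterministic.
    order = sorted(range(len(saliency)), key=lambda i: -saliency[i])

    # Split the ranking in two: the most salient 50% and the least salient 50%.
    half = len(order) // 2
    regions = (order[:half], order[half:])

    descriptor: List[float] = []
    for region in regions:
        # Count visual-word occurrences inside this region only.
        hist = [0.0] * vocab_size
        for i in region:
            hist[word_ids[i]] += 1.0
        # L1-normalize so both regions contribute on the same scale.
        total = sum(hist)
        descriptor.extend(h / total if total > 0 else 0.0 for h in hist)
    return descriptor


# Toy example: four local features, vocabulary of three visual words.
example = saliency_split_histograms(
    word_ids=[0, 1, 2, 2],
    saliency=[0.9, 0.8, 0.1, 0.2],
    vocab_size=3,
)
```

Under these assumptions the descriptor depends only on how salient each feature is, not on where it lies in the image. That is the sense in which such a pooling would complement, rather than replace, a grid or pyramid encoding.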
Fornoni_BMVC_2012.pdf