Learning the structure of image collections with latent aspect models

The approach to indexing an image collection depends on the type of data to organize. Satellite images are likely to be searched by latitude and longitude coordinates, medical images are often searched with an example image that serves as a visual query, and personal image collections are generally browsed by event. A more general retrieval scenario relies on textual keywords to search for images containing a specific object or representing a given scene type. This requires the manual annotation of each image in the collection to allow relevant visual information to be retrieved from a text query. This time-consuming and subjective process is the current price to pay for reliable and convenient text-based image search. This dissertation investigates the use of probabilistic models to assist the automatic organization of image collections, attempting to link the visual content of digital images with a potential textual description. Relying on robust, patch-based image representations that have proven to capture a variety of visual content, our work proposes to model images as mixtures of latent aspects. These latent aspects are defined by multinomial distributions that capture the patch co-occurrence information observed in the collection. An image is not represented by the direct count of its constituent elements, but as a mixture of latent aspects that can be estimated with principled, generative, unsupervised learning methods. An aspect-based image representation therefore incorporates contextual information from the whole collection that can be exploited. This emerging concept is explored for several fundamental tasks related to image retrieval - namely classification, clustering, segmentation, and annotation - in what represents one of the first coherent and comprehensive studies of the subject.
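To make the idea concrete, the following is a minimal numpy sketch of EM estimation for a pLSA-style latent aspect model, the classical instance of the mixture-of-aspects formulation described above. The function name, toy data, and hyperparameters are illustrative, not taken from the dissertation: each image is a row of visual-word (patch) counts, and the model learns per-image aspect mixtures P(z|d) and per-aspect multinomials P(w|z).

```python
import numpy as np

def plsa(counts, n_aspects, n_iter=50, seed=0):
    """Fit a pLSA-style latent aspect model with EM (illustrative sketch).

    counts: (n_images, n_visual_words) matrix of patch counts n(d, w).
    Returns P(z|d), shape (n_images, n_aspects), the aspect mixture of
    each image, and P(w|z), shape (n_aspects, n_visual_words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random multinomial initializations, normalized row-wise.
    p_z_d = rng.random((n_docs, n_aspects))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_aspects, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d, w) ∝ P(z|d) P(w|z), shape (D, K, W).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        resp = joint / joint.sum(axis=1, keepdims=True).clip(1e-12)
        # M-step: re-estimate both multinomials from n(d, w) P(z|d, w).
        weighted = counts[:, None, :] * resp
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```

On a toy collection where two groups of images use disjoint visual vocabularies, the learned P(z|d) assigns each group a distinct dominant aspect, which is exactly the co-occurrence context the abstract refers to.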
We first investigate the possibility of classifying images based on their estimated aspect mixture weights, interpreting latent aspect modeling as an unsupervised feature extraction process. Several image categorization tasks are considered, in which images are classified according to the objects they contain or to their global scene type. We demonstrate that the concept of latent aspects makes it possible to take advantage of non-labeled data to infer a robust image representation that achieves a higher classification performance than the original patch-based representation. Secondly, further exploring the concept, we show that aspects can induce a meaningful soft clustering of an image collection that can serve as a browsing structure. Images can be ranked given an aspect, visually illustrating the corresponding co-occurrence context. Thirdly, we derive a principled method that relies on latent aspects to classify image patches into different categories, producing an image segmentation based on the resulting spatial class densities. We finally propose to model images and their captions with a single aspect model, merging the co-occurrence contexts of the visual and textual modalities in different ways. Once a model has been learned, the distribution of words given an unseen image is inferred from its visual representation and serves as a textual index. Overall, we demonstrate with extensive experiments that the co-occurrence context captured by latent aspects is suitable for the above tasks, making it a promising approach for multimedia indexing.
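The annotation step described last can be sketched as follows, under the standard pLSA "fold-in" assumption: the aspect multinomials learned on the training collection are kept fixed, EM is run only over the aspect mixture of the unseen image, and caption words are then predicted through the shared aspects. All names and the toy distributions below are illustrative; `p_w_z` holds the learned visual-word multinomials P(w|z) and `p_word_z` the caption-word multinomials P(word|z) of a joint visual-textual aspect model.

```python
import numpy as np

def fold_in(counts_new, p_w_z, n_iter=50, seed=0):
    """Infer the aspect mixture P(z|d_new) of an unseen image by EM,
    keeping the learned visual-word distributions P(w|z) fixed."""
    rng = np.random.default_rng(seed)
    n_aspects = p_w_z.shape[0]
    p_z = rng.random(n_aspects)
    p_z /= p_z.sum()
    for _ in range(n_iter):
        # E-step: P(z|d_new, w) ∝ P(z|d_new) P(w|z), shape (K, W).
        joint = p_z[:, None] * p_w_z
        resp = joint / joint.sum(axis=0, keepdims=True).clip(1e-12)
        # M-step: only the mixture weights of the new image are updated.
        p_z = (resp * counts_new).sum(axis=1)
        p_z /= p_z.sum()
    return p_z

def annotate(counts_new, p_w_z, p_word_z):
    """Predict the caption-word distribution of an unseen image:
    P(word|d_new) = sum_z P(word|z) P(z|d_new)."""
    p_z = fold_in(counts_new, p_w_z)
    return p_z @ p_word_z
```

The resulting distribution over caption words is what serves as a textual index: its highest-probability words can be attached to the image and matched against text queries.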

Related material