An Ad Hoc Information Retrieval Perspective on PLSI through language model identification

Authors: Chappelier, Jean-Cedric; Eckard, Emmanuel
Date: 2009-07-06
Year: 2009
DOI: 10.1007/978-3-642-04417-5_36
URL: https://infoscience.epfl.ch/handle/20.500.14299/41085
WOS:000271806000035

Ten years ago, PLSI opened the road to probabilistic latent semantic representations of documents. It led to a number of applications in different fields, including ad hoc Information Retrieval. However, inherent limitations hinder its use on documents not seen during learning. This paper proposes a new document–query similarity for PLSI based on language modeling that allows queries to be used in PLSI without the usual folding-in phase. We compare this similarity to Fisher kernels, the state-of-the-art approach for PLSI. In this perspective, we complete the study of the impact of the Fisher Information Matrix, and of how latent-topics and word components contribute to the kernel performance. We furthermore present an evaluation of PLSI with learning performed on a corpus of over one million word occurrences, coming from the TREC–AP evaluation collection, a particularly large corpus for parameter estimation in the PLSI framework.

Keywords: PLSI; Information retrieval; Language modelling
Type: text::conference output::conference proceedings::conference paper