An Ad Hoc Information Retrieval Perspective on PLSI through Language Model Identification
Ten years ago, PLSI opened the way to probabilistic latent semantic representations of documents. It has led to a number of applications in different fields, including ad hoc Information Retrieval. However, inherent limitations hinder its use on documents not seen during learning. This paper proposes a new document–query similarity for PLSI, based on language modeling, that allows queries to be used in PLSI without the usual folding-in phase. We compare this similarity to Fisher kernels, the state-of-the-art approach for PLSI. In this context, we complete the study of the impact of the Fisher Information Matrix and of how the latent-topic and word components contribute to kernel performance. We furthermore present an evaluation of PLSI with learning performed on a corpus of over one million word occurrences, taken from the TREC–AP evaluation collection, a particularly large corpus for parameter estimation in the PLSI framework.