An Ad Hoc Information Retrieval Perspective on PLSI through language model identification

Chappelier, Jean-Cedric; Eckard, Emmanuel

doi:10.1007/978-3-642-04417-5_36

Chappelier, Jean-Cedric; Eckard, Emmanuel

2009

Formats

Format
BibTeX
MARCXML
TextMARC
MARC
DublinCore
EndNote
NLM
RefWorks
RIS

Abstract

Ten years ago, PLSI opened the road to probabilistic latent semantic representations of documents. It led to a number of applications in different ﬁelds, including ad hoc Information Retrieval. However, inherent limitations hinder its use on documents not seen during learning. This paper proposes a new document–query similarity for PLSI based on language modeling that allows queries to be used in PLSI without the usual folding-in phase. We compare this similarity to Fisher kernels, the state-of-the-art approach for PLSI. In this perspective, we complete the study of the impact of the Fisher Information Matrix, and of how latent-topics and word components contribute to the kernel performance. We furthermore present an evaluation of PLSI with learning performed on a corpus of over one million word occurrences, coming from the TREC–AP evaluation collection, a particularly large corpus for parameter estimation in the PLSI framework.