Machine Learning Approaches to Text Representation using Unlabeled Data

With the rapid expansion in the use of computers to produce digitized textual documents, automatic systems for organizing and retrieving the information contained in large databases have become essential. In general, information retrieval systems rely on a formal description, or representation, of documents that enables their automatic processing. In the most common representation, the so-called bag-of-words, documents are represented by the words composing them, and two documents (or a user query and a document) are considered similar if they share a high number of words. Under this representation, documents that use different but semantically related terms are considered unrelated, while documents that use the same terms in different contexts are seen as similar. It is therefore natural for information retrieval systems to use the huge amount of existing textual documents to "learn", as humans do, the different uses of words depending on context. This information can then be used to enrich the representation of documents. In this thesis we develop several original machine learning approaches that attempt to fulfill this aim. As a first approach to document representation, we propose a probabilistic model in which documents are assumed to be generated from a mixture of distributions over themes, modeled by a hidden variable conditioning a multinomial distribution over words. Simultaneously, words are assumed to be drawn from a mixture of distributions over topics, modeled by a second hidden variable dependent on the themes. As a second approach, we propose a neural network that is trained to score the appropriateness of a word in a given context. Finally, we present a multi-task learning approach that is trained jointly to solve an information retrieval task while learning from unlabeled data to improve its representation of documents.
