000146078 001__ 146078
000146078 005__ 20190316234722.0
000146078 037__ $$aTHESIS_LIB
000146078 245__ $$aMachine Learning Approaches to Text Representation using Unlabeled Data
000146078 269__ $$a2006
000146078 260__ $$bEcole Polytechnique Fédérale de Lausanne$$c2006
000146078 336__ $$aTheses
000146078 500__ $$aIDIAP-RR 06-76
000146078 520__ $$aWith the rapid expansion in the use of computers for producing digitized textual documents, automatic systems for organizing and retrieving the information contained in large databases have become essential. In general, information retrieval systems rely on a formal description or representation of documents that enables their automatic processing. In the most common representation, the so-called bag-of-words, documents are represented by the words composing them, and two documents (or a user query and a document) are considered similar if they have many words in common. In this representation, documents with different but semantically related terms are considered unrelated, and documents using the same terms in different contexts are seen as similar. It follows quite naturally that information retrieval systems can use the huge amount of existing textual documents to "learn", as humans do, the different uses of words depending on context. This information can be used to enrich the representation of documents. In this thesis we develop several original machine learning approaches that attempt to fulfill this aim. As a first approach to document representation, we propose a probabilistic model in which documents are assumed to be generated from a mixture of distributions over themes, modeled by a hidden variable conditioning a multinomial distribution over words. Simultaneously, words are assumed to be drawn from a mixture of distributions over topics, modeled by a second hidden variable dependent on the themes. As a second approach, we propose a neural network trained to score the appropriateness of a word in a given context. Finally, we present a multi-task learning approach that is trained jointly to solve an information retrieval task while learning from unlabeled data to improve its representation of documents.
000146078 700__ $$aKeller, Mikaela
000146078 8564_ $$uhttp://publications.idiap.ch/downloads/papers/2006/keller-phd-2006.pdf$$zURL
000146078 8564_ $$uhttp://publications.idiap.ch/index.php/publications/showcite/keller:rr06-76$$zRelated documents
000146078 8564_ $$uhttps://infoscience.epfl.ch/record/146078/files/keller-phd-2006.pdf$$s989021
000146078 909C0 $$xU10381$$0252189$$pLIDIAP
000146078 909CO $$qGLOBAL_SET$$pSTI$$ooai:infoscience.tind.io:146078
000146078 970__ $$akeller:phd:2006/LIDIAP
000146078 973__ $$sPUBLISHED$$aOTHER
000146078 980__ $$aTHESIS