Infoscience

Report

Unsupervised Learning for Information Distillation

Current document archives are enormous and constantly growing, which makes it practically impossible to use them efficiently. Information distillation techniques analyze and interpret large volumes of speech and text from these archives, in multiple languages, and produce structured information of interest to the user. Accessing the key information in response to a request (query) requires special text processing techniques such as distillation: filtering methods that extract the important portions of documents relevant to a query, called snippets, as concisely and as correctly as possible. In the context of the GALE Project, queries are matched to several predefined templates. Template 1, the main focus of this work, corresponds to listing facts about an {EVENT}. Answering Template 1 questions is similar to extracting general passages with an IR engine. In this work, we implement an iterative unsupervised method to answer the queries of this template. The goal of unsupervised learning is to improve the performance of the classifier in terms of error rate and F-measure without relying on prior annotated data. The approach uses only highly confident features, such as word transcriptions extracted from the query as well as their synonyms. After forming a bootstrap model from these features, the model is improved by self-training, iterating over consecutive runs. Our results indicate that the proposed methods improve the performance of the system by more than 12.5% relative in terms of classification error and 31% relative in terms of F-measure.
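The bootstrap-and-self-train loop described above can be sketched as follows. This is a minimal illustrative toy, not the system from the report: the actual work operates on speech and text transcriptions with a trained classifier, whereas here snippets are word lists, the "model" is just a feature set seeded with query terms and synonyms, and the confidence score is a simple overlap count. All names and thresholds are assumptions for illustration.

```python
def self_train(snippets, seed_features, iterations=3, threshold=2):
    """Bootstrap from high-confidence query features (query words and
    their synonyms), then iteratively expand the feature set from
    snippets that are labeled relevant with high confidence."""
    features = set(seed_features)
    labels = {}  # snippet index -> True (relevant) / False (not relevant)
    for _ in range(iterations):
        for i, words in enumerate(snippets):
            score = len(set(words) & features)  # crude confidence measure
            if score >= threshold:
                labels[i] = True
                # Self-training step: absorb the vocabulary of
                # confidently relevant snippets into the model.
                features |= set(words)
            elif score == 0:
                labels[i] = False
            # Snippets with score in between stay unlabeled this round.
    return labels, features

# Toy corpus: two snippets about the same {EVENT}, one unrelated.
snippets = [
    ["earthquake", "struck", "city", "friday"],
    ["quake", "damage", "reported", "city"],
    ["stock", "market", "closed", "higher"],
]
# Seed features: query terms plus a synonym ("quake" for "earthquake").
labels, features = self_train(snippets, ["earthquake", "quake", "city"])
```

Here the first snippet matches two seed features and is labeled relevant; its words then join the model, after which the second snippet is also labeled relevant in the same pass, while the unrelated third snippet never matches and is labeled irrelevant.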
