Infoscience

Thesis

Discourse-level features for statistical machine translation

Machine Translation (MT) has progressed tremendously in the past two decades. The rule-based and interlingua approaches have been superseded by statistical models, which learn the most likely translations from large parallel corpora. System design does not amount anymore to crafting syntactical transfer rules, nor does it rely on a semantic representation of the text. Instead, a statistical MT system learns the most likely correspondences and re-ordering of chunks of source words and target words from parallel corpora that have been word-aligned. With this procedure and millions of parallel source and target language sentences, systems can generate translations that are intelligible and require minimal post-editing efforts from the human user. Nevertheless, it has been recognized that the statistical MT paradigm may fall short of modeling a number of linguistic phenomena that are established beyond the phrase level. Research in statistical MT has addressed discourse phenomena explicitly only in the past four years. When it comes to textual coherence structure, cohesive ties relate sentences and entire paragraphs argumentatively to each other. This text structure has to be rendered appropriately in the target text so that it conveys the same meaning as the source text. The lexical and syntactical means through which these cohesive markers are expressed may diverge considerably between languages. Frequently, these markers include discourse connectives, which are function words such as however, instead, since, while, which relate spans of text to each other, e.g. for temporal ordering, contrast or causality. Moreover, to establish the same temporal ordering of events described in a text, the conjugation of verbs has to be coherently translated. The present thesis proposes methods for integrating discourse features into statistical MT. We pre-process the source text prior to automatic translation, focusing on two specific discourse phenomena: discourse connectives and verb tenses. Hand-crafted rules are not required in our proposal; instead, machine learning classifiers are implemented that learn to recognize discourse relations and predict translations of verb tenses. Firstly, we have designed new sets of semantically-oriented features and classifiers to advance the state of the art in automatic disambiguation of discourse connectives. We hereby profited from our multilingual setting and incorporated features that are based on MT and on the insights we gained from contrastive linguistic analysis of parallel corpora. In their best configurations, our classifiers reach high performances (0.7 to 1.0 F1 score) and can therefore reliably be used to automatically annotate the large corpora needed to train SMT systems. Issues of manual annotation and evaluation are discussed as well, and solutions are provided within new annotation and evaluation procedures. As a second contribution, we implemented entire SMT systems that can make use of the (automatically) annotated discourse information. Overall, the thesis confirms that these techniques are a practical solution that leads to global improvements in translation in ranges of 0.2 to 0.5 BLEU score. Further evaluation reveals that in terms of connectives and verb tenses, our statistical MT systems improve the translation of these phenomena in ranges of up to 25%, depending on the performance of the automatic classifiers and on the data sets used.

Related material