Working paper

Kamusi Pre:D – Lexicon-based source-side predisambiguation for MT and other text processing applications

Kamusi has been developing a system that analyzes texts on the source side and presents users with sense-specified dictionary options. As with a spellchecker, the user selects the intended meaning. We then use a multilingual lexical database to bridge to matching vocabulary in other languages. When the system is paired with Freeling, additional pre-processing is possible for several languages. Integration with MT via Moses and Apertium is planned but not yet undertaken. The treatment of multiword expressions (MWEs) is important. Each MWE is lexicalized in the Kamusi database and marked for separability, with a definition and translation equivalents (one or more words) in other languages. When the initial term of an MWE appears in the source text, Pre:D queries the database and scans the sentence for all MWEs that could follow. The user can then select the relevant MWE rather than its component words. A user can also submit a missing sense or MWE for inclusion in the lexicon. Named entities can likewise be identified from data sources or by users and rendered appropriately across languages. Where users agree, we will also use their sense-tagged sentences for machine learning. A prototype of the core system is already functional.
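
The MWE lookup described above can be illustrated with a minimal sketch. This is not the Pre:D implementation: the in-memory `LEXICON`, the `MweEntry` type, and the functions `match_from` and `find_mwe_candidates` are hypothetical names introduced only for illustration, and a dictionary stands in for the database query. The sketch also assumes that a separable MWE may have any number of intervening words, that an inseparable one must be contiguous, and that matching is on lowercased surface forms with no lemmatization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MweEntry:
    """Hypothetical lexicon record for one multiword expression."""
    words: tuple[str, ...]  # component words, in order
    separable: bool         # True if other words may intervene


# Toy stand-in for the lexical database, indexed by the initial word of each MWE.
LEXICON = {
    "take": [
        MweEntry(("take", "off"), separable=True),
        MweEntry(("take", "care", "of"), separable=False),
    ],
    "kick": [MweEntry(("kick", "the", "bucket"), separable=False)],
}


def match_from(tokens, start, entry):
    """Return the token positions that realize `entry` from `start`, or None."""
    positions = [start]
    cursor = start + 1
    for word in entry.words[1:]:
        if entry.separable:
            # Separable: the next component may appear anywhere later on.
            try:
                found = tokens.index(word, cursor)
            except ValueError:
                return None
        else:
            # Inseparable: the next component must be the very next token.
            if cursor >= len(tokens) or tokens[cursor] != word:
                return None
            found = cursor
        positions.append(found)
        cursor = found + 1
    return positions


def find_mwe_candidates(sentence):
    """List (MWE, token positions) pairs that could be offered to the user."""
    tokens = sentence.lower().split()
    candidates = []
    for i, token in enumerate(tokens):
        # Only an MWE's initial term triggers a lexicon lookup.
        for entry in LEXICON.get(token, []):
            positions = match_from(tokens, i, entry)
            if positions is not None:
                candidates.append((" ".join(entry.words), positions))
    return candidates


print(find_mwe_candidates("She will take her shoes off"))
```

Under these assumptions, the separable entry "take off" is found in the example sentence even though two words intervene, while "kick the bucket" would not be offered for "kick the old bucket". A fuller version would match on lemmas rather than surface forms, so that inflected forms such as "took" also trigger the lookup.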

Related material