Kamusi Pre:D – Lexicon-based source-side predisambiguation for MT and other text processing applications

Benjamin, Martin

2016

Abstract

Kamusi has been developing a system that analyzes texts on the source side and presents users with sense-specified dictionary options. As with a spellchecker, the user selects the intended meaning. We then use a multilingual lexical database to bridge to matching vocabulary in other languages. When paired with Freeling, additional pre-processing is possible for several languages. Integration with MT via Moses and Apertium is planned but not yet undertaken. The treatment of multiword expressions (MWEs) is important. An MWE is lexicalized in the Kamusi database and marked for separability, with a definition and translation equivalents (one or more words) in other languages. When the initial term of an MWE appears in the source text, Pre:D queries the database and scans the sentence for all MWEs that could follow. The user can then select the relevant MWE rather than its component words. A user can also submit a missing sense or MWE for inclusion in the lexicon. Named entities can likewise be identified from data sources or by users and rendered appropriately across languages. When users agree, we will also use their sense-tagged sentences for machine learning. A prototype of the core system is already functional.
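
The MWE lookup described in the abstract can be illustrated with a minimal sketch. It is not the Pre:D implementation: the in-memory LEXICON dictionary stands in for the Kamusi database, and the names LEXICON, match_from, and find_mwe_candidates, the sample entries, and the whitespace tokenization are all assumptions made for illustration. The sketch indexes MWEs by their initial term, and for each token that starts one or more MWEs it scans the rest of the sentence, allowing intervening words only when the entry is marked separable.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the lexicon: initial term -> [(MWE tokens, separable?)].
# The entries are illustrative, not taken from the Kamusi database.
LEXICON: Dict[str, List[Tuple[Tuple[str, ...], bool]]] = {
    "look": [(("look", "up"), True), (("look", "after"), False)],
    "kick": [(("kick", "the", "bucket"), False)],
}


def match_from(tokens: List[str], start: int,
               mwe: Tuple[str, ...], separable: bool) -> Optional[List[int]]:
    """Return the token positions of `mwe` beginning at `start`, or None."""
    positions = [start]
    cursor = start + 1
    for word in mwe[1:]:
        if separable:
            # Separable MWE: other words may intervene, so search forward.
            try:
                idx = tokens.index(word, cursor)
            except ValueError:
                return None
        else:
            # Non-separable MWE: the next token must match exactly.
            if cursor >= len(tokens) or tokens[cursor] != word:
                return None
            idx = cursor
        positions.append(idx)
        cursor = idx + 1
    return positions


def find_mwe_candidates(sentence: str) -> List[Tuple[str, List[int]]]:
    """List every lexicon MWE that could start at some token of the sentence."""
    tokens = sentence.lower().split()  # naive tokenization, an assumption
    candidates = []
    for i, token in enumerate(tokens):
        for mwe, separable in LEXICON.get(token, []):
            positions = match_from(tokens, i, mwe, separable)
            if positions is not None:
                candidates.append((" ".join(mwe), positions))
    return candidates


if __name__ == "__main__":
    print(find_mwe_candidates("She will look the word up"))
```

In a real system the candidates would be offered to the user alongside the senses of the individual words, so that the user can choose the MWE reading or reject it.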

Details

Title Kamusi Pre:D – Lexicon-based source-side predisambiguation for MT and other text processing applications

Author(s) Benjamin, Martin

Pagination 8

Date 2016

Publisher ENeL

Keywords

machine translation; multilingual lexicography; multiword expressions; word sense disambiguation; natural language processing

Note Working Paper for the European Association of e-Lexicography, "Lexicographic data meet computational linguistics and knowledge based systems", COST ENeL WG3 meeting, Brno, Czech Republic, 16-17 September 2016

Laboratories LSIR

Record Appears in Scientific production and competences > I&C - School of Computer and Communication Sciences > IINFCOM > LSIR - Distributed Information Systems Laboratory
Working papers
Work produced at EPFL

Record creation date 2016-10-26
