Small Languages, Big Data: Multilingual Computational Tools and Techniques for the Lexicography of Endangered Languages

The Kamusi Project, a multilingual online dictionary website, has as one of its goals to document the lexicons of endangered and less-resourced languages (LRLs). Kamusi provides a unified platform and repository for this kind of data that is both simple to use and free to researchers and the public. Since Kamusi has a separate entry for each homophone or polyseme, it can be used to produce sophisticated multilingual dictionaries. We have recently been confronting issues inherent in contact language-based lexicography, especially the elicitation of culturally specific semantic terms, which cannot be obtained through fieldwork purely reliant on a contact language. To address this, we have designed a system of “balloons.” Based on a variety of factors, balloons raise the likelihood of revealing terms and fields that have particular relevance within a culture, rather than perpetuating linguistic bias toward the concerns and artifacts of more powerful groups. Kamusi has also developed a smartphone application which can be used for crowdsourcing contributions and validation. The application will also be invaluable in gathering oral data from speakers of endangered languages for the production of monolingual talking dictionaries. The first of these projects is planned for the Arrernte language in central Australia.

Good, Jeff
Hirschberg, Julia
Rambow, Owen
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, 15-23
52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland, USA, June 22-27, 2014
Stroudsburg, PA, USA, Association for Computational Linguistics

