Conference paper

Small Languages, Big Data: Multilingual Computational Tools and Techniques for the Lexicography of Endangered Languages

The Kamusi Project, a multilingual online dictionary website, has as one of its goals to document the lexicons of en-dangered and less-resourced languages (LRLs). provides a unified platform and repository for this kind of data that is both simple to use and free to researchers and the public. Since Kamusi has a separate entry for each homophone or polyseme, it can be used to produce sophisticated multilingual dictionaries. We have recently been confronting issues inherent in contact language-based lexi-cography, especially the elicitation of culturally-specific semantic terms, which cannot be obtained through fieldwork purely reliant on a contact language. To address this, we have designed a system of “balloons.” Based on a variety of fac-tors, balloons raise the likelihood of re-vealing terms and fields that have partic-ular relevance within a culture, rather than perpetuating linguistic bias toward the concerns and artifacts of more power-ful groups. Kamusi has also developed a smartphone application which can be used for crowdsourcing contributions and validation. It will also be invaluable in gathering oral data from speakers of en-dangered languages for the production of monolingual talking dictionaries. The first of these projects is planned for the Arrernte language in central Australia.

Related material