Integrating Language Identification to improve Multilingual Speech Recognition

The process of determining the language of a speech utterance is called Language Identification (LID). This task can be very challenging as it has to take into account various language-specific aspects, such as phonetic, phonotactic, vocabulary and grammar-related cues. In multilingual speech recognition we try to find the most likely word sequence that corresponds to an utterance where the language is not known a priori. This is a considerably harder task compared to monolingual speech recognition and it is common to use LID to estimate the current language. In this project we present two general approaches for LID and describe how to integrate them into multilingual speech recognizers. The first approach uses hierarchical multilayer perceptrons to estimate language posterior probabilities given the acoustics in combination with hidden Markov models. The second approach evaluates the output of a multilingual speech recognizer to determine the spoken language. The research is applied to the MediaParl speech corpus that was recorded at the Parliament of the canton of Valais, where people switch from Swiss French to Swiss German or vice versa. Our experiments show that, on that particular data set, LID can be used to significantly improve the performance of multilingual speech recognizers. We will also point out that ASR dependent LID approaches yield the best performance due to higher-level cues and that our systems perform much worse on non-native data

Related material