Abstract

State-of-the-art acoustic models for Automatic Speech Recognition (ASR) are based on Hidden Markov Models (HMM) and Deep Neural Networks (DNN) and often require thousands of hours of transcribed speech data during training. Therefore, building multilingual ASR systems, or systems for a language with few resources, is a challenging task. Multilingual training and cross-lingual adaptation are potential solutions. However, context-dependent state modeling creates difficulties for multilingual and cross-lingual ASR because the phone set mismatch between languages causes a large increase in the number of context-dependent labels. The goal of this thesis is to improve current state-of-the-art acoustic modeling techniques for ASR in general, with a particular focus on multilingual ASR and cross-lingual adaptation. We systematically exploited new training frameworks, from Maximum Likelihood Estimation through Connectionist Temporal Classification to Maximum Mutual Information, in the context of phoneme-based multilingual training. To minimize the negative effects of data impurity arising from language mismatch, we investigated language adaptive training approaches, which further improve multilingual ASR performance. Through comprehensive experimental comparison, we demonstrated that phoneme-based multilingual models are easily extensible to unseen phonemes of new languages, and that cross-lingual adaptation from these models yields significant improvement over traditional approaches on limited data. Finally, we proposed a semi-supervised training approach based on dropout to boost performance in low-resourced languages using untranscribed data. In the other part of the thesis, we conducted a more theoretical analysis of techniques found to be useful in sequential multilingual training. More specifically, we revisited the recurrent architecture on the basis of Bayes’s theorem. This leads to a Bayesian recurrent unit that is dictated by the probabilistic formulation and naturally supports a backward recursion. Experiments show that the proposed architecture exceeds the performance of conventional recurrent networks. Taken together, this thesis constitutes a thorough analysis of the current field. Through theoretical and experimental comparisons, the proposed approaches are shown to yield significant improvement over conventional hybrid systems on multilingual speech recognition.
