HMM inference towards flexible speech recognition
One of the difficulties in Automatic Speech Recognition (ASR) is pronunciation variability. Each word (modeled by a baseline phonetic transcription in the ASR dictionary) can be pronounced in many different ways, depending on complex qualitative and quantitative factors such as the speaker's dialect, gender, age and vocal tract length. This project focuses on pronunciation modelling in order to better capture this variability. The basic idea, based on Hidden Markov Model (HMM) inference, is to relax the lexical constraint. For each word in the dictionary, we transform the baseline phonetic transcription into an equivalent constrained ergodic HMM. This constrained model is then iteratively relaxed so that it converges towards a truly ergodic HMM, capable of generating any phone sequence. At each relaxation step, a pronunciation model (or several pronunciation models, if the HMM inference is tested on several utterances of the word) is inferred by the Viterbi algorithm. The performance of this inferred model is then measured by a confidence measure (showing how well the inferred model matches the acoustic data) and by a Levenshtein distance (showing how much the inferred model diverges from the baseline phonetic transcription). The method is tested on a list of 75 English words from the PhoneBook database. We observe that, for many of them, the baseline phonetic transcription is a good pronunciation model, since it remains stable across many relaxations. This means that such a baseform is robust to pronunciation variability. We also observe that we can infer a new pronunciation model that is close to the baseform in terms of phone sequence and that is also stable when the constrained ergodic HMM is relaxed. In this case, the solution is to include this inferred model (the baseform model) in the dictionary.
For a few words, the baseform may not be suitable for many speakers (low matching with the acoustic data and high divergence). Finally, the project is carried out in the context of a hybrid HMM/ANN recognizer (using Artificial Neural Networks (ANN) to estimate local posterior probabilities). Using the HMM inference technique, we also compare two ASR systems: a baseline system (trained with standard features) and a pitch-based system. We observe that the pitch-based MLP improves not only the matching between the acoustic data and the pronunciation model but also the stability of the baseform pronunciation model.
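
As an illustration of the divergence measure mentioned above, the following is a minimal sketch of a Levenshtein distance computed over phone sequences rather than characters. It assumes unit costs for insertion, deletion and substitution and no length normalization; the abstract does not state which costs or normalization were actually used. The function name `levenshtein` and the phone labels are hypothetical and are not taken from the PhoneBook database.

```python
def levenshtein(ref, hyp):
    """Edit distance between two phone sequences (lists of phone labels).

    Assumption: insertion, deletion and substitution each cost 1.
    """
    # prev[j] holds the distance between the first i-1 phones of ref
    # and the first j phones of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_phone in enumerate(ref, start=1):
        curr = [i]  # deleting the first i phones of ref to match an empty hyp
        for j, hyp_phone in enumerate(hyp, start=1):
            substitution = 0 if ref_phone == hyp_phone else 1
            curr.append(min(
                prev[j] + 1,                 # deletion of ref_phone
                curr[j - 1] + 1,             # insertion of hyp_phone
                prev[j - 1] + substitution,  # match or substitution
            ))
        prev = curr
    return prev[-1]


# Hypothetical example: a baseform and a pronunciation inferred after relaxation.
baseform = ["k", "ae", "t"]
inferred = ["k", "ah", "t", "s"]
distance = levenshtein(baseform, inferred)  # one substitution plus one insertion
```

A distance of zero means the inferred phone sequence is identical to the baseform, which corresponds to the stable case described above; larger values indicate that the relaxed model has drifted further from the dictionary transcription.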