Abstract

We propose an information-theoretic framework for the quantitative assessment of acoustic models used in hidden Markov model (HMM) based automatic speech recognition (ASR). The HMM backend expects that (i) the acoustic model yields accurate state-conditional emission probabilities for the observations at each time step, and (ii) the conditional probability distribution of the data given the underlying hidden state is independent of any other state in the sequence. The latter property is also known as the Markovian conditional independence assumption of HMM-based modeling. In this work, we cast HMM-based ASR as a communication channel: the acoustic model computes the state emission probabilities as the input of the channel, and the channel outputs the most probable hidden state sequence. The quality of the acoustic model is thus quantified in terms of the amount of information transmitted through this channel, as well as how robust the channel is against mismatch between the data and the HMM's conditional independence assumption. To formulate the required information-theoretic quantities, we use the gamma posterior (state occupancy) probabilities of the HMM hidden states to derive a simple analysis framework that assesses the benefits and shortcomings of various acoustic models in HMM-based ASR. Our approach enables us to analyze acoustic modeling with Gaussian mixture models (GMMs) as well as deep neural networks (DNNs) with different numbers of hidden layers, without explicitly evaluating their ASR performance. As use cases, we apply our analysis to sequence-discriminatively trained DNN acoustic models as well as state-of-the-art recurrent and time-delay neural networks to compare their efficacy as acoustic models in HMM-based ASR. In addition, we use our analysis to study the contribution of sparse and low-dimensional models to enhancing acoustic modeling for better compliance with the HMM requirements.
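The gamma posteriors at the core of this analysis can be illustrated with a minimal sketch (not the paper's actual implementation): a toy forward-backward pass over a two-state HMM computes the state occupancy probabilities from frame-wise emission scores, and the average per-frame posterior entropy serves as a hypothetical stand-in for the "information transmitted" quantity. Sharper emission probabilities from a better acoustic model yield lower posterior entropy, i.e. more information about the hidden state sequence.

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(log_b, log_A, log_pi):
    """Gamma (state-occupancy) posteriors for one utterance.

    log_b : (T, S) frame-wise log emission probabilities (the channel input)
    log_A : (S, S) log transition matrix
    log_pi: (S,)   log initial state distribution
    """
    T, S = log_b.shape
    log_alpha = np.zeros((T, S))
    log_beta = np.zeros((T, S))
    log_alpha[0] = log_pi + log_b[0]
    for t in range(1, T):  # forward recursion
        log_alpha[t] = log_b[t] + logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0)
    for t in range(T - 2, -1, -1):  # backward recursion
        log_beta[t] = logsumexp(log_A + log_b[t + 1] + log_beta[t + 1], axis=1)
    log_gamma = log_alpha + log_beta
    log_gamma -= logsumexp(log_gamma, axis=1, keepdims=True)  # normalize per frame
    return np.exp(log_gamma)

def mean_posterior_entropy(gamma):
    """Average per-frame entropy H(state | observations), in bits."""
    p = np.clip(gamma, 1e-12, 1.0)
    return float(np.mean(-np.sum(p * np.log2(p), axis=1)))

# Toy two-state HMM; emission scores are illustrative, not from a real model.
log_A = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
log_pi = np.log(np.array([0.5, 0.5]))
b_sharp = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
b_diffuse = np.log(np.array([[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.4, 0.6]]))

gamma_sharp = forward_backward(b_sharp, log_A, log_pi)
gamma_diffuse = forward_backward(b_diffuse, log_A, log_pi)
h_sharp = mean_posterior_entropy(gamma_sharp)
h_diffuse = mean_posterior_entropy(gamma_diffuse)
```

Here the sharper emission probabilities produce more peaked gamma posteriors and hence a lower mean entropy than the diffuse ones, mirroring the intuition that a stronger acoustic model transmits more information through the channel.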
