Enhancing posterior based speech recognition systems

The use of local phoneme posterior probabilities has been increasingly explored for improving speech recognition systems. Hybrid hidden Markov model / artificial neural network (HMM/ANN) and Tandem are the most successful examples of such systems. In this thesis, we present a principled framework for enhancing the estimation of local posteriors, by integrating phonetic and lexical knowledge, as well as long contextual information. This framework allows for hierarchical estimation, integration and use of local posteriors from the phoneme up to the word level. We propose two approaches for enhancing the posteriors. In the first approach, phoneme posteriors estimated with an ANN (particularly multi-layer Perceptron – MLP) are used as emission probabilities in HMM forward-backward recursions. This yields new enhanced posterior estimates integrating HMM topological constraints (encoding specific phonetic and lexical knowledge), and long context. In the second approach, a temporal context of the regular MLP posteriors is post-processed by a secondary MLP, in order to learn inter and intra dependencies among the phoneme posteriors. The learned knowledge is integrated in the posterior estimation during the inference (forward pass) of the second MLP, resulting in enhanced posteriors. The use of resulting local enhanced posteriors is investigated in a wide range of posterior based speech recognition systems (e.g. Tandem and hybrid HMM/ANN), as a replacement or in combination with the regular MLP posteriors. The enhanced posteriors consistently outperform the regular posteriors in different applications over small and large vocabulary databases.

Related material