Recent advances in the multi-stream HMM/ANN hybrid approach to noise robust ASR

In this article we review several successful extensions to the standard Hidden-Markov-Model/Artificial Neural Network (HMM/ANN) hybrid, which have recently made important contributions to the field of noise robust automatic speech recognition. The first extension to the standard hybrid was the ``multi-band hybrid'', in which a separate ANN is trained on each frequency subband, followed by some form of weighted combination of \ANN state posterior probability outputs prior to decoding. However, due to the inaccurate assumption of subband independence, this system usually gives degraded performance, except in the case of narrow-band noise. All of the systems which we review overcome this independence assumption and give improved performance in noise, while also improving or not significantly degrading performance with clean speech. The ``all-combinations multi-band'' hybrid trains a separate ANN for each subband combination. This, however, typically requires a large number of ANNs. The ``all-combinations multi-stream'' hybrid trains an ANN expert for every combination of just a small number of complementary data streams. Multiple ANN posteriors combination using maximum a-posteriori (MAP) weighting gives rise to the further successful strategy of hypothesis level combination by MAP selection. An alternative strategy for exploiting the classification capacity of ANNs is the ``tandem hybrid'' approach in which one or more ANN classifiers are trained with multi-condition data to generate discriminative and noise robust features for input to a standard ASR system. The ``multi-stream tandem hybrid'' trains an ANN for a number of complementary feature streams, permitting multi-stream data fusion. The ``narrow-band tandem hybrid'' trains an ANN for a number of particularly narrow frequency subbands. This gives improved robustness to noises not seen during training. Of the systems presented, all of the multi-stream systems provide generic models for multi-modal data fusion. Test results for each system are presented and discussed

Related material