Neural Network-Based Regression for Robust Overlapping Speech Recognition Using Microphone Arrays

This paper investigates a neural network-based acoustic feature mapping to extract robust features for automatic speech recognition (ASR) of overlapping speech. In our preliminary studies, we trained neural networks to learn the mapping from log mel filterbank energies (MFBEs), extracted from distant microphone recordings that contain multiple overlapping speakers, to log MFBEs extracted from the clean speech signal. In this paper, we explore the mapping of higher-order mel-frequency cepstral coefficients (MFCCs) to lower-order coefficients. We also investigate the mapping of features from both the target and interfering distant sound sources to the clean target features. This is achieved by using the microphone array to extract features from the directions of both the target and the interfering sound sources. We demonstrate the effectiveness of the proposed approach through extensive evaluations on the MONC corpus, which includes both non-overlapping single-speaker and overlapping multi-speaker conditions.
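The core idea of the feature mapping can be illustrated with a minimal sketch: a small multilayer perceptron trained by gradient descent to regress corrupted log filterbank-style features onto their clean counterparts, minimizing mean squared error. This is an assumption-laden toy (synthetic Gaussian "features", a single tanh hidden layer, and illustrative dimensions and learning rate), not the authors' actual architecture or training setup.

```python
import numpy as np

# Toy sketch of NN-based feature mapping (NOT the paper's exact model):
# regress "distant/corrupted" frames onto "clean" frames with an MLP.
rng = np.random.default_rng(0)

D_IN, D_HID, D_OUT = 24, 64, 24   # e.g. 24 log mel filterbank energies (assumed)

# Synthetic stand-in data: clean frames plus additive distortion.
clean = rng.normal(size=(512, D_OUT))
distant = clean + rng.normal(size=clean.shape)

# One hidden tanh layer, linear output layer (standard regression head).
W1 = rng.normal(scale=0.1, size=(D_IN, D_HID)); b1 = np.zeros(D_HID)
W2 = rng.normal(scale=0.1, size=(D_HID, D_OUT)); b2 = np.zeros(D_OUT)

def forward(x):
    h = np.tanh(x @ W1 + b1)      # hidden representation
    return h, h @ W2 + b2         # linear output: predicted clean features

# Plain batch gradient descent on mean squared error.
lr = 0.05
for _ in range(1500):
    h, y = forward(distant)
    err = (y - clean) / len(distant)      # d(MSE)/dy up to a constant
    gW2, gb2 = h.T @ err, err.sum(0)
    dh = (err @ W2.T) * (1.0 - h**2)      # backprop through tanh
    gW1, gb1 = distant.T @ dh, dh.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

_, mapped = forward(distant)
mse_before = float(np.mean((distant - clean) ** 2))
mse_after = float(np.mean((mapped - clean) ** 2))
```

After training, `mse_after` is substantially below `mse_before`, i.e. the mapped features are closer to the clean targets than the raw distant features, which is the property the ASR front-end relies on.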
