MLP-based Log Spectral Energy Mapping for Robust Overlapping Speech Recognition

This paper investigates a multilayer perceptron (MLP) based acoustic feature mapping to extract robust features for automatic speech recognition (ASR) of overlapping speech. The MLP is trained to learn the mapping from log mel filter bank energies (MFBEs) extracted from the distant microphone recordings, including multiple overlapping speakers, to log MFBEs extracted from the clean speech signal. The outputs of the MLP are then used to generate mel filterbank cepstral coefficient (MFCC) acoustic features, that are subsequently used in acoustic model adaptation and system evaluation. The proposed approach is evaluated through extensive studies on the MONC corpus, which includes both non-overlapping single speaker and overlapping multi-speaker conditions. We demonstrate that by learning the mapping between log MFBEs extracted from noisy and clean signals the performance of ASR system can be significantly improved in overlapping multi-speaker condition compared a conventional delay-sum beamforming approach, while keeping the performance of the system on single non-overlapping speaker condition intact.

Related material