Novel speech processing techniques for robust automatic speech recognition
The goal of this thesis is to develop and design new feature representations that can improve automatic speech recognition (ASR) performance in clean as well as noisy conditions. One of the main shortcomings of fixed-scale, envelope-based features such as MFCC (typically computed over 20-30 ms analysis windows) is their poor handling of the non-stationarity of the underlying signal. In this thesis, a novel stationarity-synchronous speech spectral analysis technique is proposed that sequentially detects the largest quasi-stationary segments in the speech signal (of variable length, typically 20-60 ms) and then performs spectral analysis on each of them. In contrast to a fixed-scale analysis technique, the proposed technique provides better time and frequency resolution, thus leading to improved ASR performance.
Moving a step forward, this thesis then outlines the development of theoretically consistent amplitude modulation and frequency modulation (AM-FM) techniques for a broad-band signal such as speech. AM-FM signals have been well defined and studied in the context of communications systems. Borrowing from these ideas, several researchers have applied AM-FM modeling to speech signals, with mixed results. These techniques have varied in their definitions of the AM and FM signals and, consequently, in the demodulation methods they use. In this thesis, we carefully define AM and FM signals in the context of ASR. We show that for a theoretically meaningful estimation of the AM signals, it is important to constrain the companion FM signal to be narrow-band. Due to the Hilbert relationships, the AM signal induces a component in the FM signal that is fully determinable from the AM signal and hence carries redundant information. We present a novel homomorphic filtering technique that suppresses this redundant part and extracts the leftover FM signal. The estimated AM message signals are then down-sampled and their lower DCT coefficients are retained as speech features.
We show that this representation is, in fact, the exact dual of the real cepstrum and hence is referred to as the fepstrum. While the fepstrum captures the amplitude modulations (AM) occurring within a single frame of 100 ms, the MFCC feature provides the static energy in the Mel bands of each frame and its variation across several frames (the deltas). The two features therefore complement each other, and ASR experiments based on hidden Markov models with Gaussian mixture models (HMM-GMM) indicate that the fepstrum feature in conjunction with the MFCC feature achieves a significant ASR improvement when evaluated over several speech databases.
The second half of this thesis deals with noise-robust feature extraction techniques. We have designed an adaptive least squares filter (LeSF) that enhances a speech signal corrupted by broad-band noise, which can be non-stationary. This technique exploits the fact that the autocorrelation coefficients of broad-band noise decay much more rapidly with increasing time lag than those of the speech signal. This is especially true for voiced speech, which consists of several sinusoids at multiples of the fundamental frequency, so that its autocorrelation coefficients are themselves periodic with a period equal to the pitch period. Therefore, a high-order (typically 100-tap) least squares filter designed to predict a noisy speech signal (speech plus additive broad-band noise) will predict more of the clean speech components than of the broad-band noise. This is proved analytically in this thesis, and we derive analytic expressions for the noise rejection achieved by such a least squares filter. This enhancement technique has led to a significant improvement in ASR accuracy in the presence of real-life noises such as factory noise and aircraft cockpit noise.
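The prediction argument behind the least squares filter can be illustrated with a small numerical sketch. The code below is not the thesis's implementation: it fits a single block least-squares predictor (the thesis describes an adaptive filter), and the function name `lesf_enhance`, the `delay` parameter, the synthetic harmonic signal standing in for voiced speech, and the white Gaussian noise are all assumptions made for illustration.

```python
import numpy as np

def lesf_enhance(x, order=100, delay=1):
    """Hypothetical helper: predict x[n] from x[n-delay], ..., x[n-delay-order+1]
    with one least-squares fit over the whole block."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    start = delay + order - 1  # first sample that has a full set of regressors
    # Column k holds the input delayed by (delay + k) samples.
    rows = np.stack(
        [x[start - delay - k : n - delay - k] for k in range(order)], axis=1
    )
    target = x[start:]
    w, *_ = np.linalg.lstsq(rows, target, rcond=None)
    y = np.zeros(n)
    y[start:] = rows @ w  # predicted (enhanced) samples
    return y, w

def snr_db(ref, est):
    """Signal-to-noise ratio of est with respect to the clean reference, in dB."""
    err = est - ref
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(err ** 2))

# Assumed test signal: five harmonics of 100 Hz (a crude stand-in for voiced
# speech) plus white Gaussian noise, sampled at 8 kHz.
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(4000) / fs
clean = sum(np.sin(2 * np.pi * 100 * h * t) / h for h in range(1, 6))
noisy = clean + rng.normal(scale=1.0, size=t.size)

order, delay = 100, 1
enhanced, w = lesf_enhance(noisy, order=order, delay=delay)
start = delay + order - 1

snr_in = snr_db(clean[start:], noisy[start:])
snr_out = snr_db(clean[start:], enhanced[start:])
print(f"input SNR:  {snr_in:.1f} dB")
print(f"output SNR: {snr_out:.1f} dB")
```

Because the harmonic component is predictable from its own past while the white noise is not, the predictor output should retain most of the periodic signal and reject much of the noise, so the output SNR is expected to be higher than the input SNR. Real speech and real noises are only approximately periodic and approximately broad-band, so the gain on actual recordings would differ from this toy case.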
Finally, the last two chapters of this thesis deal with feature-level noise robustness techniques. Unlike least squares filtering, which enhances the speech signal itself (in the time domain), the feature-level techniques do not enhance the speech signal but rather boost the noise robustness of the speech features, which are usually non-linear functions of the speech signal's power spectrum. The techniques investigated in this thesis provided a significant improvement in ASR performance in clean as well as noisy acoustic conditions.