Nonlinear feature transformations for noise robust speech recognition

Ikbal, Shajith

doi:10.5075/epfl-thesis-3125

doctoral thesis

2004

Robustness to external noise is an important requirement for deploying automatic speech recognition (ASR) systems in practical applications. This thesis proposes and evaluates new feature-based approaches for improving ASR noise robustness. These approaches are based on nonlinear transformations that, when applied to the spectrum or the features, aim to emphasize the part of the speech that is relatively more invariant to noise and/or deemphasize the part that is more sensitive to noise. Spectral peaks constitute the high signal-to-noise ratio part of the speech. Thus, an efficient parameterization of only the components around the peak locations is expected to improve noise robustness. Evaluating this requires estimating the peak locations. Two methods are proposed in this thesis for the peak estimation task: 1) a frequency-based dynamic programming (DP) algorithm that uses the spectral slope values of a single time frame, and 2) an HMM/ANN based algorithm that uses distinct time-frequency (TF) patterns in the spectrogram (thus imposing temporal constraints during the peak estimation). Learning the distinct TF patterns in an unsupervised manner makes the HMM/ANN based algorithm sensitive to energy fluctuations in the TF patterns, which is not the case with the frequency-based DP algorithm. For an efficient parameterization of spectral components around the peak locations, parameters describing the activity pattern (energy surface) within local TF patterns around the spectral peaks are computed and used as features. These features, referred to as spectro-temporal activity pattern (STAP) features, show improved noise robustness; however, they are inferior to the standard features in clean speech. The main reason for this is the complete masking of the non-peak regions in the spectrum, which also carry significant information required for clean speech recognition.
This leads to the development of a new approach that uses a soft-masking procedure instead of discarding the non-peak spectral components completely. In this approach, referred to as the phase autocorrelation (PAC) approach, noise robustness is addressed in the autocorrelation domain (the time-domain Fourier equivalent of the power spectral domain). It uses the phase (i.e., angle) variation of the signal vector over time as a measure of correlation, as opposed to the regular autocorrelation, which uses the dot product. This alternative measure of autocorrelation is referred to as PAC, and is motivated by the fact that the angle is less disturbed by additive disturbances than the dot product. Interestingly, the use of PAC has the effect of emphasizing the peaks and smoothing out the valleys in the spectral domain, without explicitly estimating the peak locations. PAC features exhibit improved noise robustness. However, even the soft-masking strategy tends to degrade clean speech recognition performance. This points to the fact that externally designed transformations, which do not take full account of the underlying complexity of the speech signal, may not be able to improve robustness without hurting clean speech recognition. A better approach in this case is to learn the transformation from the speech data itself in a data-driven manner, balancing improved noise robustness against keeping the clean performance intact. An existing data-driven approach called TANDEM is analyzed to validate this. In the TANDEM approach, a multi-layer perceptron (MLP) performs a data-driven transformation of the input features; it learns the transformation by being trained in a supervised, discriminative mode, with phoneme labels as output classes.
Such training makes the MLP perform a nonlinear discriminant analysis in the input feature space, and thus makes it learn a transformation that projects the input features onto a sub-space of maximum class discriminatory information. This projection is able to suppress the noise-related variability while keeping the speech discriminatory information intact. An experimental evaluation of the TANDEM approach shows that it is effective in improving noise robustness. Interestingly, the TANDEM approach is able to further improve the noise robustness of the STAP and PAC features, and also to improve their clean speech recognition performance. The analysis of the noise robustness of TANDEM has also led to another interesting use of it, namely as an integration tool for adaptively combining multiple feature streams. The validity of the various noise-robust approaches developed in this thesis is shown by evaluating them on the OGI Numbers95 database with added noise from Noisex92, and also on the Aurora-2 database. A combination of the robust features developed in this thesis with standard features, in a TANDEM framework, results in a system that is reasonably robust in all conditions.
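
As an illustration of the phase autocorrelation idea described in the abstract, the following is a minimal sketch and not the thesis implementation. It assumes that each lag compares the frame with a circularly shifted copy of itself, so that both vectors have the same norm and the regular autocorrelation equals the frame energy times the cosine of the angle between them; the PAC value at that lag is then taken to be the angle itself. The function name `phase_autocorrelation` and the use of a circular shift are assumptions made for this example.

```python
import math


def phase_autocorrelation(frame):
    """Return the angle (radians) between a frame and each circular shift of it.

    Assumption for this sketch: lag k uses a circular shift, so the shifted
    vector has the same norm as the original and
    dot(x, x_k) = energy * cos(theta_k).
    """
    x = [float(v) for v in frame]
    n = len(x)
    energy = sum(v * v for v in x)
    if energy == 0.0:
        raise ValueError("frame has zero energy; the angle is undefined")

    pac = []
    for k in range(n):
        # Regular (circular) autocorrelation at lag k: a dot product.
        r_k = sum(x[i] * x[(i + k) % n] for i in range(n))
        # Normalize by the energy to get cos(theta_k); clamp against rounding.
        cos_theta = max(-1.0, min(1.0, r_k / energy))
        # PAC keeps the angle instead of the dot product.
        pac.append(math.acos(cos_theta))
    return pac


if __name__ == "__main__":
    print(phase_autocorrelation([2.0, 1.0, 0.0, -1.0]))
```

Because the dot product is divided by the frame energy before the angle is taken, scaling the frame by a constant leaves the result unchanged in this sketch, whereas the regular autocorrelation would scale with the square of that constant.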

Name

EPFL_TH3125.pdf

Access type

restricted

Size

1.92 MB

Format

Adobe PDF

Checksum (MD5)

d90c3d71cc3eb34d4adde011e0886965