In this thesis, we propose a novel approach to speaker and speech recognition based on localized, binary, data-driven features. The approach is largely inspired by similar localized approaches in computer vision. The success of these approaches, coupled with their proven robustness and computational efficiency, motivated us to apply the same ideas to the speech domain. Our approach is distinct from the standard approach to speaker and speech recognition, which is based on cepstral features. It starts with a large set of simple localized features, each of which looks at a very small part of a spectro-temporal representation of speech. Each feature is binary-valued. The most discriminative of these features are selected by boosting and combined to form the final classifier. Two systems are developed within this general framework: a speaker recognition system and a speech recognition system. The speaker recognition system is evaluated under a wide range of experimental conditions, using clean speech, noisy speech, and speech collected from mobile phones. The system performs reliably in each condition, with performance comparable to that of standard systems using cepstral features and Gaussian Mixture Models. At the same time, it requires significantly fewer floating-point operations than these systems. For the speech recognition system, we integrate our localized features into a Hidden Markov Model framework using multilayer perceptrons. Continuous speech recognition studies on standard databases show that these features perform as well as cepstral features. We also find that fusing these features with cepstral features, at both the feature level and the decision level, improves performance. Minor contributions include an audio-visual person recognition system developed using the same general approach of localized features, which extends its applicability. Finally, a new (but related) class of localized features was developed for robust face detection.
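To make the general framework more concrete, the following is a minimal sketch of selecting binary localized features by boosting. It rests on several assumptions that the abstract does not confirm: each feature is taken to be a comparison between two time-frequency bins of a small spectro-temporal patch, the boosting variant is plain discrete AdaBoost, and the data are synthetic. All names (`binary_features`, `adaboost_select`, `predict`, `toy_example`) are hypothetical and serve only as illustration.

```python
import itertools
import numpy as np


def binary_features(patches, pairs):
    """Evaluate pairwise-comparison features (an assumed feature form).

    patches: (n, F, T) array of spectro-temporal patches.
    pairs:   (m, 2) array of flattened bin indices.
    Returns an (n, m) array with entries in {-1, +1}.
    """
    flat = patches.reshape(len(patches), -1)
    return np.where(flat[:, pairs[:, 0]] > flat[:, pairs[:, 1]], 1, -1)


def adaboost_select(H, y, rounds):
    """Discrete AdaBoost where each binary feature is a weak classifier.

    H: (n, m) feature outputs in {-1, +1}; y: (n,) labels in {-1, +1}.
    Returns a list of (feature index, polarity, weight alpha).
    """
    n, _ = H.shape
    w = np.full(n, 1.0 / n)
    chosen = []
    for _ in range(rounds):
        # Weighted error of every feature when used directly as a classifier.
        err = (w[:, None] * (H != y[:, None])).sum(axis=0)
        # A feature with error above 0.5 is useful with inverted polarity.
        flipped = err > 0.5
        eff = np.where(flipped, 1.0 - err, err)
        j = int(np.argmin(eff))
        e = float(np.clip(eff[j], 1e-10, 1.0 - 1e-10))
        alpha = 0.5 * np.log((1.0 - e) / e)
        s = -1 if flipped[j] else 1
        # Increase the weight of the samples the chosen feature gets wrong.
        w = w * np.exp(-alpha * y * s * H[:, j])
        w /= w.sum()
        chosen.append((j, s, float(alpha)))
    return chosen


def predict(H, chosen):
    """Sign of the weighted vote of the selected features."""
    score = sum(alpha * s * H[:, j] for j, s, alpha in chosen)
    return np.where(score >= 0, 1, -1)


def toy_example(seed=0, rounds=3):
    """Synthetic two-class data: one bin is raised in each class."""
    rng = np.random.default_rng(seed)
    patches = rng.normal(size=(200, 4, 4))
    y = np.concatenate([np.ones(100, dtype=int), -np.ones(100, dtype=int)])
    patches[:100, 1, 2] += 4.0   # class +1: energy at flattened bin 6
    patches[100:, 3, 0] += 4.0   # class -1: energy at flattened bin 12
    pairs = np.array(list(itertools.combinations(range(16), 2)))
    H = binary_features(patches, pairs)
    chosen = adaboost_select(H, y, rounds)
    accuracy = float((predict(H, chosen) == y).mean())
    target = int(np.flatnonzero((pairs == [6, 12]).all(axis=1))[0])
    return chosen, accuracy, target
```

In this toy setting, the first boosting round should pick the feature that compares the two bins that differ between the classes. Evaluating a selected feature at test time costs a single comparison, which illustrates why a classifier of this kind can need few floating-point operations.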