Joint speech and speaker recognition

BenZeghiba, Mohamed Faouzi

doi:10.5075/epfl-thesis-3193

doctoral thesis

2005

The goal of this thesis is to investigate approaches that combine and integrate Automatic Speech Recognition (ASR) and Speaker Recognition (SR) systems, with applications to (1) User-Customized Password Speaker Verification (UCP-SV) systems and (2) joint speech and speaker recognition. Unlike text-dependent speaker verification systems, UCP-SV systems let customers easily choose their own password, which has to be pronounced a few times during enrollment to create a customer-specific model that is subsequently used for verification. The main assumption in such systems is that no a priori knowledge about the password (such as its phonetic transcription) is available. Although more user-friendly and more secure, UCP-SV systems are less well understood and raise several new challenges, including automatic inference of the Hidden Markov Model (HMM) of the password (using a speaker-independent ASR system), fast speaker adaptation of the resulting acoustic models, score normalization, and verification of both lexical and speaker characteristics. Development and evaluation of such systems are therefore based on their ability to jointly verify that (1) the speaker has the claimed identity and (2) the correct password is pronounced, thus rejecting all other possible alternatives. In this thesis, two speaker acoustic modeling approaches are investigated: an HMM/GMM approach (based on Gaussian Mixture Models, GMMs) and a hybrid HMM/MLP approach (based on Multi-Layer Perceptrons, MLPs). In the HMM/GMM approach, the main difficulty was the background model used for likelihood normalization, and several solutions were investigated to improve the baseline system. In the HMM/MLP approach, MLP adaptation was also a problem: we found that the modeling capability of the adapted MLP was geared more towards learning the lexical content of the password than the customer's voice characteristics. Therefore, a probabilistic framework that combines the hybrid HMM/MLP system and a GMM is proposed and extensively investigated. In this framework, the HMM/MLP system is used for utterance verification, while the GMM is used for speaker verification. Since UCP-SV involves both speech recognition (ASR) and speaker verification (SV), a natural extension of our work was to investigate new approaches that use ASR together with Speaker Recognition (SR) to improve both ASR and SR systems. In this context, we show that optimization and recognition based on a joint ASR-SR posterior probability criterion yield better ASR and SR performance than the two systems achieve independently, and better than a "sequential" approach (e.g., first performing speaker identification/clustering, followed by speech recognition). This work resulted in a PC-based, real-time implementation of an HMM-based UCP-SV system, available for demonstration.
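
As an illustration of the two checks described in the abstract, the following is a minimal sketch, not the thesis implementation. It assumes diagonal-covariance GMMs given as plain (weight, mean, variance) triples, scores the speaker with a background-normalized average log-likelihood ratio, and takes the utterance verification score as an externally supplied confidence value (in the thesis this role is played by the hybrid HMM/MLP system). All function names, the toy models, the thresholds, and the simple "both checks must pass" decision rule are hypothetical and chosen only for illustration; the thesis combines the two systems in a probabilistic framework that is not reproduced here.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_avg_loglik(frames, gmm):
    """Average per-frame log-likelihood under a GMM of (weight, mean, var) triples."""
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        peak = max(terms)
        # Log-sum-exp over mixture components, for numerical stability.
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total / len(frames)

def normalized_score(frames, customer_gmm, background_gmm):
    """Average log-likelihood ratio: customer model versus background model."""
    return gmm_avg_loglik(frames, customer_gmm) - gmm_avg_loglik(frames, background_gmm)

def joint_accept(frames, customer_gmm, background_gmm, utterance_score,
                 sv_threshold=0.0, uv_threshold=0.5):
    """Accept only if both the speaker check and the password check pass.

    utterance_score is a hypothetical confidence that the correct password
    was spoken; both thresholds are illustrative assumptions.
    """
    speaker_ok = normalized_score(frames, customer_gmm, background_gmm) >= sv_threshold
    password_ok = utterance_score >= uv_threshold
    return speaker_ok and password_ok

# Toy two-dimensional models and feature frames (made up for this sketch).
customer_gmm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])]
background_gmm = [(1.0, [5.0, 5.0], [4.0, 4.0])]
customer_frames = [[0.1, -0.2], [1.9, 2.1]]
impostor_frames = [[5.0, 5.1], [4.8, 5.2]]

print(joint_accept(customer_frames, customer_gmm, background_gmm, utterance_score=0.9))
print(joint_accept(impostor_frames, customer_gmm, background_gmm, utterance_score=0.9))
```

Subtracting the background log-likelihood is what makes scores comparable across utterances and recording conditions, which is why the choice of background model matters so much for the HMM/GMM approach described above.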

Use this identifier to reference this record

https://infoscience.epfl.ch/handle/20.500.14299/212569

Name

EPFL_TH3193.pdf

Access type

restricted

Size

6.83 MB

Format

Adobe PDF

Checksum (MD5)

4c9ca363974a900aa2b105d9c90be412