Joint Speech and Speaker Recognition
The goal of this thesis was to investigate and optimize different approaches to User-Customized Password Speaker Verification (UCP-SV) systems. In such systems, users choose their own passwords, which are subsequently used for verification. The main assumption, unlike in text-dependent speaker verification, is that no a priori knowledge about the password (such as its phonetic transcription) is available to the system. Speaker verification has already been widely investigated. However, although more user-friendly, UCP-SV is less well understood and presents several new challenges, including: automatic inference of HMM password models (from a speaker-independent ASR system), fast speaker adaptation of the resulting acoustic models, score normalization, and verification (involving both text verification, i.e. ASR, and speaker verification). Evaluation of such a system is then based on its ability to simultaneously verify two hypotheses: (1) that the speaker is who they claim to be, and (2) that they are pronouncing the correct password, thus rejecting all other possible alternatives. In this thesis, two different speaker acoustic modeling approaches are investigated: the HMM/GMM approach and the hybrid HMM/MLP approach. Within the HMM/GMM approach, the background model used for likelihood normalization was a particularly difficult problem, and several solutions had to be investigated to improve the baseline system and to make the UCP-SV system practical. Within the HMM/MLP approach, MLP adaptation was the main problem. We found that the adapted MLP tended to learn the lexical content of the password rather than the customer's characteristics. Therefore, a probabilistic framework that combines the hybrid HMM/MLP system and Gaussian Mixture Models (GMMs) is proposed. In this framework, the HMM/MLP system is used for utterance verification, while the GMM is used for speaker verification.
Experimental results showed performance comparable to that of the HMM/GMM approach. Since UCP-SV involves both speech recognition (ASR) and speaker verification (SV), a natural extension of our work was to investigate new approaches that use ASR together with SV to improve both systems. In this thesis, we show that optimization and recognition based on a joint ASR-SV posterior probability yield better ASR and SV performance than can be achieved through a "sequential" approach (e.g., first performing speaker ID/clustering, followed by speech recognition).
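To make the contrast between the joint and the "sequential" approach concrete, one possible formalization is sketched below. The symbols X (acoustic observations), W (word sequence) and S (speaker identity) are notation assumed here for illustration only; they do not come from the abstract, and the thesis itself may define the joint posterior differently.

```latex
\begin{align*}
% Joint decision: word sequence and speaker are chosen together
(\hat{W}, \hat{S}) &= \arg\max_{W,S} P(W, S \mid X)
                    = \arg\max_{W,S} P(W \mid S, X)\, P(S \mid X) \\
% Sequential decision: the speaker is fixed first, then words are decoded given that speaker
\hat{S} &= \arg\max_{S} P(S \mid X), \qquad
\hat{W} = \arg\max_{W} P(W \mid \hat{S}, X)
\end{align*}
```

Under this assumed notation, the sequential form commits to a speaker before any lexical evidence is considered, so the pair it returns can score lower under the joint posterior than the jointly optimized pair.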