Speaker recognition in noisy environments using auxiliary information and Bayesian networks

Speaker recognition systems achieve acceptable performance in controlled laboratory conditions. In real-life environments, however, their performance degrades drastically, principally because of the mismatch between the testing and the training recording conditions. The two main causes of this mismatch are the degradation introduced by background noise and the distortion produced by the transmission channel. Despite major advances in the speaker recognition field, no optimal solution to this problem has yet been found. This thesis proposes new methods for speaker recognition systems that use auxiliary information to reduce the influence of background noise and transmission channel distortions. For this purpose, statistical models capable of taking several sources of information into account in a unified framework are provided. While most state-of-the-art speaker recognition systems use spectral envelope features alone, other features can both complement the information about the speaker's individuality and provide information about the conditions under which testing takes place. Both kinds of information can help to improve the performance of the speaker recognition system. In this thesis, we focus on three auxiliary sources of information: the pitch, the voicing status and the reliability status of the spectral envelope features. These auxiliary features are used together with the spectral envelope features. Algorithms for efficiently extracting the pitch and the voicing status from noisy telephone-quality speech are developed. An algorithm for extracting the reliability status of the spectral envelope features is also provided.
Two new statistical modeling approaches for handling auxiliary sources of information are proposed: the state-dependent transitions (SDT) model and the state-dependent states (SDS) model. Both models take into account the temporal dependencies between features of a given source of information, as well as the dependencies between features that belong to different sources of information. Speaker identification experiments were conducted to evaluate the SDT modeling approach. Experiments were also performed to evaluate the novel pitch-dependent GMM system, which is based on the SDS modeling approach. The results of all these experiments show that the modeling techniques proposed in this thesis are capable of capturing the key characteristics of the speech features and their dependencies. The concept of conditional independence and the use of conditional models are central to the SDT and SDS models. One of the major drawbacks of these models is that the dependencies between features are fixed. To eliminate this drawback, a more flexible approach using Bayesian networks is introduced. Bayesian networks can manage the dependencies between features via conditional models and can handle the conditional independence relationships between features. We show in this thesis how Bayesian networks can complement and replace the SDS and SDT models. Two Bayesian network based systems are presented for handling auxiliary information in speaker recognition. The first uses the pitch, the voicing status and the spectral envelope features; the second extends the first by adding the reliability status to the set of features. Both proposed systems were compared to a GMM-UBM (Gaussian mixture model - universal background model) baseline system. Experiments were performed to evaluate the proposed approaches in noisy conditions, as well as when different transmission channels are used for testing and for training the speaker models.
The results obtained show that the Bayesian network based systems using auxiliary information outperform the classical GMM-UBM system, which uses only spectral envelope features. The Bayesian network based systems proposed in this thesis effectively reduce the influence of noise and transmission channel mismatch.
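
As a minimal illustration of what a conditional model means in this context, the sketch below factorizes the joint likelihood of a voicing status and a spectral feature as P(v) p(x | v), the simplest two-node Bayesian network, and uses it to pick the most likely speaker. It is not the system developed in the thesis: the one-dimensional feature, the parameter values, the speaker labels and all function names are hypothetical, and frames are treated as independent, so the temporal dependencies modeled by SDT and SDS are ignored.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def frame_loglik(model, voiced, x):
    """log p(v, x) = log P(v) + log p(x | v) for a single frame.

    The voicing status v is the parent node; the spectral feature x is
    modeled by a Gaussian whose parameters depend on v (assumed toy model).
    """
    prior = model["p_voiced"] if voiced else 1.0 - model["p_voiced"]
    mean, var = model["voiced"] if voiced else model["unvoiced"]
    return math.log(prior) + gaussian_logpdf(x, mean, var)


def utterance_loglik(model, frames):
    """Sum of frame log-likelihoods (frames assumed independent)."""
    return sum(frame_loglik(model, voiced, x) for voiced, x in frames)


def identify(models, frames):
    """Return the label of the speaker model with the highest log-likelihood."""
    return max(models, key=lambda name: utterance_loglik(models[name], frames))


# Hypothetical speaker models: (mean, variance) of the feature given voicing.
speakers = {
    "A": {"p_voiced": 0.6, "voiced": (1.0, 0.5), "unvoiced": (-1.0, 0.5)},
    "B": {"p_voiced": 0.6, "voiced": (2.5, 0.5), "unvoiced": (0.5, 0.5)},
}

# Hypothetical test utterance: (voicing status, spectral feature) per frame.
frames = [(True, 1.1), (True, 0.8), (False, -0.9), (True, 1.3)]

best = identify(speakers, frames)
print(best)
```

Conditioning the feature distribution on the voicing status lets the same observed value be scored differently in voiced and unvoiced frames, which a single unconditional density cannot do. A fuller network in the spirit of the abstract would add nodes for the pitch and the reliability status.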

