Abstract

The goal of this thesis is to improve current state-of-the-art techniques in speaker verification (SV), typically based on “identity-vectors” (i-vectors) and deep neural networks (DNNs), by exploiting diverse (phonetic) information extracted using various techniques such as automatic speech recognition (ASR). Different speakers span different subspaces within a universal acoustic space, usually modelled by a “universal background model”. The speaker-specific subspace depends on the speaker’s voice characteristics, but also on the text the speaker utters. In current state-of-the-art SV systems, i-vectors are extracted by applying a factor analysis technique to obtain a low-dimensional speaker-specific representation. Furthermore, DNN output is also employed in the conventional i-vector framework to model phonetic information embedded in the speech signal. This thesis proposes various techniques to exploit phonetic knowledge of speech to further enrich speaker characteristics. More specifically, the proposed techniques are applied to two SV tasks, namely text-independent and text-dependent SV. For the text-independent SV task, several ASR systems are developed and applied to compute phonetic posterior probabilities, which are subsequently exploited to enhance the speaker-specific information included in i-vectors. These approaches are then extended to the text-dependent SV task, exploiting temporal information in a principled way, i.e., by applying dynamic time warping to speaker-informative vectors. Finally, instead of training a DNN with phonetic information, a DNN is trained in an end-to-end fashion to directly discriminate between speakers. The baseline end-to-end SV approach maps a variable-length speech segment to a fixed-dimensional speaker vector by estimating the mean of hidden representations in the DNN. We improve upon this technique by computing a distance function between two utterances that takes common phonetic units into account. The whole network is optimized with a triplet-loss objective function. The proposed approaches are evaluated on commonly used datasets such as NIST SRE 2010 and RSR2015. Applying phonetic knowledge yields significant improvements over the baseline systems on both the text-dependent and text-independent SV tasks.
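
As a rough illustration of the baseline end-to-end idea described above (mean pooling of frame-level representations, followed by a triplet-loss objective), the following minimal sketch may help. It is not the thesis implementation: the function names, the cosine distance, the margin value of 0.2, and the toy two-dimensional "hidden representations" are all assumptions made for illustration, and the phonetically informed distance proposed in the thesis is not shown.

```python
import math

def mean_pool(frames):
    """Average a variable-length list of frame vectors into one fixed-size vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def cosine_distance(u, v):
    """1 - cosine similarity (assumed distance; undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: the same-speaker pair should be closer than
    the different-speaker pair by at least `margin` (assumed value)."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

# Toy stand-ins for DNN hidden representations (hypothetical values).
# Segments have different numbers of frames but pool to the same dimension.
anchor = mean_pool([[1.0, 0.0], [0.8, 0.2]])                 # speaker A, 2 frames
positive = mean_pool([[0.9, 0.1], [1.0, 0.0], [0.8, 0.2]])   # speaker A, 3 frames
negative = mean_pool([[0.0, 1.0], [0.2, 0.8]])               # speaker B, 2 frames

loss = triplet_loss(anchor, positive, negative)
swapped_loss = triplet_loss(anchor, negative, positive)
```

In this toy case the triplet is already well separated, so `loss` is zero, while swapping the positive and negative examples gives a positive `swapped_loss`. In an actual system the pooled vectors would come from a trained network and the loss would be backpropagated through it.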
