Journal article

Template-matching for text-dependent speaker verification

In the last decade, i-vector and Joint Factor Analysis (JFA) approaches to speaker modeling have become ubiquitous in the area of automatic speaker recognition. Both of these techniques involve the computation of posterior probabilities, using either Gaussian Mixture Models (GMM) or Deep Neural Networks (DNN), as a prior step to estimating i-vectors or speaker factors. GMMs focus on implicitly modeling phonetic information of acoustic features while DNNs focus on explicitly modeling phonetic/linguistic units. For text-dependent speaker verification, DNN-based systems have considerably outperformed GMM for fixed-phrase tasks. However, both approaches ignore phone sequence information. In this paper, we aim at exploiting this information by using Dynamic Time Warping (DTW) with speaker-informative features. These features are obtained from i-vector models extracted over short speech segments, also called online i-vectors. Probabilistic Linear Discriminant Analysis (PLDA) is further used to project online i-vectors onto a speaker-discriminative subspace. The proposed DTW approach obtained at least 74% relative improvement in equal error rate on the RSR corpus over other state-of-the-art approaches, including i-vector and JFA.


Related material