
Thesis

# Reconnaissance et Transformation de Locuteurs

The general goal of this PhD thesis is to understand how to analyse, decompose, model and transform the vocal identity of a speaker as seen through an automatic speaker recognition application, with a view to improving current state-of-the-art speaker verification approaches. The thesis starts with an introduction discussing the properties of the speech signal and the basis of state-of-the-art automatic speaker recognition systems. The errors of an operating speaker recognition application are then analysed. From the deficiencies and mistakes observed in a typical application, conclusions are drawn which imply a re-evaluation of the characteristic parameters of a speaker and the modification of some parts of the automatic speaker recognition chain. Starting from the speech signal, the speaker characteristic parameters are extracted using a harmonic plus noise (HN) analysis and synthesis model. The analysis and re-synthesis of the harmonic and noise parts indicate which parameters are speech or speaker dependent. It is then shown that the speaker discriminant information can be found by subtracting the HN modeled signal from the original signal. A study of impostor modeling, essential in the tuning of a speaker recognition system, is then carried out. The impostors are simulated in two ways. The first is a transformation of the speech of a source speaker (the impostor) into the speech of a target speaker (the client) using the parameters extracted from the HN model. This transformation is an effective imposture, as the false acceptance rate grows from 4% to 23%. The second is an automatic imposture by speech segment concatenation, for which the false acceptance rate grows to 30%. A way to become less sensitive to spectral modification impostures is to remove the harmonic part or even the noise part modeled by the HN from the original signal.
Using such a subtraction decreases the false acceptance rate to 8%, even when transformed impostors are used. To overcome the lack of training data (one of the main causes of modeling errors in speaker recognition), a decomposition of the recognition task into a set of binary classifiers is proposed. A classifier matrix is built, and each of its elements has to discriminate between the data coming from the client and the data coming from another, randomly chosen speaker (referred to as the "anti-speaker"). With such an approach, it is possible to weight the results according to the vocabulary or the neighbors of the client in the parameter (acoustic) space. The outputs of all the binary classifiers (matrix classifiers) are then combined in a weighted sum to produce a single output score for each client input. The weights are estimated on an independent validation set to minimize the overlap between the client and impostor densities. It is shown that the binary pair speaker recognition system usually performs better than a state-of-the-art HMM-based system (especially when an a priori threshold is used). In order to set a point of operation (i.e., a point on the ROC curve) for the speaker recognition application, an a priori threshold has to be determined. Theoretically, the threshold should be speaker independent when stochastic models are used. However, practical experiments show that this is not the case and that, due to modeling assumptions, the threshold actually becomes speaker and utterance length dependent. A theoretical framework showing how to adjust the threshold using the local likelihood ratio is then developed. Finally, a further modeling error correction approach is proposed and tested using decision fusion. Practical experiments show the advantages and drawbacks of the fusion approach.
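The weighted-sum combination of binary classifier outputs and the a priori threshold decision described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names, the weights, the scores and the threshold are all hypothetical values chosen for illustration, and the weights are simply given here rather than estimated on a validation set.

```python
from typing import Sequence


def fused_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Combine the outputs of the binary classifiers into one score (weighted sum)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


def accept(score: float, threshold: float) -> bool:
    """Accept the claimed identity when the fused score reaches the threshold."""
    return score >= threshold


def false_acceptance_rate(impostor_scores: Sequence[float], threshold: float) -> float:
    """Fraction of impostor trials that are wrongly accepted at this threshold."""
    if not impostor_scores:
        raise ValueError("at least one impostor trial is required")
    accepted = sum(1 for s in impostor_scores if accept(s, threshold))
    return accepted / len(impostor_scores)


# Hypothetical data: three binary classifiers (client vs. three anti-speakers).
weights = [0.5, 0.25, 0.25]          # assumed, not estimated on validation data
client_trial = [0.75, 0.5, 1.0]      # classifier outputs for one client utterance
impostor_trials = [
    [0.25, 0.0, 0.5],
    [1.0, 0.5, 0.5],
    [0.0, 0.25, 0.25],
    [0.5, 0.5, 0.0],
]
threshold = 0.5                      # assumed a priori threshold

client_fused = fused_score(client_trial, weights)
impostor_fused = [fused_score(trial, weights) for trial in impostor_trials]
far = false_acceptance_rate(impostor_fused, threshold)
```

With these made-up numbers the client trial is accepted and one of the four impostor trials passes the threshold. Moving `threshold` up or down trades false acceptances against false rejections, which is what selecting a point of operation on the ROC curve amounts to.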

#### Reference

Record created on 2010-02-11, modified on 2017-05-10