Infoscience

Thesis

Forensic automatic speaker recognition using Bayesian interpretation and statistical compensation for mismatched conditions

State-of-the-art automatic speaker recognition systems perform very well in discriminating between the voices of speakers under controlled recording conditions. However, the conditions in which recordings are made in investigative activities (e.g., anonymous calls and wire-tapping) cannot be controlled and pose a challenge to automatic speaker recognition. Differences in the phone handset, in the transmission channel and in the recording devices can introduce variability over and above that of the voices in the recordings. The strength of evidence, estimated using statistical models of within-source variability and between-sources variability, is expressed as a likelihood ratio, i.e., the ratio of the probabilities of observing the features of the questioned recording in the statistical model of the suspected speaker's voice under two competing hypotheses: that the suspected speaker is the source of the questioned recording, and that the speaker at the origin of the questioned recording is not the suspected speaker. The main unresolved problem in forensic automatic speaker recognition today is that of handling mismatch in recording conditions, which has to be considered in the estimation of the likelihood ratio.

The research in this thesis mainly addresses the erroneous estimation of the strength of evidence caused by mismatch in the technical conditions of encoding, transmission and recording of the databases used in a Bayesian interpretation framework. We investigate three main directions in applying the Bayesian interpretation framework to forensic automatic speaker recognition casework. The first addresses the problem of mismatched recording conditions of the databases used in the analysis. The second concerns introducing the Bayesian interpretation methodology to aural-perceptual speaker recognition, as well as comparing aural-perceptual tests performed by laypersons with an automatic speaker recognition system, in matched and mismatched recording conditions. The third addresses the problem of variability in estimating the likelihood ratio, and several new solutions to cope with this variability are proposed.

Firstly, we propose a new approach to estimate and statistically compensate for the effects of mismatched recording conditions. It relies on dedicated databases, called "scaling databases", from which the parameters for scaling the score distributions are estimated. These scaling databases reduce the need for recording large databases for potential populations in each recording condition, which is both expensive and time-consuming. The compensation method is based on the principal Gaussian component in the distributions, so the error in the likelihood ratios obtained after compensation increases with the deviation of the score distributions from the Gaussian distribution. We propose guidelines for the creation of a database that can be used to estimate and compensate for mismatch, and create a prototype of this database to validate the compensation methodology.

Secondly, we analyze the effect of mismatched recording conditions on the strength of evidence, using both aural-perceptual and automatic speaker recognition methods. We introduce the Bayesian interpretation methodology to aural-perceptual speaker recognition, so that likelihood ratios can be estimated from aural-perceptual tests. It was experimentally observed that, when the suspect and questioned recordings were made in matched conditions, the automatic systems performed better than aural recognition. In mismatched conditions, however, the baseline automatic systems performed comparably to, or slightly worse than, aural recognition. Adapting the baseline automatic system to mismatch gave performance comparable to or better than that of aural recognition in the same conditions.

Thirdly, in the application of Bayesian interpretation to real forensic case analysis, we propose several new solutions for analyzing the variability of the strength of evidence, using bootstrapping techniques, statistical significance testing and confidence intervals, as well as multivariate extensions of the likelihood ratio for handling cases where the suspect data is limited.

For forensic automatic speaker recognition to be acceptable for presentation in the courts, the methodologies and techniques have to be researched, tested and evaluated for error, and be generally accepted in the scientific community. The methodology presented in this thesis is viewed in the light of the Daubert (USA, 1993) ruling on the admissibility of scientific evidence.
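
As a rough illustration of the likelihood ratio and the bootstrap-based variability analysis mentioned in the abstract, the following Python sketch may help. It is not the implementation used in the thesis: it assumes that each score distribution can be summarised by a single Gaussian, and the function names (`likelihood_ratio`, `bootstrap_lr_interval`) and all score values are hypothetical, chosen only for the example.

```python
import math
import random
import statistics


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x (std floored to avoid division by zero)."""
    std = max(std, 1e-6)
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def likelihood_ratio(evidence, within_scores, between_scores):
    """LR = p(E | same speaker) / p(E | different speaker).

    Assumption for this sketch: both score distributions are modelled
    by one Gaussian each, fitted from the sample mean and deviation.
    """
    numerator = gaussian_pdf(
        evidence, statistics.mean(within_scores), statistics.stdev(within_scores)
    )
    denominator = gaussian_pdf(
        evidence, statistics.mean(between_scores), statistics.stdev(between_scores)
    )
    return numerator / denominator


def bootstrap_lr_interval(evidence, within_scores, between_scores,
                          n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the LR, resampling both score sets."""
    rng = random.Random(seed)
    lrs = []
    for _ in range(n_resamples):
        w = rng.choices(within_scores, k=len(within_scores))
        b = rng.choices(between_scores, k=len(between_scores))
        lrs.append(likelihood_ratio(evidence, w, b))
    lrs.sort()
    lower = lrs[int((alpha / 2) * n_resamples)]
    upper = lrs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical scores, invented for illustration only.
within = [8.1, 9.4, 10.2, 11.0, 9.8, 10.5, 8.9, 10.9]    # suspect vs. suspect model
between = [-2.0, 0.5, 1.2, -0.7, 2.3, 0.1, -1.4, 1.8]    # population vs. suspect model
evidence = 7.5                                           # questioned recording's score

lr = likelihood_ratio(evidence, within, between)
interval = bootstrap_lr_interval(evidence, within, between)
print(f"LR = {lr:.3g}, 95% bootstrap interval = ({interval[0]:.3g}, {interval[1]:.3g})")
```

A likelihood ratio above 1 supports the hypothesis that the suspected speaker is the source of the questioned recording, and a value below 1 supports the alternative. The width of the bootstrap interval gives a sense of how much the estimate depends on the particular scores available. As the abstract notes for the compensation method, a Gaussian summary becomes less reliable as the score distributions deviate from the Gaussian shape.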

Thesis, École polytechnique fédérale de Lausanne (EPFL), no. 3367 (2005)
Section of Electrical and Electronic Engineering
School of Engineering
Signal Processing Institute
Jury: Pierre Margot, Daniel Mlynek, Philip Rose, Jean-Philippe Thiran

Public defense: 2005-11-18

Reference

Record created on 2005-10-12, modified on 2013-10-02
