Journal article

Aural and automatic forensic speaker recognition in mismatched conditions

In this article, we compare aural and automatic speaker recognition in the context of forensic analyses, using a Bayesian framework for the interpretation of evidence. We use perceptual tests performed by 90 phonetically untrained subjects and compare their performance with that of an automatic speaker recognition system. Several forensic cases were simulated using the Polyphone IPSC-02 database, varying in linguistic content and technical conditions of recording. We estimate the strength of evidence for both the human listeners and the baseline automatic system by calculating likelihood ratios, using perceptual scores for humans and log-likelihood scores for the automatic system. To do so, a methodology analogous to the Bayesian interpretation in forensic automatic speaker recognition is applied to the perceptual scores given by humans. The degradation of the accuracy of human recognition in mismatched recording conditions is contrasted with that of the automatic system under similar recording conditions. The conditions considered are fixed telephone, cellular telephone and noisy speech in forensically realistic conditions. The perceptual cues that the human subjects use to perceive differences in voices are studied, along with their importance in different recording conditions. We observe that while automatic speaker recognition shows higher accuracy in matched conditions of training and testing, its performance degrades significantly in mismatched conditions. Aural recognition accuracy is also observed to degrade from matched to mismatched conditions. In mismatched conditions, the baseline automatic system showed comparable or slightly degraded performance compared to aural recognition, whereas the baseline automatic system with adaptation to noisy conditions showed comparable or better performance than aural recognition.
The higher-level perceptual cues used by human listeners to recognise speakers are discussed. We also discuss the possibility of increasing the accuracy of automatic systems by using the perceptual cues that remain robust to mismatched recording conditions.
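
As an illustration of the score-based likelihood ratio mentioned above, the following is a minimal sketch only, not the method of the article. It assumes that the distributions of scores under the same-speaker and different-speaker hypotheses are modelled with a Gaussian kernel density estimate and a normal-reference bandwidth; the abstract does not state how these distributions are modelled. All function names and all score values are hypothetical.

```python
import math
from statistics import stdev


def gaussian_kde(samples, bandwidth=None):
    """Return a density function estimated from samples with Gaussian kernels.

    Assumption: the bandwidth defaults to the normal-reference rule of thumb,
    1.06 * standard deviation * n^(-1/5).
    """
    n = len(samples)
    if bandwidth is None:
        bandwidth = 1.06 * stdev(samples) * n ** (-1 / 5)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density


def likelihood_ratio(evidence_score, same_speaker_scores, different_speaker_scores):
    """Ratio of the score density under the same-speaker hypothesis
    to the score density under the different-speaker hypothesis."""
    p_same = gaussian_kde(same_speaker_scores)(evidence_score)
    p_diff = gaussian_kde(different_speaker_scores)(evidence_score)
    return p_same / p_diff


# Made-up scores for illustration only; they are not data from the study.
same_speaker_scores = [5.0, 5.5, 6.0, 6.5, 7.0]
different_speaker_scores = [1.0, 1.5, 2.0, 2.5, 3.0]

lr_supporting_same = likelihood_ratio(6.0, same_speaker_scores, different_speaker_scores)
lr_supporting_diff = likelihood_ratio(2.0, same_speaker_scores, different_speaker_scores)
print(lr_supporting_same, lr_supporting_diff)
```

In this sketch, a ratio above 1 supports the same-speaker hypothesis and a ratio below 1 supports the different-speaker hypothesis. The same function can be fed perceptual scores from listeners or log-likelihood scores from an automatic system, which is what makes the two kinds of evidence comparable.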

