Tampered Speaker Inconsistency Detection with Phonetically Aware Audio-visual Features

Korshunov, Pavel; Halstead, Michael; Castan, Diego; Graciarena, Martin; McLaren, Mitchell; Burns, Brian; Lawson, Aaron; Marcel, Sébastien

conference paper

2019

International Conference on Machine Learning

The recent increase in social media based propaganda, i.e., ‘fake news’, calls for automated methods to detect tampered content. In this paper, we focus on detecting tampering in a video with a person speaking to a camera. This form of manipulation is easy to perform, since one can just replace a part of the audio, dramatically changing the meaning of the video. We consider several detection approaches based on phonetic features and recurrent networks. We demonstrate that by replacing standard MFCC features with embeddings from a DNN trained for automatic speech recognition, combined with mouth landmarks (visual features), we can achieve a significant performance improvement on several challenging publicly available databases of speakers (VidTIMIT, AMI, and GRID), for which we generated sets of tampered data. The evaluations demonstrate a relative equal error rate reduction of 55% (to 4.5% from 10.0%) on the large GRID corpus based dataset and a satisfying generalization of the model on other datasets.

Type

conference paper

Author(s)

Korshunov, Pavel

Halstead, Michael

Castan, Diego

Graciarena, Martin

McLaren, Mitchell

Burns, Brian

Lawson, Aaron

Marcel, Sébastien

Date Issued

2019

Series title/Series vol.

Synthetic Realities: Deep Learning for Detecting AudioVisual Fakes

Subjects

inconsistencies detection

lip-syncing

Video tampering

Note

Best paper award in ICML workshop "Synthetic Realities: Deep Learning for Detecting AudioVisual Fakes"

URL