Learning Multimodal Temporal Representation for Dubbing Detection in Broadcast Media

Le, Nam; Odobez, Jean-Marc

doi:10.1145/2964284.2967211

2016

Abstract

Person discovery in the absence of prior identity knowledge requires accurate association of visual and auditory cues. In broadcast data, multimodal analysis faces additional challenges from voices narrated over muted scenes and from dubbing in different languages. To address these challenges, we define and analyze the problem of dubbing detection in broadcast data, which has not been explored before. We propose a method to represent the temporal relationship between the auditory and visual streams. The method combines canonical correlation analysis (CCA), which learns a joint multimodal space, with long short-term memory (LSTM) networks, which model cross-modality temporal dependencies. We also contribute a newly acquired dataset of face-speech segments from TV data, which we have made publicly available. The proposed method achieves promising performance on this real-world dataset compared with several baselines.
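
The CCA step named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`inv_sqrt`, `fit_cca`), the regularization constant, the feature dimensions, and the synthetic "audio" and "visual" features are all assumptions made for illustration. The LSTM stage is omitted; in a pipeline of the kind the abstract describes, the per-frame projections computed here would presumably be the inputs to a sequence model.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def fit_cca(X, Y, n_components=2, reg=1e-6):
    """Linear CCA on paired samples X (n, dx) and Y (n, dy).

    Returns the feature means, the two projection matrices, and the
    leading canonical correlations. `reg` is an assumed ridge term that
    keeps the covariance matrices invertible.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Whiten each modality, then take the SVD of the cross-covariance.
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    A = Kx @ U[:, :n_components]
    B = Ky @ Vt.T[:, :n_components]
    return mx, my, A, B, s[:n_components]

# Synthetic stand-ins for per-frame features (assumption): both
# modalities are noisy linear views of a shared 2-D latent signal.
rng = np.random.default_rng(0)
n = 2000
z = rng.standard_normal((n, 2))
X = z @ rng.standard_normal((2, 6)) + 0.1 * rng.standard_normal((n, 6))
Y = z @ rng.standard_normal((2, 4)) + 0.1 * rng.standard_normal((n, 4))

mx, my, A, B, corrs = fit_cca(X, Y)
Px = (X - mx) @ A  # first modality projected into the joint space
Py = (Y - my) @ B  # second modality projected into the joint space

# Correctly paired frames correlate strongly in the joint space;
# randomly re-paired frames do not.
matched = np.corrcoef(Px[:, 0], Py[:, 0])[0, 1]
Py_shuffled = Py[rng.permutation(n)]
shuffled = np.corrcoef(Px[:, 0], Py_shuffled[:, 0])[0, 1]
print("canonical correlations:", corrs)
print("matched:", matched, "shuffled:", shuffled)
```

The contrast between `matched` and `shuffled` illustrates why a joint space is useful here: when the two streams do not belong together, as in the shuffled pairing, their projections stop agreeing.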

Details

Title Learning Multimodal Temporal Representation for Dubbing Detection in Broadcast Media

Author(s) Le, Nam ; Odobez, Jean-Marc

Published in MM'16: Proceedings of the 2016 ACM Multimedia Conference

Pagination 5

Pages 202-206

Conference ACM Multimedia, Amsterdam

Date 2016

Publisher New York, ACM

ISBN 978-1-4503-3603-1

Keywords

Multimodal; Person Diarization; Recurrent Neural Networks

DOI https://doi.org/10.1145/2964284.2967211

Laboratories LIDIAP

Record Appears in Scientific production and competences > STI - School of Engineering > IEM - Institut d'Electricité et de Microtechnique > LIDIAP - L'IDIAP Laboratory
Scientific production and competences > Euler Center for Signal Processing
Peer-reviewed publications
Conference Papers
Work produced at EPFL
Published

Record creation date 2016-08-19