Learning bimodal structure in audio-visual data

Monaci, Gianluca; Vandergheynst, Pierre; Sommer, Friedrich T.

doi:10.1109/TNN.2009.2032182

2009

Abstract

A novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal kernels from audio-visual material. The basis functions that emerge during learning capture salient audio-visual data structures. In addition it is demonstrated that the learned dictionary can be used to locate sources of sound in the movie frame. Specifically, in sequences containing two speakers the algorithm can robustly localize a speaker even in the presence of severe acoustic and visual distracters.
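
The representation described in the abstract, a sparse sum of bimodal kernels placed freely in space and time, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `synthesize`, the placement tuple layout, the use of one shared coefficient per placement, and the fixed number of audio samples per video frame are all assumptions made for illustration. The sketch only covers synthesis from a given dictionary, not the unsupervised learning step, and it assumes every placement fits inside the signal.

```python
import numpy as np

def synthesize(kernels, atoms, n_frames, height, width, samples_per_frame):
    """Build an audio-visual signal as a sparse sum of shifted bimodal kernels.

    kernels: list of (audio, video) pairs (hypothetical layout). The audio part
             has shape (L * samples_per_frame,) and the video part has shape
             (L, h, w), so both parts span the same duration.
    atoms:   list of (kernel_index, coefficient, frame, row, col) placements,
             each assumed to lie fully inside the signal.
    """
    audio = np.zeros(n_frames * samples_per_frame)
    video = np.zeros((n_frames, height, width))
    for k, c, t, y, x in atoms:
        ka, kv = kernels[k]
        L, h, w = kv.shape
        # The audio part is shifted in time only.
        a0 = t * samples_per_frame
        audio[a0:a0 + ka.size] += c * ka
        # The video part is shifted in time and in image position.
        video[t:t + L, y:y + h, x:x + w] += c * kv
    return audio, video
```

With a dictionary of a few kernels and a short list of placements, most entries of the resulting signal stay zero, which is the sense in which the sum is sparse.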

Details

Title Learning bimodal structure in audio-visual data

Author(s) Monaci, Gianluca ; Vandergheynst, Pierre ; Sommer, Friedrich T.

Published in IEEE Transactions on Neural Networks

Volume 20

Issue 12

Pages 1898-1910

Date 2009

Keywords

sparsity; multimodal; learning; LTS2

DOI https://doi.org/10.1109/TNN.2009.2032182

Laboratories LTS2

Record creation date 2008-07-03
