Blind Audio-Visual Source Separation Using Sparse Redundant Representations

Llagostera Casanovas, A.; Monaci, G.; Vandergheynst, P.

2006

Abstract

This report presents a new method to address the Blind Audio Source Separation (BASS) problem using both audio and visual information. Given a mixture, we first locate the video sources and then recover each source signal, using only one microphone and the associated video. The proposed model is based on the Matching Pursuit (MP) [18] decomposition of both audio and video signals into meaningful structures. Frequency components are extracted from the soundtrack, which provides information about the energy content of a sound in the time-frequency plane. Moreover, the MP decomposition of the audio is robust to noise, because noise has a flat characteristic in this plane. In the video, the temporal displacement of geometric features indicates movement in the image. If such a displacement is temporally close to an audio event, it points to the video structure that generated the sound. The method we present links audio and visual structures (atoms) according to their temporal proximity, building audiovisual relationships. Video sources are identified and located in the image by exploiting these connections, using a clustering algorithm that rewards the video features most frequently related to audio over the whole sequence. The goal of BASS is also achieved through the audiovisual relationships. First, the video structures close to a source are classified as belonging to it. Then, our method assigns each audio atom to the source of its related video features. At this point, the separation obtained by audio reconstruction is still limited, with problems when sources are active at exactly the same time. This procedure allows us to discover the temporal periods of activity of each source. However, with a temporal analysis alone it is not possible to separate precisely synchronous audio features of different sources.
The goal is then to learn the frequency behavior of each source when it alone is active, in order to predict the moments when sources overlap. With a simple frequency association, results improve considerably, yielding separated soundtracks of better audible quality. In this report, we analyze in depth all the steps of the proposed approach, highlighting the motivation for each of them.
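
As an illustration of the Matching Pursuit decomposition mentioned in the abstract, the following is a minimal sketch of the generic greedy algorithm. It is not the implementation used in the thesis. The toy dictionary, the signal, and the names matching_pursuit and unit are assumptions made for this example only; the audio and video dictionaries used in the work are not described in this record.

```python
import math


def unit(vector):
    """Scale a vector to unit Euclidean norm (MP assumes unit-norm atoms)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def matching_pursuit(signal, dictionary, n_atoms):
    """Generic greedy Matching Pursuit.

    Returns a list of (atom_index, coefficient) pairs and the final residual.
    At each step, the atom most correlated with the residual is selected and
    its contribution is subtracted from the residual.
    """
    residual = list(signal)
    picks = []
    for _ in range(n_atoms):
        # Inner product of the current residual with every atom.
        scores = [sum(r * a for r, a in zip(residual, atom))
                  for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda k: abs(scores[k]))
        coeff = scores[best]
        if abs(coeff) < 1e-12:
            break  # nothing left to explain
        picks.append((best, coeff))
        residual = [r - coeff * a
                    for r, a in zip(residual, dictionary[best])]
    return picks, residual


# Hypothetical toy dictionary of four unit-norm atoms in R^4.
dictionary = [
    unit([1, 0, 0, 0]),
    unit([0, 1, 0, 0]),
    unit([1, 1, 0, 0]),
    unit([0, 0, 1, 1]),
]

# Hypothetical signal: a strong component on atom 3 plus a weak one on atom 0.
signal = [3.0 * a + 0.5 * b for a, b in zip(dictionary[3], dictionary[0])]

picks, residual = matching_pursuit(signal, dictionary, n_atoms=2)
residual_energy = sum(r * r for r in residual)
```

In this toy case the two selected atoms are orthogonal, so two iterations explain the signal almost exactly. With a redundant dictionary of correlated atoms, the residual generally decays more gradually over the iterations.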

Details

Title Blind Audio-Visual Source Separation Using Sparse Redundant Representations

Author(s) Llagostera Casanovas, A. ; Monaci, G. ; Vandergheynst, P.

Date 2006

Keywords

LTS2

Note ITS

Laboratories LTS2

Record Appears in Scientific production and competences > STI - School of Engineering > IEM - Institut d'Electricité et de Microtechnique > LTS2 - Signal Processing Laboratory 2
Work produced at EPFL
Student projects

Work type Master's Thesis

Record creation date 2006-10-27
