000125304 001__ 125304
000125304 005__ 20190416220440.0
000125304 0247_ $$2doi$$a10.1109/TNN.2009.2032182
000125304 02470 $$2ISI$$a000272484200004
000125304 037__ $$aARTICLE
000125304 245__ $$aLearning bimodal structure in audio-visual data
000125304 269__ $$a2009
000125304 260__ $$c2009
000125304 336__ $$aJournal Articles
000125304 520__ $$aA novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal kernels from audio-visual material. The basis functions that emerge during learning capture salient audio-visual data structures. In addition, it is demonstrated that the learned dictionary can be used to locate sources of sound in the movie frame. Specifically, in sequences containing two speakers the algorithm can robustly localize a speaker even in the presence of severe acoustic and visual distracters.
000125304 6531_ $$asparsity
000125304 6531_ $$amultimodal
000125304 6531_ $$alearning
000125304 6531_ $$alts2
000125304 700__ $$0241005$$g150417$$aMonaci, Gianluca
000125304 700__ $$g120906$$aVandergheynst, Pierre$$0240428
000125304 700__ $$aSommer, Friedrich T.
000125304 773__ $$j20$$tIEEE Transactions on Neural Networks$$k12$$q1898-1910
000125304 8564_ $$uhttps://infoscience.epfl.ch/record/125304/files/IEEETNN_final.pdf$$zn/a$$s1355663
000125304 909C0 $$xU10380$$0252392$$pLTS2
000125304 909CO $$ooai:infoscience.tind.io:125304$$qGLOBAL_SET$$pSTI$$particle
000125304 937__ $$aEPFL-ARTICLE-125304
000125304 973__ $$rREVIEWED$$sPUBLISHED$$aEPFL
000125304 980__ $$aARTICLE