000230230 001__ 230230
000230230 005__ 20190125163149.0
000230230 037__ $$aCONF
000230230 245__ $$aImproving speaker turn embedding by crossmodal transfer learning from face embedding
000230230 269__ $$a2017
000230230 260__ $$c2017
000230230 336__ $$aConference Papers
000230230 520__ $$aLearning speaker turn embeddings has shown considerable improvement in situations where conventional speaker modeling approaches fail. However, this improvement is relatively limited when compared to the gain observed in face embedding learning, which has proven very successful for face verification and clustering tasks. Assuming that faces and voices from the same identities share some latent properties (such as age, gender, and ethnicity), we propose two transfer learning approaches that leverage knowledge from the face domain, learned from thousands of identities, for tasks in the speaker domain. These approaches, namely target embedding transfer and clustering structure transfer, exploit the structure of the source face embedding space at different granularities as regularization terms when optimizing the target speaker turn embedding space. Our methods are evaluated on two public broadcast corpora and yield promising advances over competitive baselines in verification and audio clustering tasks, especially when dealing with short speaker utterances. The analysis gives insight into the characteristics of the embedding spaces and shows their potential applications.
000230230 700__ $$aLe, Nam
000230230 700__ $$g161663$$aOdobez, Jean-Marc$$0243995
000230230 7112_ $$aICCV Workshop on Computer Vision for Audio-Visual Media
000230230 909C0 $$xU10381$$0252189$$pLIDIAP
000230230 909CO $$pconf$$pSTI$$ooai:infoscience.tind.io:230230
000230230 937__ $$aEPFL-CONF-230230
000230230 970__ $$aLe_ICCV-CVAVM_2017/LIDIAP
000230230 973__ $$aEPFL
000230230 980__ $$aCONF