Infoscience

research article

Improving speech embedding using crossmodal transfer learning with audio-visual data

Le, Nam • Odobez, Jean-Marc
June 1, 2019
Multimedia Tools and Applications

Learning a discriminative voice embedding allows speaker turns to be compared directly and efficiently, which is crucial for tasks such as diarization and verification. This paper investigates several transfer learning approaches to improve a voice embedding using knowledge transferred from a face representation. The main idea of our crossmodal approaches is to constrain the target voice embedding space to share latent attributes with the source face embedding space. The shared latent attributes can be formalized as geometric properties or distribution characteristics between these embedding spaces. We propose four transfer learning approaches belonging to two categories. The first category relies on the structure of the source face embedding space to regularize the speaker turn embedding space at different granularities. The second category, a domain adaptation approach, improves the embedding space of speaker turns by applying a maximum mean discrepancy loss to minimize the disparity between the distributions of the embedded features. Experiments are conducted on TV news datasets, REPERE and ETAPE, to demonstrate our methods. Quantitative results in verification and clustering tasks show promising improvement, especially in cases where speaker turns are short or the training data size is limited. The analysis also gives insights into the embedding spaces and shows their potential applications.
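
The maximum mean discrepancy (MMD) loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the kernel, the estimator, or the network used, so everything below is an assumption made for illustration: a Gaussian (RBF) kernel with a hypothetical bandwidth `sigma`, the biased estimator of squared MMD, and random arrays standing in for face and voice embeddings. It is not the authors' implementation.

```python
import numpy as np


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * sigma**2))


def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    k_xx = rbf_kernel(x, x, sigma).mean()
    k_yy = rbf_kernel(y, y, sigma).mean()
    k_xy = rbf_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy


# Synthetic stand-ins for embeddings (assumption: 8-dimensional vectors).
rng = np.random.default_rng(0)
face_emb = rng.normal(0.0, 1.0, size=(200, 8))       # "source" space
voice_same = rng.normal(0.0, 1.0, size=(200, 8))     # same distribution
voice_shifted = rng.normal(1.5, 1.0, size=(200, 8))  # shifted distribution

mmd_same = mmd2(face_emb, voice_same, sigma=4.0)
mmd_shifted = mmd2(face_emb, voice_shifted, sigma=4.0)
print(mmd_same, mmd_shifted)
```

In this toy setting the estimate is close to zero when both sample sets come from the same distribution and grows when one set is shifted. Used as a training loss, minimizing such a quantity would push the distribution of voice embeddings toward that of the face embeddings.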

Type
research article
DOI
10.1007/s11042-018-6992-3
Web of Science ID

WOS:000474249700068

Author(s)
Le, Nam
Odobez, Jean-Marc  
Date Issued

2019-06-01

Published in
Multimedia Tools and Applications
Volume

78

Issue

11

Start page

15681

End page

15704

Subjects

speaker diarization • multimodal identification • metric learning • transfer learning • deep learning • speaker • face

URL

Related documents

https://publidiap.idiap.ch/index.php/publications/showcite/Le_Idiap-Internal-RR-79-2017
Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
LIDIAP  
Available on Infoscience
January 22, 2019
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/153624