Infoscience

research article

Phonetic Subspace Features for Improved Query by Example Spoken Term Detection

Ram, Dhananjay • Asaei, Afsaneh • Bourlard, Hervé
2018
Speech Communication

This paper addresses the problem of detecting speech utterances in a large audio archive using a simple spoken query, a problem referred to as "Query by Example Spoken Term Detection" (QbE-STD). This still-open pattern matching problem has been addressed in different contexts, often based on variants of the Dynamic Time Warping (DTW) algorithm. In the work reported here, we exploit Deep Neural Networks (DNNs) and the phone posteriors they infer to better model the phonetic subspaces and, consequently, improve QbE-STD performance. Phone posteriors have indeed been shown to properly model the union of the underlying low-dimensional phonetic subspaces. Exploiting this property, we investigate two methods relying on sparse modeling and on linguistic knowledge of sub-phonetic components. Sparse modeling characterizes the phonetic subspaces through a dictionary for sparse coding. Projecting the phone posteriors onto the corresponding subspaces, by reconstructing them from their sparse representation, enhances those phone posteriors. On the other hand, linguistic-knowledge-driven sub-phonetic structures are identified using phonological posteriors, which consist of the probabilities of phone attributes estimated by DNNs, resulting in a new set of feature vectors. These phonological posteriors provide complementary information, and a distance fusion method is proposed to integrate information from phone and phonological posterior features. Both posterior features are used for query detection using DTW and evaluated on the AMI database. We demonstrate that the subspace-enhanced phone posteriors obtained using sparse reconstruction outperform the conventional DNN posteriors. The distance fusion technique gives a further improvement in QbE-STD performance.
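The DTW-based detection and distance fusion described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the local distance (negative log of the dot product between posterior vectors), the convex-combination fusion with a hypothetical `weight` parameter, the function names, and the toy posterior values.

```python
import math

def log_dot_distance(q, t, eps=1e-10):
    # Assumed local distance: -log of the dot product of two posterior vectors.
    return -math.log(max(sum(a * b for a, b in zip(q, t)), eps))

def distance_matrix(query, test):
    # Rows index query frames, columns index test frames.
    return [[log_dot_distance(q, t) for t in test] for q in query]

def fuse(d_phone, d_phonol, weight=0.5):
    # Hypothetical fusion rule: weighted sum of two distance matrices.
    return [[weight * a + (1.0 - weight) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(d_phone, d_phonol)]

def subsequence_dtw(dist):
    # Align the whole query against any contiguous region of the test
    # utterance. Returns (length-normalized cost, end frame of best match).
    n, m = len(dist), len(dist[0])
    acc = [[float("inf")] * m for _ in range(n)]
    length = [[0] * m for _ in range(n)]
    for j in range(m):
        # The query may start at any test frame at no extra cost.
        acc[0][j] = dist[0][j]
        length[0][j] = 1
    for i in range(1, n):
        for j in range(m):
            candidates = [(acc[i - 1][j], length[i - 1][j])]
            if j > 0:
                candidates.append((acc[i][j - 1], length[i][j - 1]))
                candidates.append((acc[i - 1][j - 1], length[i - 1][j - 1]))
            # Simplification: pick the predecessor with the lowest
            # accumulated (not normalized) cost.
            best_cost, best_len = min(candidates)
            acc[i][j] = dist[i][j] + best_cost
            length[i][j] = best_len + 1
    scores = [acc[n - 1][j] / length[n - 1][j] for j in range(m)]
    end = min(range(m), key=scores.__getitem__)
    return scores[end], end

# Toy data (invented): 3-class "phone" and 2-attribute "phonological" posteriors.
query_phone = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]]
test_phone = [[0.05, 0.05, 0.9], [0.9, 0.05, 0.05],
              [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
query_phonol = [[0.9, 0.1], [0.1, 0.9]]
test_phonol = [[0.5, 0.5], [0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

fused = fuse(distance_matrix(query_phone, test_phone),
             distance_matrix(query_phonol, test_phonol))
score, end = subsequence_dtw(fused)
print(score, end)  # the best match ends at test frame 2 for this toy data
```

In this toy case the query pattern occupies test frames 1 and 2, so the lowest normalized cost is found at end frame 2. A detection system would compare `score` against a threshold; how that threshold and the fusion weight are chosen is not specified in the abstract.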

Type
research article
DOI
10.1016/j.specom.2018.07.001
Author(s)
Ram, Dhananjay
Asaei, Afsaneh
Bourlard, Hervé
Date Issued
2018
Published in
Speech Communication
Volume
103
Start page
27
End page
36

Subjects

Deep neural network • Dictionary learning • Phone posterior • Phonological posterior • Query by example • Sparse representation • Spoken term detection

URL

Related documents

https://publidiap.idiap.ch/downloads//papers/2018/Ram_SPEECHCOMMUNICATION_2018.pdf

Related documents

https://publidiap.idiap.ch/index.php/publications/showcite/Ram_Idiap-Internal-RR-110-2017
Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
LIONS  
LIDIAP  
Available on Infoscience
January 22, 2019
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/153634

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved