Infoscience

doctoral thesis

On Modeling the Synergy Between Acoustic and Lexical Information for Pronunciation Lexicon Development

Razavi, Marzieh  
2017

State-of-the-art automatic speech recognition (ASR) and text-to-speech systems require a pronunciation lexicon that maps each word to a sequence of phones. Manual development of lexicons is costly, as it requires linguistic knowledge and human expertise. To facilitate this process, grapheme-to-phone (G2P) conversion approaches are used, in which, given a seed lexicon provided by linguistic experts, the G2P relationship is learned by applying statistical techniques. Despite advances in these approaches, two challenges remain: (1) seed lexicon development through linguistic expertise incorporates limited acoustic information, which may not necessarily cover all natural phonological variations, and (2) the linguistic expertise required for the development of the seed lexicon may not be available for all languages, particularly under-resourced languages. The goal of this thesis is to address these challenges by developing a framework that effectively integrates linguistic information and acoustic data for pronunciation lexicon development. To achieve that goal, we first study the problem of matching a word hypothesis to the acoustic signal, and show that the hidden Markov model-based ASR approach achieves that match via a latent symbol set. Building on that understanding, we develop a data-driven G2P conversion approach in which a probabilistic G2P relationship is learned by matching the acoustic signal with the word hypothesis represented by graphemes, using phones as the latent symbols. Through a theoretical development, we show that this acoustic G2P conversion approach is a particular case of an abstract posterior-based G2P conversion formalism, which requires estimation of phone class conditional probabilities.
Through studies on two languages, we show that the acoustic G2P conversion approach yields lexicons that perform comparably to state-of-the-art G2P conversion methods at the ASR level, despite performing relatively poorly at the pronunciation level. We build on the posterior-based formalism to show that different G2P conversion approaches in the literature can be regarded as different estimators of phone class conditional probabilities, and can be combined in a multi-stream fashion to yield better lexicons. We also demonstrate that the multi-stream formulation can be further extended to unify acoustic-to-phone conversion and G2P conversion. We validate the proposed multi-stream formulation on two challenging tasks on English. Finally, we address the issue of developing lexical resources for under-resourced languages by proposing an acoustic subword unit (ASWU)-based lexicon development approach. In this approach, ASWU derivation is cast as the problem of determining a latent symbol space given the word hypothesis and acoustics, and the pronunciations are generated using the proposed acoustic G2P conversion approach. Through experimental studies and analysis on well-resourced and under-resourced languages, we show that the derived ASWUs are "phone-like", and that the ASWU-based lexicons yield better ASR systems than the alternative grapheme-based lexicons.

Type
doctoral thesis
DOI
10.5075/epfl-thesis-7851
Author(s)
Razavi, Marzieh  
Advisors
Bourlard, Hervé  
•
Magimai Doss, Mathew  
Jury

Prof. Sabine Süsstrunk (president); Prof. Hervé Bourlard, Dr Mathew Magimai Doss (thesis directors); Prof. Jean-Philippe Thiran, Dr Kate Knill, Prof. Marelie Davel (examiners)

Date Issued

2017

Publisher

EPFL

Publisher place

Lausanne

Public defense date

2017-08-10

Thesis number

7851

Total number of pages

164

Subjects

Phonetic lexicon development

•

grapheme-to-phone conversion

•

acoustic subword unit discovery

•

hidden Markov model

•

automatic speech recognition

•

under-resourced languages

EPFL units
LIDIAP  
Faculty
STI  
School
IEL  
Doctoral School
EDEE  
Available on Infoscience
August 9, 2017
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/139584
Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved