Combining Vocal Tract Length Normalization with Hierarchical Linear Transformations

Saheer, Lakshmi; Yamagishi, Junichi; Garner, Philip N.; Dines, John

doi:10.1109/Jstsp.2013.2295554

2014

Abstract

Recent research has demonstrated the effectiveness of vocal tract length normalization (VTLN) as a rapid adaptation technique for statistical parametric speech synthesis. VTLN produces speech that is more natural than that of MLLR-based adaptation techniques and much closer in quality to speech generated by the original average voice model. However, with only a single parameter, VTLN captures very few speaker-specific characteristics compared to linear-transform-based adaptation techniques. This paper shows that the merits of VTLN can be combined with those of linear-transform-based adaptation in a hierarchical Bayesian framework, where VTLN is used as the prior information. A novel technique is presented for propagating the gender and age information captured by the VTLN transform into constrained structural maximum a posteriori linear regression (CSMAPLR) adaptation. The paper also compares the proposed technique to other combination techniques. Experiments are performed on both matched and mismatched training and test conditions, including gender, age, and recording environments. Text-to-speech (TTS) synthesis experiments show that the resulting transformation produces improved speech quality, with better naturalness and intelligibility (similar to the VTLN transformation) than the CSMAPLR transformation, especially when the quantity of adaptation data is very limited. With more parameters to capture speaker characteristics, the proposed method achieves better speaker similarity than VTLN in mismatched conditions. Hence, the proposed method combines the quality and intelligibility of VTLN with the speaker similarity of CSMAPLR, especially in mismatched training and test conditions. Experiments are also performed with an automatic speech recognition (ASR) system in a framework unified with that of synthesis, to demonstrate that the techniques developed for TTS can be plugged into ASR to improve performance.
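
To illustrate what "only a single parameter" means for VTLN, the sketch below warps the frequency axis with a first-order all-pass (bilinear) function. This particular warping function is an assumption chosen for illustration, since the abstract does not state which one the paper uses. The names bilinear_warp and alpha are hypothetical.

```python
import math


def bilinear_warp(omega: float, alpha: float) -> float:
    """Warp a normalized angular frequency omega (radians, 0..pi).

    alpha is the single warping parameter (|alpha| < 1). This all-pass
    form is an illustrative assumption, not taken from the abstract.
    """
    return omega + 2.0 * math.atan2(
        alpha * math.sin(omega), 1.0 - alpha * math.cos(omega)
    )


# alpha = 0 leaves the axis unchanged; a positive alpha pushes
# mid-band frequencies upward, a negative alpha pushes them downward.
# The end points 0 and pi stay fixed for any valid alpha.
for alpha in (-0.1, 0.0, 0.1):
    warped = [bilinear_warp(w, alpha) for w in (0.0, math.pi / 2, math.pi)]
    print(alpha, [round(w, 3) for w in warped])
```

Because one scalar controls the whole warp, very little adaptation data is needed to estimate it. The same property limits how much speaker identity it can capture, which is the gap the abstract says the CSMAPLR combination addresses.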

Details

Title: Combining Vocal Tract Length Normalization with Hierarchical Linear Transformations

Author(s): Saheer, Lakshmi; Yamagishi, Junichi; Garner, Philip N.; Dines, John

Published in: IEEE Journal of Selected Topics in Signal Processing

Volume: 8

Issue: 2

Pages: 262-272

Date: 2014

ISSN: 1932-4553

Keywords: Constrained structural maximum a posteriori linear regression; hidden Markov models; speaker adaptation; statistical parametric speech synthesis; vocal tract length normalization

Note: Special Issue on Statistical Parametric Speech Synthesis

DOI: https://doi.org/10.1109/Jstsp.2013.2295554

Laboratories: LIDIAP

Record creation date: 2013-12-19
