Sparse and Low-rank Modeling for Automatic Speech Recognition

Dighe, Pranay

doi:10.5075/epfl-thesis-9035

Dighe, Pranay

2019

Download

Formats

Format
BibTeX
MARC
MARCXML
DublinCore
EndNote
NLM
RefWorks
RIS

Files

Abstract

This thesis deals with exploiting the low-dimensional multi-subspace structure of speech towards the goal of improving acoustic modeling for automatic speech recognition (ASR). Leveraging the parsimonious hierarchical nature of speech, we hypothesize that whenever a speech signal is measured in a high-dimensional feature space, the true class information is embedded in low-dimensional subspaces whereas noise is scattered as random high-dimensional erroneous estimations in the features. In this context, the contribution of this thesis is twofold: (i) identify sparse and low-rank modeling approaches as excellent tools for extracting the class-specific low-dimensional subspaces in speech features, and (ii) employ these tools under novel ASR frameworks to enrich the acoustic information present in the speech features towards the goal of improving ASR. Techniques developed in this thesis focus on deep neural network (DNN) based posterior features which, under the sparse and low-rank modeling approaches, unveil the underlying class-specific low-dimensional subspaces very elegantly. In this thesis, we tackle ASR tasks of varying difficulty, ranging from isolated word recognition (IWR) and connected digit recognition (CDR) to large-vocabulary continuous speech recognition (LVCSR). For IWR and CDR, we propose a novel \textit{Compressive Sensing} (CS) perspective towards ASR. Here exemplar-based speech recognition is posed as a problem of recovering sparse high-dimensional word representations from compressed low-dimensional phonetic representations. In the context of LVCSR, this thesis argues that albeit their power in representation learning, DNN based acoustic models still have room for improvement in exploiting the \textit{union of low-dimensional subspaces} structure of speech data. Therefore, this thesis proposes to enhance DNN posteriors by projecting them onto the manifolds of the underlying classes using principal component analysis (PCA) or compressive sensing based dictionaries. Projected posteriors are shown to be more accurate training targets for learning better acoustic models, resulting in improved ASR performance. The proposed approach is evaluated on both close-talk and far-field conditions, confirming the importance of sparse and low-rank modeling of speech in building a robust ASR framework. Finally, the conclusions of this thesis are further consolidated by an information theoretic analysis approach which explicitly quantifies the contribution of proposed techniques in improving ASR.

Details

Title Sparse and Low-rank Modeling for Automatic Speech Recognition

Author(s) Dighe, Pranay

Advisor(s)

Bourlard, Hervé

Pagination 158

Date 2019

Publisher Lausanne, EPFL

Keywords

automatic speech recognition; deep neural network; sparsity; dictionary learning; low-rank; principal component analysis; far-field speech; information theory

Language English

DOI https://doi.org/10.5075/epfl-thesis-9035

Laboratories LIDIAP

Record Appears in Scientific production and competences > STI - School of Engineering > IEM - Institut d'Electricité et de Microtechnique > LIDIAP - L'IDIAP Laboratory
Scientific production and competences > Euler Center for Signal Processing
Scientific production and competences > EPFL Theses
Work produced at EPFL
Published
Theses

Record creation date 2019-02-28

Actions

Preview

Select file: