000264196 001__ 264196
000264196 005__ 20190509131808.0
000264196 0247_ $$a10.5075/epfl-thesis-9035$$2doi
000264196 037__ $$aTHESIS
000264196 041__ $$aeng
000264196 088__ $$a9035
000264196 245__ $$aSparse and Low-rank Modeling for Automatic Speech Recognition
000264196 260__ $$aLausanne$$bEPFL$$c2019
000264196 269__ $$a2019
000264196 300__ $$a158
000264196 336__ $$aTheses
000264196 502__ $$aProf. Jean-Philippe Thiran (president); Prof. Hervé Bourlard (thesis director); Prof. Jean-Marc Vesin, Prof. Florian Metze, Dr Ralf Schlüter (examiners)
000264196 520__ $$aThis thesis exploits the low-dimensional multi-subspace structure of speech to improve acoustic modeling for automatic speech recognition (ASR). Leveraging the parsimonious hierarchical nature of speech, we hypothesize that when a speech signal is measured in a high-dimensional feature space, the true class information is embedded in low-dimensional subspaces, whereas noise is scattered across the features as random high-dimensional estimation errors. In this context, the contribution of this thesis is twofold: (i) identifying sparse and low-rank modeling approaches as effective tools for extracting the class-specific low-dimensional subspaces in speech features, and (ii) employing these tools in novel ASR frameworks to enrich the acoustic information present in the speech features and thereby improve ASR. The techniques developed in this thesis focus on deep neural network (DNN) based posterior features, which, under sparse and low-rank modeling, cleanly reveal the underlying class-specific low-dimensional subspaces.
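
As a minimal sketch of this low-rank view (an illustration under assumed details, not the thesis's exact algorithm): given a frames-by-classes matrix of DNN posterior probabilities, a truncated SVD keeps only the top principal directions and discards the scattered high-dimensional noise. The function name low_rank_project and the rank parameter are illustrative assumptions.

import numpy as np

def low_rank_project(posteriors, rank):
    # posteriors: (frames, classes) matrix of DNN posterior probabilities.
    # Keep only the top-`rank` principal directions (PCA via truncated SVD),
    # treating the discarded residual as scattered estimation noise.
    mean = posteriors.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(posteriors - mean, full_matrices=False)
    recon = (U[:, :rank] * s[:rank]) @ Vt[:rank, :] + mean
    # Re-impose the probability-simplex constraints frame by frame.
    recon = np.clip(recon, 1e-8, None)
    return recon / recon.sum(axis=1, keepdims=True)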

In this thesis, we tackle ASR tasks of varying difficulty, ranging from isolated word recognition (IWR) and connected digit recognition (CDR) to large-vocabulary continuous speech recognition (LVCSR). For IWR and CDR, we propose a novel compressive sensing (CS) perspective on ASR, in which exemplar-based speech recognition is posed as the problem of recovering sparse high-dimensional word representations from compressed low-dimensional phonetic representations. In the context of LVCSR, this thesis argues that, despite their power in representation learning, DNN-based acoustic models still have room for improvement in exploiting the union-of-low-dimensional-subspaces structure of speech data. Therefore, this thesis proposes to enhance DNN posteriors by projecting them onto the manifolds of the underlying classes using principal component analysis (PCA) or compressive-sensing-based dictionaries. Projected posteriors are shown to be more accurate training targets for learning better acoustic models, resulting in improved ASR performance. The proposed approach is evaluated under both close-talk and far-field conditions, confirming the importance of sparse and low-rank modeling of speech for building a robust ASR framework. Finally, the conclusions of this thesis are consolidated by an information-theoretic analysis that explicitly quantifies the contribution of the proposed techniques to improving ASR.
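
As a minimal sketch of the CS recovery step described above (assumed details, not the thesis's exact pipeline): the columns of a dictionary D hold exemplar phonetic representations, y is a compressed low-dimensional phonetic observation, and a greedy orthogonal matching pursuit recovers the sparse high-dimensional code. D, y, and n_nonzero are illustrative names.

import numpy as np

def omp(D, y, n_nonzero):
    # Recover a sparse code x with D @ x ~= y by greedily selecting the
    # dictionary atom most correlated with the current residual, then
    # refitting the coefficients of the selected atoms by least squares.
    support, residual = [], y.copy()
    x = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

Under the thesis's hypothesis, the support of the recovered x would indicate which word exemplars best explain the observed phonetic evidence.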
000264196 592__ $$b2019
000264196 6531_ $$aautomatic speech recognition
000264196 6531_ $$adeep neural network
000264196 6531_ $$asparsity
000264196 6531_ $$adictionary learning
000264196 6531_ $$alow-rank
000264196 6531_ $$aprincipal component analysis
000264196 6531_ $$afar-field speech
000264196 6531_ $$ainformation theory
000264196 700__ $$aDighe, Pranay$$g248345
000264196 720_2 $$aBourlard, Hervé$$edir.$$g117014
000264196 8564_ $$uhttps://infoscience.epfl.ch/record/264196/files/EPFL_TH9035.pdf$$s10620952
000264196 909C0 $$pLIDIAP
000264196 909CO $$pthesis$$pSTI$$pthesis-public$$pDOI$$ooai:infoscience.epfl.ch:264196$$qGLOBAL_SET
000264196 918__ $$aSTI$$cIEL$$dEDEE
000264196 919__ $$aLIDIAP
000264196 920__ $$a2019-03-08$$b2019
000264196 970__ $$a9035/THESES
000264196 973__ $$sPUBLISHED$$aEPFL
000264196 980__ $$aTHESIS