Training deep neural network-based Automatic Speech Recognition (ASR) models often requires thousands of hours of transcribed data, limiting their use to only a few languages. Moreover, current state-of-the-art acoustic models are based on the Transformer architecture, whose attention computation scales quadratically with sequence length, hindering its use for long sequences. This thesis aims to reduce (a) the data and (b) the compute requirements for developing state-of-the-art ASR systems with only a few hundred hours of transcribed data or less.
The first part of this thesis focuses on reducing the amount of transcribed data required to train these models. We propose an approach that uses dropout for uncertainty-aware semi-supervised learning and show that it generates better hypotheses for training with unlabelled data. We then investigate the out-of-domain and cross-lingual generalization of two popular self-supervised pre-training approaches: Masked Acoustic Modeling and wav2vec 2.0. We conclude that both pre-training approaches generalize to unseen domains and significantly outperform models trained only on supervised data.
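To make the uncertainty-aware idea concrete, the sketch below uses Monte Carlo dropout: an unlabelled utterance is decoded several times with dropout kept active, and the agreement between samples serves as a confidence score for keeping or discarding the pseudo-label. This is an illustrative approximation rather than the thesis implementation; `model` and `decode_fn` are hypothetical stand-ins for an ASR acoustic model and its decoder.

```python
# Illustrative Monte Carlo dropout sketch (not the thesis code): decode an
# unlabelled utterance several times with dropout active and use the
# agreement between samples as a confidence score for the pseudo-label.
import torch

def mc_dropout_pseudo_label(model, audio, decode_fn, n_samples=8):
    """audio: (1, time) waveform; decode_fn maps logits to a transcript."""
    model.train()                        # keep dropout stochastic at inference
    hypotheses = []
    with torch.no_grad():
        for _ in range(n_samples):
            logits = model(audio)        # (1, frames, vocab)
            hypotheses.append(decode_fn(logits))
    best = max(set(hypotheses), key=hypotheses.count)
    confidence = hypotheses.count(best) / n_samples
    return best, confidence              # train on the hypothesis only if confident
```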
In the second part, we focus on reducing the computational requirements of the Transformer model, (a) by devising efficient forms of attention computation and (b) by reducing the input context length for attention computation. We first present 'linear' attention, which uses a kernelized formulation of attention to express an autoregressive Transformer as a recurrent neural network and reduces the computational complexity from quadratic to linear in sequence length. We then present 'clustered' attention, which approximates self-attention by clustering the input sequence and using the centroids for attention computation. We show that clustered attention outperforms vanilla attention for a given computational budget.
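As an illustration of the kernelized formulation, the following minimal sketch (single head, non-causal, NumPy only) replaces the softmax with a feature map and reorders the matrix products so the cost is linear in sequence length; the feature map phi(x) = elu(x) + 1 and the shapes are assumptions for exposition, not the exact model configuration, and the autoregressive (RNN) view would replace the sums with running sums over positions.

```python
# Minimal sketch of kernelized 'linear' attention, assuming the feature
# map phi(x) = elu(x) + 1; not a drop-in replacement for the thesis models.
import numpy as np

def elu_plus_one(x):
    # phi(x) = elu(x) + 1 keeps query-key similarities positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Q, K: (n, d); V: (n, d_v). Same weighted average as softmax-free
    kernel attention, but computed as phi(Q) @ (phi(K).T @ V), which
    costs O(n) instead of the O(n^2) of (phi(Q) @ phi(K).T) @ V."""
    Qp, Kp = elu_plus_one(Q), elu_plus_one(K)
    kv = Kp.T @ V                    # (d, d_v): one pass over keys/values
    z = Qp @ Kp.sum(axis=0)          # (n,): per-query normalisation term
    return (Qp @ kv) / z[:, None]

n, d = 1000, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V)      # (1000, 64)
```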
For ASR, we find that linear attention degrades the word error rate, and clustering introduces overheads when working with shorter sequences. To address these limitations, we develop a method that stochastically downsamples the input using mean-pooling for efficient wav2vec 2.0 training. This enables using the same model at different compression factors during inference. We conclude that stochastic compression for wav2vec 2.0 pre-training enables building compute-efficient ASR models for languages with limited transcribed data.
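A minimal sketch of the stochastic compression idea follows, assuming mean-pooling over the time axis of the encoder features with a compression factor drawn at random for each training batch; the factor set and the placement in the network are illustrative assumptions, not the exact thesis recipe.

```python
# Illustrative sketch (assumed details): draw a compression factor per
# batch and mean-pool the features along time by that factor.
import random
import torch
import torch.nn.functional as F

def stochastic_mean_pool(features, factors=(1, 2, 4)):
    """features: (batch, time, dim) -> (batch, ceil(time/factor), dim)."""
    factor = random.choice(factors)      # resampled every training batch
    if factor == 1:
        return features
    x = features.transpose(1, 2)         # (batch, dim, time) for avg_pool1d
    x = F.avg_pool1d(x, kernel_size=factor, stride=factor, ceil_mode=True)
    return x.transpose(1, 2)
```

At inference time, the same trained model can then be run with any fixed factor from the set, trading recognition accuracy for compute.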