Adapting generic models to specific domains or tasks, a process termed adaptation, has long been of interest in speech and language processing, particularly when target data are insufficient for training bespoke models from scratch. The development and application of such generic models have been underpinned by the pre-training and fine-tuning paradigm, in which models are first trained on extensive datasets and subsequently refined on domain- or task-specific data. While recent large pre-trained models increasingly demonstrate in-context or zero-shot learning capabilities, adaptation remains crucial for significantly enhancing performance when more target data are available. Motivated primarily by the adaptation of text-to-speech (TTS) synthesis models, this thesis investigates a series of adaptation techniques, covering both TTS-specific methods and generic fine-tuning approaches, with particular emphasis on data efficiency, parameter efficiency, and generalizability.
The thesis begins by exploring the integration of diffusion models into adaptive TTS systems, motivated by the recent success of deep generative models in synthesizing realistic speech. Building on the Diffusion Transformer architecture, we utilize adaptive layer normalization to condition the diffusion network on text representations, which further enables parameter-efficient adaptation. Compared to convolutional counterparts, the proposed approach offers faster inference for general TTS tasks and outperforms transformer-based adaptive TTS models in terms of naturalness and speaker similarity under few-shot and few-parameter settings.
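The adaptive layer normalization described above can be illustrated with a minimal sketch: a conditioning vector (e.g. a text or speaker representation) is projected to a per-feature scale and shift that modulate the normalized hidden states, so only the small projection needs updating during adaptation. The function names and shapes here are illustrative, not the thesis's actual implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (no learned affine parameters here).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ada_layer_norm(x, cond, W, b):
    """Adaptive layer normalization (adaLN) conditioning, DiT-style.

    A linear projection of the conditioning vector `cond` yields a
    per-feature scale and shift that modulate the normalized states `x`.
    Because only `W` and `b` need to change to adapt the model, the
    scheme is parameter-efficient.
    """
    scale, shift = np.split(cond @ W + b, 2, axis=-1)
    return layer_norm(x) * (1.0 + scale) + shift
```

With a zero-initialized projection, the block reduces to plain layer normalization, a common initialization choice so that adaptation starts from the unconditioned model.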
The second part shifts from ad hoc adaptation to generic parameter-efficient fine-tuning (PEFT) for TTS systems, which increasingly rely on large pre-trained models with strong zero-shot capabilities. Although PEFT enables efficient adaptation, catastrophic forgetting remains an issue and degrades the base model's generalizability. To mitigate this, we apply Bayesian transfer learning techniques to regularize PEFT with low-rank adaptation (LoRA) and preserve pre-training knowledge, utilizing diagonal and Kronecker-factored Laplace approximations. Experiments on language modeling and TTS demonstrate that our methods overcome catastrophic forgetting without degrading fine-tuning performance, with the Kronecker-factored approximation yielding superior preservation of pre-training knowledge.
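The regularization idea can be sketched as follows for the diagonal case only (the Kronecker-factored variant is not reproduced here, and the helper names are hypothetical): the LoRA update is penalized by a quadratic term whose weights come from a diagonal Laplace approximation of the pre-training posterior precision, so the fine-tuned model is discouraged from moving along high-curvature directions of the pre-training loss.

```python
import numpy as np

def lora_delta(A, B):
    # LoRA re-parameterizes the weight update as a low-rank product B @ A,
    # with A of shape (r, d_in) and B of shape (d_out, r).
    return B @ A

def diag_laplace_penalty(delta_w, precision):
    """Quadratic penalty from a diagonal Laplace approximation.

    `precision` approximates the curvature (e.g. a diagonal Fisher) of the
    pre-training loss at the pre-trained weights; large entries mark
    directions the update `delta_w` should avoid, which is how the
    regularizer preserves pre-training knowledge.
    """
    return 0.5 * np.sum(precision * delta_w ** 2)
```

In training, this penalty would be added to the task loss; a zero precision recovers unregularized LoRA fine-tuning.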
Continuing the exploration of Bayesian learning theory from the previous part, the final part of this thesis investigates the applications of variational inference to PEFT. Unlike Laplace approximation, variational inference frames posterior estimation as an online optimization problem, allowing for more flexible and expressive distributions. We first assess its effectiveness in improving predictive accuracy and calibration relative to Laplace-based methods. We then leverage its online posterior estimates to identify and prune redundant LoRA components, enabling automatic, layer-wise allocation of the parameter budget.
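The pruning step above can be illustrated with a minimal sketch under one common criterion, the posterior signal-to-noise ratio (SNR) of each rank-one LoRA component; the threshold, function names, and mean-field Gaussian assumption are illustrative rather than the thesis's exact procedure.

```python
import numpy as np

def component_snr(mean, std):
    # Signal-to-noise ratio of each rank-one LoRA component's scaling
    # under a mean-field Gaussian variational posterior.
    return np.abs(mean) / std

def prune_components(A, B, mean, std, threshold=1.0):
    """Drop rank-one components whose posterior SNR falls below `threshold`.

    Components the variational posterior is uncertain about (low SNR)
    contribute little to the adapted weights B @ A; removing them per layer
    reallocates the parameter budget across layers automatically.
    """
    keep = component_snr(mean, std) >= threshold
    return A[keep, :], B[:, keep]
```

Applying different thresholds per layer, or a single global one, yields the layer-wise rank allocation: layers whose components all have high SNR retain their full rank, while uncertain layers shrink.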
In summary, the thesis contributes to the advancement of adaptive TTS systems and offers Bayesian perspectives on enhancing generic adaptation techniques with respect to generalizability and efficiency. In particular, it provides a principled investigation of posterior estimation for adapted parameters using both Laplace approximation and variational inference, highlighting the advantages of Bayesian learning in fine-tuning.