Abstract

In this dissertation, we propose several methods to improve transfer learning for pretrained language models (PLMs). Transfer learning is a powerful technique in natural language processing, in which a language model is first pretrained on a data-rich task and then fine-tuned on a downstream task.

Our first contribution is two learning strategies for training neural models that are more robust to dataset biases and transfer better to out-of-domain datasets. We specify the biases in terms of bias-only models, which learn to exploit the dataset biases. The bias-only models' predictions are then used to adjust the loss of the base model, reducing its reliance on biases by down-weighting the biased examples and focusing training on the hard examples.

Our second contribution is an effective regularization method that reduces overfitting when fine-tuning PLMs on low-resource tasks. We leverage the Variational Information Bottleneck to suppress irrelevant features, and show that our method effectively reduces overfitting, finds sentence representations that are more robust to biases, and substantially improves generalization to out-of-domain datasets.

Our third contribution is an effective and parameter-efficient way to fine-tune PLMs in a multi-task learning setup while allowing generalization to new domains. The method shares information across tasks to enable positive transfer to low-resource and related tasks while avoiding negative task interference. It employs a compact hypernetwork, shared across tasks and layers, that learns to generate task- and layer-specific adapter parameters: the hypernetwork shares knowledge across tasks, while the generated task-specific adapters let the model adapt to each individual task.

Our fourth contribution is Compacter, a method for fine-tuning large-scale language models with a better trade-off between task performance and the number of trainable parameters than prior work. Compacter builds on ideas from adapters, low-rank optimization, and parameterized hypercomplex multiplication layers. Specifically, it inserts task-specific weight matrices into a PLM's weights, computed efficiently as a sum of Kronecker products between shared "slow" weights and rank-one, layer-specific "fast" matrices. By training only 0.047% of a PLM's parameters, Compacter performs on par with standard fine-tuning and outperforms it in low-resource settings.

Our final contribution is Perfect, a simple and efficient method for few-shot fine-tuning of PLMs that relies on no handcrafting and is highly effective with as few as 32 data points. This contrasts with prior methods, which require carefully engineered prompts and verbalizers to convert examples into a cloze format that the PLM can score. Perfect makes two key design choices. First, we show that manually engineered task prompts can be replaced with task-specific adapters that enable sample-efficient fine-tuning and reduce memory and storage costs by roughly factors of 5 and 100, respectively. Second, instead of using handcrafted verbalizers, we learn a new multi-token label embedding during fine-tuning that is not tied to the model vocabulary and avoids complex autoregressive decoding. Perfect enables nearly 100x faster training and inference and outperforms existing state-of-the-art few-shot learning methods.
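
To make the first contribution's loss adjustment concrete, the sketch below combines a base model and a bias-only model in a product-of-experts style, so that examples the bias-only model already handles contribute smaller gradients to the base model. The function name, tensor shapes, and random logits are illustrative assumptions, not the dissertation's exact implementation.

    import torch
    import torch.nn.functional as F

    def product_of_experts_loss(main_logits, bias_logits, labels):
        # Combine the two experts in log space; the bias-only expert is detached
        # so that only the base model receives gradients.
        combined = F.log_softmax(main_logits, dim=-1) + \
                   F.log_softmax(bias_logits, dim=-1).detach()
        # Cross-entropy on the renormalized combined prediction: examples the
        # bias-only model classifies confidently yield small gradients for the
        # base model, effectively down-weighting biased examples.
        return F.nll_loss(F.log_softmax(combined, dim=-1), labels)

    # Toy usage with random tensors standing in for the two models' outputs.
    main_logits = torch.randn(8, 3, requires_grad=True)  # base model over the full input
    bias_logits = torch.randn(8, 3)                      # bias-only model, e.g. hypothesis-only
    labels = torch.randint(0, 3, (8,))
    product_of_experts_loss(main_logits, bias_logits, labels).backward()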
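
For the second contribution, the following is a minimal sketch of a variational information bottleneck head placed on top of an encoder's sentence embedding: the representation is compressed into a low-dimensional Gaussian latent, and a KL term penalizes information irrelevant to the task. The class name, bottleneck size, and beta value are assumptions chosen for illustration only.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VIBHead(nn.Module):
        def __init__(self, hidden=768, bottleneck=64, num_labels=3, beta=1e-3):
            super().__init__()
            self.mu = nn.Linear(hidden, bottleneck)
            self.logvar = nn.Linear(hidden, bottleneck)
            self.classifier = nn.Linear(bottleneck, num_labels)
            self.beta = beta

        def forward(self, sentence_emb, labels):
            mu, logvar = self.mu(sentence_emb), self.logvar(sentence_emb)
            # Reparameterization trick: sample z ~ N(mu, sigma^2).
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            task_loss = F.cross_entropy(self.classifier(z), labels)
            # KL(q(z|x) || N(0, I)) suppresses features irrelevant to the task.
            kl = -0.5 * torch.mean(
                torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
            return task_loss + self.beta * kl

    head = VIBHead()
    loss = head(torch.randn(4, 768), torch.randint(0, 3, (4,)))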
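
For the third contribution, this sketch shows the general shape of a hypernetwork that maps learned task and layer embeddings to the weights of a bottleneck adapter, so a single set of generator parameters serves every task and layer. All names and sizes are illustrative, and the full method involves details not shown here.

    import torch
    import torch.nn as nn

    class AdapterHyperNet(nn.Module):
        def __init__(self, num_tasks, num_layers, emb_dim=64, hidden=768, bottleneck=32):
            super().__init__()
            self.task_emb = nn.Embedding(num_tasks, emb_dim)
            self.layer_emb = nn.Embedding(num_layers, emb_dim)
            self.hidden, self.bottleneck = hidden, bottleneck
            # One shared generator produces the down- and up-projection weights.
            self.to_down = nn.Linear(2 * emb_dim, hidden * bottleneck)
            self.to_up = nn.Linear(2 * emb_dim, bottleneck * hidden)

        def forward(self, task_id, layer_id, hidden_states):
            emb = torch.cat([self.task_emb(task_id), self.layer_emb(layer_id)], dim=-1)
            down = self.to_down(emb).view(self.hidden, self.bottleneck)
            up = self.to_up(emb).view(self.bottleneck, self.hidden)
            # Generated bottleneck adapter with a residual connection.
            return hidden_states + torch.relu(hidden_states @ down) @ up

    # The same hypernetwork generates adapters for every (task, layer) pair.
    hypernet = AdapterHyperNet(num_tasks=4, num_layers=12)
    h = torch.randn(2, 16, 768)  # (batch, sequence, hidden)
    out = hypernet(torch.tensor(0), torch.tensor(5), h)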
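
For the fourth contribution, the sketch below parameterizes one adapter projection as a sum of Kronecker products between small shared "slow" matrices and rank-one layer-specific "fast" factors, which is the core computation described above. Dimensions, initialization, and the class name are illustrative assumptions.

    import torch
    import torch.nn as nn

    class PHMDownProjection(nn.Module):
        def __init__(self, in_dim=768, out_dim=32, n=4):
            super().__init__()
            assert in_dim % n == 0 and out_dim % n == 0
            self.n = n
            # "Slow" weights: n small n x n matrices, meant to be shared across layers.
            self.A = nn.Parameter(torch.randn(n, n, n) * 0.01)
            # "Fast" rank-one factors, specific to this layer.
            self.s = nn.Parameter(torch.randn(n, in_dim // n, 1) * 0.01)
            self.t = nn.Parameter(torch.randn(n, 1, out_dim // n) * 0.01)

        def forward(self, x):
            # B_i = s_i t_i^T is rank one; W = sum_i A_i kron B_i.
            B = self.s @ self.t  # (n, in_dim/n, out_dim/n)
            W = sum(torch.kron(self.A[i], B[i]) for i in range(self.n))
            return x @ W

    down = PHMDownProjection()
    x = torch.randn(2, 16, 768)
    print(down(x).shape)  # torch.Size([2, 16, 32])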
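
For the final contribution, this sketch shows one plausible form of a learned multi-token label embedding that is not tied to the model vocabulary: each label gets one trainable vector per mask position, and labels are scored against the encoder's hidden states at those positions. The distance-based scoring and all sizes are assumptions made for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LabelEmbeddingHead(nn.Module):
        def __init__(self, num_labels=3, num_mask_tokens=2, hidden=768):
            super().__init__()
            # One trainable embedding per (label, mask position), untied from the vocabulary.
            self.label_emb = nn.Parameter(
                torch.randn(num_labels, num_mask_tokens, hidden) * 0.02)

        def forward(self, mask_hidden_states, labels=None):
            # mask_hidden_states: (batch, num_mask_tokens, hidden) from the encoder.
            # Score each label by negative squared distance summed over mask positions.
            diff = mask_hidden_states.unsqueeze(1) - self.label_emb.unsqueeze(0)
            logits = -diff.pow(2).sum(dim=(-1, -2))  # (batch, num_labels)
            if labels is None:
                return logits
            return F.cross_entropy(logits, labels)

    head = LabelEmbeddingHead()
    loss = head(torch.randn(4, 2, 768), labels=torch.randint(0, 3, (4,)))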
