Enhanced Architectures and Optimization Methods for Efficient Language Modeling
This thesis investigates methods for building more capable models more efficiently, focusing on two aspects: improved architectures and optimization. We examine principled architectural modifications that reduce computational costs or introduce features enabling more efficient use of the model. Furthermore, we study existing optimization methods to deepen our theoretical understanding of neural network optimization and to align it more closely with practice, enabling more informed decisions when designing better optimizers in the future.
In the first part of the thesis, we propose three enhancements to Transformer models that address key challenges in processing long sequences, improving data efficiency, and reducing inference costs. First, we introduce Landmark Attention to improve the efficiency of processing long sequences, reducing the inference cost by a large constant factor (50x in our experiments). By introducing hierarchy into the attention mechanism, Landmark Attention enables processing inputs of any length at inference time, irrespective of the training sequence length. Next, we propose the DenseFormer architecture, which allows later layers to access the outputs of earlier layers. As a result of the increased information flow, DenseFormer matches the perplexity of much deeper Transformer models while outperforming baselines in memory efficiency and inference time. Our experiments reveal unexpectedly coherent patterns of information flow, showing structured reuse of activations from distant layers. Finally, we address inference efficiency with CoTFormer, inspired by the emergent ability of large language models to reason step by step. CoTFormer achieves the accuracy of a deeper model by repeatedly applying a shallower one. While this repetition introduces additional compute, it allows the inference cost to be adjusted flexibly on a per-token basis. Our results demonstrate significant reductions in computation cost without loss of accuracy when training an adaptive CoTFormer, which automatically allocates compute to the tokens that need it most.
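To give a concrete picture of the DenseFormer idea, the sketch below wires each Transformer block's input to a learned weighted average of the token embeddings and all earlier block outputs, initialized so that training starts from a plain Transformer. This is a minimal PyTorch sketch for intuition only, not the thesis implementation: the class and argument names (DWABlockStack, n_blocks, n_heads) are ours, and details such as normalization and the exact averaging scheme may differ from the actual DenseFormer.

import torch
import torch.nn as nn

class DWABlockStack(nn.Module):
    """Stack of Transformer blocks where each block reads a learned weighted
    average of the embeddings and all earlier block outputs (DenseFormer-style)."""

    def __init__(self, n_blocks: int, d_model: int, n_heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_blocks)]
        )
        # One weight per (block, earlier representation); initialized so that all
        # weight sits on the current block's output, recovering a plain Transformer.
        self.dwa_weights = nn.ParameterList(
            [nn.Parameter(torch.zeros(i + 1)) for i in range(1, n_blocks + 1)]
        )
        for w in self.dwa_weights:
            w.data[-1] = 1.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        history = [x]  # index 0 holds the token embeddings
        for block, w in zip(self.blocks, self.dwa_weights):
            history.append(block(history[-1]))
            # Weighted average over the embeddings and all outputs so far;
            # the result becomes the next block's input.
            history[-1] = sum(wi * hi for wi, hi in zip(w, history))
        return history[-1]

# Example usage: y = DWABlockStack(n_blocks=4, d_model=64)(torch.randn(2, 16, 64))

The averaging weights add only a handful of scalars per block, which is why the extra information flow costs little in parameters while letting distant layers reuse each other's activations.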
In the second part of the thesis, motivated by challenges encountered while designing optimization methods for the adaptive CoTFormer, we focus on improving our understanding of neural network optimization. We develop a theoretical framework to study how parameter perturbations and masked parameter updates affect convergence. Additionally, through experiments and theoretical analysis, we deepen our understanding of the widely observed phenomenon that a large step size is essential for obtaining superior models. In particular, we present a controlled setting in which the difference between small and large step sizes can be provably established.
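As a piece of standard intuition for why the step size matters (a toy computation added here for illustration; it is not the controlled setting constructed in the thesis), recall that gradient descent on a one-dimensional quadratic well with curvature c multiplies the distance to the minimum by (1 - eta*c) at every step, so a step size eta can only settle in minima whose curvature is below 2/eta. The sketch below runs this computation for a flat and a sharp well; all names are ours.

def gd_on_quadratic(curvature: float, eta: float, x0: float = 1.0, steps: int = 50) -> float:
    """Run gradient descent on f(x) = 0.5 * curvature * x**2 and return the final iterate."""
    x = x0
    for _ in range(steps):
        x -= eta * curvature * x  # gradient of f is curvature * x
    return x

for eta in (0.05, 0.15):  # small vs. large step size
    flat = gd_on_quadratic(curvature=1.0, eta=eta)    # flat minimum
    sharp = gd_on_quadratic(curvature=20.0, eta=eta)  # sharp minimum
    print(f"eta={eta}: flat well -> {flat:.2e}, sharp well -> {sharp:.2e}")

# The small step size shrinks the iterate in both wells, while the large step size
# shrinks it only in the flat well and blows up in the sharp one (|1 - 0.15 * 20| = 2 > 1).

This stability argument is one common explanation for why large step sizes steer training toward flatter solutions; the thesis goes beyond it by constructing a setting where the advantage of a large step size can be proven.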