Infoscience — EPFL thesis
doctoral thesis

Enhanced Architectures and Optimization Methods for Efficient Language Modeling

Mohtashami, Amirkeivan  
2025

This thesis investigates methods for building more capable models more efficiently, focusing on two aspects: improved architectures and optimization. We examine principled architectural modifications that reduce computational costs or introduce features enabling more efficient use of the model. Furthermore, we study existing optimization methods to deepen our theoretical understanding of neural network optimization and align it more closely with practice, allowing more informed decisions when designing better optimizers in the future.

In the first part of the thesis, we propose three enhancements to Transformer models to address key challenges in processing long sequences, improving data efficiency, and optimizing inference costs. First, we introduce Landmark Attention to improve the efficiency of processing long sequences, reducing the inference cost by a large constant factor (50x in our experiments). By introducing hierarchy into the attention mechanism, Landmark Attention enables processing of inputs of any length at inference, irrespective of training sequence length. Next, we propose the DenseFormer architecture, which allows future layers to access outputs of previous layers. As a result of the increased information flow, DenseFormer achieves the same perplexity as much deeper Transformer models while outperforming baselines in memory efficiency and inference time. Our experiments reveal unexpected coherent patterns of information flow, showing structured reuse of activations from distant layers. Finally, we address inference efficiency with CoTFormer, inspired by large language models' emerging ability to reason step by step. CoTFormer achieves the accuracy of a deeper model by repeatedly applying a shallower model. This approach introduces additional compute costs but allows for flexible adjustment of inference cost on a per-token basis. Our results demonstrate significant computation cost reductions without accuracy loss when training an adaptive CoTFormer, which automatically allocates compute resources to tokens that need them most.
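To illustrate the dense inter-layer connectivity described above, the following PyTorch sketch lets each block's output be replaced by a learned weighted average of all outputs produced so far, so later layers can reuse activations from distant layers. It is a minimal toy under assumed design choices (a feed-forward stand-in for a Transformer block, softmax-normalized scalar mixing weights called mix_weights); it is not the thesis's actual DenseFormer implementation.

    # Toy sketch of "later layers access outputs of earlier layers".
    # Block structure and mixing scheme are illustrative assumptions.
    import torch
    import torch.nn as nn

    class ToyBlock(nn.Module):
        """Stand-in for a Transformer block (attention omitted for brevity)."""
        def __init__(self, dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.ff(self.norm(x))

    class DenselyConnectedStack(nn.Module):
        """After each block, mix the outputs of all previous blocks (and the input)
        with learned scalar weights, increasing cross-layer information flow."""
        def __init__(self, dim: int, depth: int):
            super().__init__()
            self.blocks = nn.ModuleList(ToyBlock(dim) for _ in range(depth))
            # At depth i, the mix covers the input plus the first i+1 block outputs.
            self.mix_weights = nn.ParameterList(
                nn.Parameter(torch.zeros(i + 2)) for i in range(depth)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            outputs = [x]  # index 0 holds the embedded input
            for block, w in zip(self.blocks, self.mix_weights):
                outputs.append(block(outputs[-1]))
                coeffs = torch.softmax(w, dim=0)  # keep the average well scaled
                outputs[-1] = sum(c * o for c, o in zip(coeffs, outputs))
            return outputs[-1]

    model = DenselyConnectedStack(dim=64, depth=4)
    tokens = torch.randn(2, 16, 64)   # (batch, sequence, hidden dim)
    print(model(tokens).shape)        # torch.Size([2, 16, 64])

In this toy form, the only extra parameters are the scalar mixing weights, a negligible cost compared with adding further full layers.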

In the second part of the thesis, motivated by challenges observed in designing optimization methods for adaptive CoTFormer, we focus on improving our understanding of neural network optimization. We develop a theoretical framework to study the effects of parameter perturbation and masking parameter updates on convergence. Additionally, through experiments and theoretical studies, we enhance our understanding of the widely observed phenomenon that a large step size is essential for obtaining superior models. In particular, we present a controlled setting where the difference between small and large step sizes can be provably observed.
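To make one of these objects concrete, the sketch below masks parameter updates: at every step, plain SGD is applied only to a random subset of each parameter's coordinates. The optimizer choice, the uniform random mask, and the toy least-squares objective are assumptions made for illustration; the theoretical framework studied in the thesis is more general than this example.

    # Toy illustration of masked parameter updates with plain SGD.
    import torch

    def masked_sgd_step(params, lr=0.05, keep_prob=0.5):
        """Update only a random subset of each parameter's entries."""
        with torch.no_grad():
            for p in params:
                if p.grad is None:
                    continue
                mask = (torch.rand_like(p) < keep_prob).float()  # 1 = update, 0 = freeze
                p -= lr * mask * p.grad

    # Minimal usage: least-squares regression on random data.
    torch.manual_seed(0)
    w = torch.randn(10, requires_grad=True)
    x, y = torch.randn(100, 10), torch.randn(100)
    for _ in range(200):
        loss = ((x @ w - y) ** 2).mean()
        loss.backward()
        masked_sgd_step([w])
        w.grad = None  # reset gradients for the next step

Varying keep_prob and lr in this toy is a convenient way to observe how masking and step size interact, in the spirit of the questions studied in this part of the thesis.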

Files
  • Name: EPFL_TH10940.pdf
  • Type: Main Document
  • Version: http://purl.org/coar/version/c_be7fb7dd8ff6fe43
  • Access type: openaccess
  • License condition: N/A
  • Size: 3.64 MB
  • Format: Adobe PDF
  • Checksum (MD5): 688240bfe68327bb4cffcb89aade4ef6
