Infoscience (EPFL)

conference paper

Attention with Markov: A Curious Case of Single-layer Transformers

Makkuva, Ashok Vardhan • Bondaschi, Marco • Girish, Adway • et al.
January 22, 2025
Proceedings of the Thirteenth International Conference on Learning Representations (ICLR) 2025 [Forthcoming publication]
13th International Conference on Learning Representations (ICLR 2025)

Attention-based transformers have achieved tremendous success across a variety of disciplines including natural languages. To deepen our understanding of their sequential modeling capabilities, there is a growing interest in using Markov input processes to study them. A key finding is that when trained on first-order Markov chains, transformers with two or more layers consistently develop an induction head mechanism to estimate the in-context bigram conditional distribution. In contrast, single-layer transformers, unable to form an induction head, directly learn the Markov kernel but often face a surprising challenge: they become trapped in local minima representing the unigram distribution, whereas deeper models reliably converge to the ground-truth bigram. While single-layer transformers can theoretically model first-order Markov chains, their empirical failure to learn this simple kernel in practice remains a curious phenomenon. To explain this contrasting behavior of single-layer models, in this paper we introduce a new framework for a principled analysis of transformers via Markov chains. Leveraging our framework, we theoretically characterize the loss landscape of single-layer transformers and show the existence of global minima (bigram) and bad local minima (unigram) contingent on data properties and model architecture. We precisely delineate the regimes under which these local optima occur. Backed by experiments, we demonstrate that our theoretical findings are in congruence with the empirical results. Finally, we outline several open problems in this arena.
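
As an illustrative companion to the setup described in the abstract (this sketch is not part of the record; it assumes NumPy, a binary state space, and arbitrary switching probabilities p and q), the snippet below samples a first-order Markov chain and compares the next-symbol loss of the unigram marginal predictor with that of the true bigram kernel, i.e. the two solutions the abstract identifies as the bad local minimum and the global minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Binary first-order Markov kernel: P[i, j] = Pr(next = j | current = i).
# p and q are hypothetical switching probabilities, not values from the paper.
p, q = 0.2, 0.3
P = np.array([[1.0 - p, p],
              [q, 1.0 - q]])
pi = np.array([q, p]) / (p + q)          # stationary distribution of the chain

# Sample one long sequence, started from the stationary distribution.
n = 100_000
x = np.empty(n, dtype=int)
x[0] = rng.choice(2, p=pi)
for t in range(1, n):
    x[t] = rng.choice(2, p=P[x[t - 1]])

# Average next-symbol negative log-likelihood (in nats) for the two predictors
# the abstract contrasts: the unigram marginal (the bad local minimum) and the
# bigram conditional given the previous symbol (the global minimum).
unigram_loss = -np.log(pi[x[1:]]).mean()
bigram_loss = -np.log(P[x[:-1], x[1:]]).mean()

print(f"unigram (marginal) loss: {unigram_loss:.4f}")
print(f"bigram  (kernel)   loss: {bigram_loss:.4f}")
# Over a long sequence these approach the marginal entropy H(pi) and the
# conditional entropy H(X_t | X_{t-1}); the gap is the excess loss paid by a
# model stuck at the unigram solution instead of learning the Markov kernel.
```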

Files
Name: 9590_Attention_with_Markov_A_C.pdf
Type: Main Document
Access type: openaccess
License Condition: CC BY
Size: 1.37 MB
Format: Adobe PDF
Checksum (MD5): 855df496e7cb3a6feaf5cab4ec1e246c

Name: 9590_Attention_with_Markov_A_C_Supplementary Material.zip
Type: Supplementary Material/information
Access type: openaccess
License Condition: CC BY
Size: 11.01 MB
Format: ZIP
Checksum (MD5): 43e104de06924f56c8fb587aeefb1b8d
