Infoscience, EPFL (École polytechnique fédérale de Lausanne)
 
Doctoral thesis

Nonparametric Variational Information Bottleneck: Attention-based Architectures as Latent Variable Models

Fehr, Fabio James  
2025

Transformers have achieved remarkable success across modalities including text, graphs, speech, and vision, enabled by the attention mechanism. Yet the inductive biases that shape how attention encodes information and supports generalisation are still not well understood. Latent variable models offer a principled framework for explaining the encoded information, improving generalisation through regularisation, and enabling generative modelling. However, applying latent variable models to attention-based architectures is challenging, as attention functions over sets that are both variable in size and permutation-invariant. This thesis introduces the Nonparametric Variational Information Bottleneck (NVIB), a deep latent variable framework that models attention as posterior inference over a Dirichlet process mixture, aligning naturally with these set-based properties. We show that NVIB enables training a novel Transformer-based variational autoencoder from scratch, sparsifying the number of embeddings while regularising their content. As a generative model, it supports smooth interpolation and sampling within variable-sized latent spaces. When applied across stacked self-attention layers, NVIB induces hierarchical abstraction, improving interpretability, robustness, and linguistic alignment. This framework allows pretrained Transformers to be reinterpreted as nonparametric variational models. NVIB reveals how they encode and separate reliable from unreliable information, enabling a novel and controllable post-training regularisation that improves out-of-distribution generalisation. Finally, NVIB boosts out-of-distribution performance during fine-tuning on speech, text, graph, and vision benchmarks, confirming its effectiveness in inducing generalisable representations across diverse models and tasks. Overall, the thesis offers a variational Bayesian perspective on attention, unifying regularisation, explanation, and generation, and opening new paths for advancing representation learning.
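As background for the abstract above, the sketch below recalls two standard ingredients the thesis builds on: softmax attention written as a mixture over a set of value embeddings, and the classical (parametric) variational information bottleneck objective. The notation is illustrative only; the nonparametric formulation in the thesis replaces the fixed latent with a Dirichlet process mixture posterior over the variable-sized, permutation-invariant set of embeddings, and its exact objective is given in the thesis itself, not here.

\mathrm{Attn}\big(q; \{(k_i, v_i)\}_{i=1}^{n}\big) = \sum_{i=1}^{n} \pi_i(q)\, v_i,
\qquad
\pi_i(q) = \frac{\exp\!\big(q^\top k_i / \sqrt{d}\big)}{\sum_{j=1}^{n} \exp\!\big(q^\top k_j / \sqrt{d}\big)}

\mathcal{L}_{\mathrm{VIB}} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[-\log p_\theta(y \mid z)\right] + \beta\, \mathrm{KL}\!\left(q_\phi(z \mid x)\,\big\|\, r(z)\right)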

Files
Name: EPFL_TH11133.pdf
Type: Main Document
Version: Not Applicable (or Unknown)
Access type: Open access
License Condition: N/A
Size: 10.11 MB
Format: Adobe PDF
Checksum (MD5): a8aae961bf06194b743b61eb31643479

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved.