Infoscience

doctoral thesis

From Pose to Behavior: Advancing Multi-Individual Pose Estimation and Hierarchical Action Segmentation with Transformers

Stoffl, Lucas  
2025

Accurately measuring behavior is critical to understanding brain function. While traditional behavioral analysis has relied on human observation and manual processing, advances in machine learning and computer vision have begun to automate the field. This thesis addresses the challenge of advancing computer vision techniques to analyze behavior in more complex, multi-individual scenarios, and in the absence of annotated behavioral data. Behavioral analysis from video often begins with pose estimation. First, we present POET (POse Estimation Transformer), a transformer-based model that enables end-to-end training for multi-individual pose estimation through a set-based loss formulation and a transformer encoder-decoder architecture. As the first end-to-end approach of its kind, POET established a new class of single-stage models and inspired subsequent advancements in the field. Pose estimation in crowded scenarios with interacting individuals and occlusions is particularly challenging. To address this, we introduce BUCTD (Bottom-Up Conditioned Top-Down Pose Estimation), a novel two-stage pipeline that combines the strengths of bottom-up and top-down approaches. BUCTD achieves state-of-the-art performance on human and animal benchmarks that focus on crowded scenarios. Finally, we tackle the problem of decomposing behavior, as observed in pose trajectories, into its components across multiple spatio-temporal levels. We propose the novel task of hierarchical action segmentation and introduce two new benchmarks: Shot7M2, a synthetic 3D basketball dataset with three hierarchical levels, and hBABEL, an extension of an existing motion capture dataset that focuses on everyday-life behaviors. Additionally, we introduce h/BehaveMAE, a hierarchical masked autoencoder that learns multi-scale latent representations in a self-supervised manner by reconstructing masked input pose trajectories.
This method provides a computational framework to parse behavior into interpretable units that span multiple levels of abstraction, capturing both coarse- and fine-grained actions. Overall, this thesis advances methods for multi-individual pose estimation and unsupervised action segmentation, and hence the ability to measure and model complex behaviors across diverse contexts.
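
The abstract mentions a set-based loss for multi-individual pose estimation but does not spell out its form. As a rough illustration only, the sketch below shows one common way such a loss can be built: predictions and ground-truth poses are treated as unordered sets, the cheapest one-to-one assignment between them is found, and the loss is the mean cost of that assignment. This is not taken from the thesis. The function names, the L1 keypoint distance, and the brute-force search over assignments (workable only for a handful of individuals) are all assumptions made for the example.

```python
from itertools import permutations


def pose_distance(pred, target):
    """Mean L1 distance between two poses, each a list of (x, y) keypoints."""
    total = 0.0
    for (px, py), (tx, ty) in zip(pred, target):
        total += abs(px - tx) + abs(py - ty)
    return total / len(target)


def set_based_loss(pred_poses, target_poses):
    """Permutation-invariant loss between a set of predicted and target poses.

    Hypothetical helper: tries every one-to-one assignment of predictions to
    targets and keeps the cheapest. Assumes len(pred_poses) >= len(target_poses);
    surplus predictions are simply left unmatched in this sketch.
    Returns (mean matched cost, tuple where entry t is the index of the
    prediction assigned to target t).
    """
    n_targets = len(target_poses)
    if n_targets == 0:
        return 0.0, ()
    best_cost, best_assignment = float("inf"), None
    for perm in permutations(range(len(pred_poses)), n_targets):
        cost = sum(
            pose_distance(pred_poses[p], target_poses[t])
            for t, p in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return best_cost / n_targets, best_assignment


# Two ground-truth individuals, three predictions listed in a different order.
targets = [[(0, 0), (1, 1)], [(5, 5), (6, 6)]]
preds = [[(5, 5), (6, 7)], [(0, 0), (1, 1)], [(9, 9), (9, 9)]]

loss, assignment = set_based_loss(preds, targets)
print(loss, assignment)
```

Because the loss depends only on the best assignment, reordering the predictions leaves its value unchanged, which is the property that lets a model output a set of poses without a fixed ordering. Practical implementations typically replace the brute-force search with a polynomial-time assignment solver.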

Type
doctoral thesis
DOI
10.5075/epfl-thesis-11049
Author(s)
Stoffl, Lucas  

EPFL

Advisors
Mathis, Alexander  
Jury

Prof. Wulfram Gerstner (president); Prof. Alexander Mathis (thesis director); Prof. Pascal Fua, Prof. Thomas Brox, Prof. Gül Varol (examiners)

Date Issued

2025

Publisher

EPFL

Publisher place

Lausanne

Public defense date

2025-03-14

Thesis number

11049

Total pages

170

Subjects

  • computer vision
  • behavior analysis
  • pose estimation
  • multi-individual tasks
  • self-supervised learning
  • action segmentation
  • hierarchical representations

EPFL units
UPAMATHIS  
Faculty
SV  
School
BMI  
Doctoral School
EDNE  
Available on Infoscience
March 19, 2025
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/248053
Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved