Infoscience

Logo EPFL, École polytechnique fédérale de Lausanne

doctoral thesis

From Pose to Behavior: Advancing Multi-Individual Pose Estimation and Hierarchical Action Segmentation with Transformers

Stoffl, Lucas  
2025

Accurately measuring behavior is critical to understanding brain function. While traditional behavioral analysis has relied on human observation and manual processing, advances in machine learning and computer vision have begun to automate the field. This thesis addresses the challenge of advancing computer vision techniques to analyze behavior in more complex, multi-individual scenarios and in the absence of annotated behavioral data. Behavioral analysis from video often begins with pose estimation. First, we present POET (POse Estimation Transformer), a transformer-based model that enables end-to-end training for multi-individual pose estimation through a set-based loss formulation and a transformer encoder-decoder architecture. As the first end-to-end approach of its kind, POET established a new class of single-stage models and inspired subsequent advancements in the field. Pose estimation in crowded scenarios with interacting individuals and occlusions is particularly challenging. To address this, we introduce BUCTD (Bottom-Up Conditioned Top-Down Pose Estimation), a novel two-stage pipeline that combines the strengths of bottom-up and top-down approaches. BUCTD achieves state-of-the-art performance on human and animal benchmarks that focus on crowded scenarios. Finally, we tackle the problem of decomposing behavior from pose trajectories into its components across multiple spatio-temporal levels. We propose the novel task of hierarchical action segmentation and introduce two new benchmarks: Shot7M2, a synthetic 3D basketball dataset with three hierarchical levels, and hBABEL, an extension of an existing motion capture dataset that focuses on everyday-life behaviors. Additionally, we introduce h/BehaveMAE, a hierarchical masked autoencoder that learns multi-scale latent representations in a self-supervised manner by reconstructing masked input pose trajectories. This method provides a computational framework to parse behavior into interpretable units that span multiple levels of abstraction, capturing both coarse- and fine-grained actions. Overall, this thesis advances methods for multi-individual pose estimation and unsupervised action segmentation, and hence the ability to measure and model complex behaviors across diverse contexts.

Name: EPFL_TH11049.pdf
Type: Main Document
Version: Not Applicable (or Unknown)
Access type: openaccess
License Condition: N/A
Size: 75.94 MB
Format: Adobe PDF
Checksum (MD5): 5749516a8d1902ee05e6e160d39aefb8


Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved