From Pose to Behavior: Advancing Multi-Individual Pose Estimation and Hierarchical Action Segmentation with Transformers

Stoffl, Lucas

doi:10.5075/epfl-thesis-11049

doctoral thesis

2025

Accurately measuring behavior is critical to understanding brain function. While traditional behavioral analysis has relied on human observation and manual processing, advances in machine learning and computer vision have begun to automate the field. This thesis addresses the challenge of advancing computer vision techniques to analyze behavior in more complex, multi-individual scenarios, and in the absence of annotated behavioral data. Behavioral analysis from video often begins with pose estimation. First, we present POET (POse Estimation Transformer), a transformer-based model that enables end-to-end training for multi-individual pose estimation through a set-based loss formulation and a transformer encoder-decoder architecture. As the first end-to-end approach of its kind, POET established a new class of single-stage models and inspired subsequent advancements in the field. Pose estimation in crowded scenarios with interacting individuals and occlusions is particularly challenging. To address this, we introduce BUCTD (Bottom-Up Conditioned Top-Down Pose Estimation), a novel two-stage pipeline that combines the strengths of bottom-up and top-down approaches. BUCTD achieves state-of-the-art performance on human and animal benchmarks that focus on crowded scenarios. Finally, we tackle the problem of decomposing behavior from pose trajectories into its components across multiple spatio-temporal levels. We propose the novel task of hierarchical action segmentation and introduce two new benchmarks: Shot7M2, a synthetic 3D basketball dataset with three hierarchical levels, and hBABEL, an extension of an existing motion capture dataset that focuses on everyday-life behaviors. Additionally, we introduce h/BehaveMAE, a hierarchical masked autoencoder that learns multi-scale latent representations in a self-supervised manner by reconstructing masked input pose trajectories. This method provides a computational framework to parse behavior into interpretable units that span multiple levels of abstraction, capturing both coarse- and fine-grained actions. Overall, this thesis advances methods for multi-individual pose estimation and unsupervised action segmentation, and hence the ability to measure and model complex behaviors across diverse contexts.