Unified and Multimodal Learning for Gaze Prediction in Naturalistic Settings
Gaze is a powerful cue for understanding attention, intention, and social interaction. This thesis presents a comprehensive study of gaze prediction in naturalistic settings, with a focus on developing models, datasets, and evaluation protocols that go beyond spatial localization to capture the semantic and social dimensions of gaze behavior. We address key limitations in prior work and advance gaze prediction along several axes.
First, we introduce new datasets and annotations to support multimodal and multi-task learning. These include ChildPlay-audio, which augments child-adult interactions with speaking status; VSGaze, a unified benchmark with annotations for gaze following and social gaze tasks; and new semantic gaze annotations for the RLR-CHAT corpus to enable ego-exo gaze modeling. We also propose new evaluation protocols that extend beyond location-based metrics to assess semantic and socially grounded performance.
Second, we develop new architectures for gaze prediction. These include multimodal gaze following models that incorporate depth and pose; unified frameworks that jointly model gaze following and social gaze behaviors; and approaches to egocentric gaze estimation that leverage exocentric context. We further explore the use of foundation models and vision-language models to extract robust features for these tasks.
Finally, we demonstrate the feasibility of applying these models to child-adult interaction videos in the context of early language learning, where gaze plays a crucial role. Taken together, these contributions lay the groundwork for gaze models that are not only accurate but also semantically meaningful, capable of leveraging complementary contextual and task information, and applicable to real-world settings.