Infoscience

doctoral thesis

Unified and Multimodal Learning for Gaze Prediction in Naturalistic Settings

Gupta, Anshul  
2025

Gaze is a powerful cue for understanding attention, intention, and social interaction. This thesis presents a comprehensive study of gaze prediction in naturalistic settings, with a focus on developing models, datasets, and evaluation protocols that go beyond spatial localization to capture the semantic and social dimensions of gaze behavior. We address key limitations in prior work and advance gaze prediction along several axes.

First, we introduce new datasets and annotations to support multimodal and multi-task learning. These include ChildPlay-audio, which augments child-adult interactions with speaking status; VSGaze, a unified benchmark with annotations for gaze following and social gaze tasks; and new semantic gaze annotations for the RLR-CHAT corpus to enable ego-exo gaze modeling. We also propose new evaluation protocols that extend beyond location-based metrics to assess semantic and socially grounded performance.

Second, we develop new architectures for gaze prediction. These include multimodal gaze following models that incorporate depth and pose; unified frameworks that jointly model gaze following and social gaze behaviors; and approaches to egocentric gaze estimation that leverage exocentric context. We further explore the use of foundation models and vision-language models to extract robust features for these tasks.

Finally, we demonstrate the feasibility of applying these models to child-adult interaction videos in the context of early language learning, where gaze plays a crucial role. Taken together, these contributions lay the groundwork for gaze models that are not only accurate but also semantically meaningful, capable of leveraging complementary contextual and task information, and applicable to real-world settings.

Type
doctoral thesis
DOI
10.5075/epfl-thesis-11181
Author(s)
Gupta, Anshul  

EPFL

Advisors
Odobez, Jean-Marc  
Jury

Prof. Pascal Frossard (president); Dr Jean-Marc Odobez (thesis director); Dr Xi Wang, Dr Sean Andrist, Prof. Andreas Bulling (examiners)

Date Issued

2025

Publisher

EPFL

Publisher place

Lausanne

Public defense date

2025-08-20

Thesis number

11181

Total pages

176

Subjects

gaze following • social gaze prediction • egocentric gaze prediction • multimodal behaviour understanding

EPFL units
LIDIAP  
Faculty
STI  
School
IEL  
Doctoral School
EDEE  
Available on Infoscience
August 18, 2025
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/252915
Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved