conference paper not in proceedings

GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

Hassan, Mariam  
•
Stapf, Sebastian
•
Rahimi, Ahmad  
December 15, 2024
The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025

We present GEM, a Generalizable Ego-vision Multimodal world model that predicts future frames from a reference frame, sparse features, human poses, and ego-trajectories. These conditioning inputs give the model precise control over object dynamics, ego-agent motion, and human poses. GEM generates paired RGB and depth outputs for richer spatial understanding. We introduce autoregressive noise schedules to enable stable long-horizon generation. Our dataset comprises 4000+ hours of multimodal data across domains such as autonomous driving, egocentric human activities, and drone flights. Depth maps, ego-trajectories, and human poses are obtained through pseudo-labeling. We use a comprehensive evaluation framework, including a new Control of Object Manipulation (COM) metric, to assess controllability. Experiments show that GEM excels at generating diverse, controllable scenarios and at maintaining temporal consistency over long generations. Code, models, and datasets are fully open-sourced.

Type
conference paper not in proceedings
ArXiv ID

https://arxiv.org/abs/2412.11198v1

Author(s)
Hassan, Mariam  

EPFL

Stapf, Sebastian
Rahimi, Ahmad  

EPFL

Rezende, Pedro M. B.
Haghighi, Yasaman
Brüggemann, David

Swiss Data Science Center

Katircioglu, Isinsu  

EPFL

Zhang, Lin
Chen, Xiaoran
Saha, Suman
Date Issued

2024-12-15

Subjects

Computer Science - Computer Vision and Pattern Recognition

Written at

EPFL

EPFL units
VITA  
Event name

The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025

Event acronym

CVPR

Event place

Nashville, Tennessee, US

Event date

2025-06-11 - 2025-06-15

Available on Infoscience
May 5, 2025
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/249741