GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained
  Ego-Motion, Object Dynamics, and Scene Composition Control

Hassan, Mariam; Stapf, Sebastian; Rahimi, Ahmad; Rezende, Pedro M. B.; Haghighi, Yasaman; Brüggemann, David; Katircioglu, Isinsu; Zhang, Lin; Chen, Xiaoran; Saha, Suman; Cannici, Marco; Aljalbout, Elie; Ye, Botao; Wang, Xi; Davtyan, Aram; Salzmann, Mathieu; Scaramuzza, Davide; Pollefeys, Marc; Favaro, Paolo; Alahi, Alexandre

conference paper not in proceedings

December 15, 2024

The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025

We present GEM, a Generalizable Ego-vision Multimodal world model that predicts future frames from a reference frame, sparse features, human poses, and ego-trajectories. These conditioning signals give the model precise control over object dynamics, ego-agent motion, and human poses. GEM generates paired RGB and depth outputs for richer spatial understanding. We introduce autoregressive noise schedules to enable stable long-horizon generation. Our dataset comprises over 4000 hours of multimodal data across domains such as autonomous driving, egocentric human activities, and drone flights. Depth maps, ego-trajectories, and human poses are obtained through pseudo-labeling. We use a comprehensive evaluation framework, including a new Control of Object Manipulation (COM) metric, to assess controllability. Experiments show that GEM excels at generating diverse, controllable scenarios and maintains temporal consistency over long generations. Code, models, and datasets are fully open-sourced.

Type

conference paper not in proceedings

ArXiv ID

https://arxiv.org/abs/2412.11198v1

Author(s)

Hassan, Mariam

EPFL

Stapf, Sebastian

Rahimi, Ahmad

EPFL

Rezende, Pedro M. B.

Haghighi, Yasaman

Brüggemann, David

Swiss Data Science Center

Katircioglu, Isinsu

EPFL

Zhang, Lin

Chen, Xiaoran

Saha, Suman

Date Issued

2024-12-15

Subjects

Computer Science - Computer Vision and Pattern Recognition

Written at

EPFL

EPFL units

VITA

Event name

The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025

Event acronym

CVPR

Event place

Nashville, Tennessee, US

Event date

2025-06-11 - 2025-06-15

Available on Infoscience

May 5, 2025

Use this identifier to reference this record

https://infoscience.epfl.ch/handle/20.500.14299/249741