Infoscience
conference paper

4M: Massively Multimodal Masked Modeling

Mizrahi, David • Bachmann, Roman • Kar, Oguzhan Fatih • Yeo, Teresa • Gao, Mingfei • Dehghan, Afshin • Zamir, Amir
Oh, A • Neumann, T • Globerson, A • Saenko, K • Hardt, M • Levine, S
January 1, 2023
Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
37th Conference on Neural Information Processing Systems (NeurIPS)

Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of capabilities, hinting at the possibility of similarly versatile models in computer vision. In this paper, we take a step in this direction and propose a multimodal training scheme called 4M. It consists of training a single unified Transformer encoder-decoder using a masked modeling objective across a wide range of input/output modalities, including text, images, geometric and semantic modalities, and neural network feature maps. 4M achieves scalability by unifying the representation space of all modalities, mapping them into discrete tokens and performing multimodal masked modeling on a small randomized subset of tokens.

4M leads to models that exhibit several key capabilities: (1) they can perform a diverse set of vision tasks out of the box; (2) they excel when fine-tuned for unseen downstream tasks or new input modalities; and (3) they can function as a generative model that can be conditioned on arbitrary modalities, enabling a wide variety of expressive multimodal editing capabilities with remarkable flexibility.

Through experimental analyses, we demonstrate the potential of 4M for training versatile and scalable foundation models for vision tasks, setting the stage for further exploration in multimodal learning for vision and other domains.
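The recipe the abstract describes (map every modality to discrete tokens, then train on a small random subset of visible tokens while predicting a disjoint masked subset) can be sketched in a few lines of plain Python. The function names, token ids, and modality keys below are illustrative assumptions, not the authors' code; in the actual method the inputs feed a Transformer encoder and the targets are predicted by its decoder.

```python
import random

def tokenize_modalities(modalities):
    """Flatten per-modality discrete token lists into tagged tokens.

    `modalities` maps a modality name to an already-quantized list of
    token ids (e.g. from a VQ image tokenizer or a text tokenizer).
    Each output entry is (modality, position, token_id).
    """
    return [(name, pos, tok)
            for name, toks in modalities.items()
            for pos, tok in enumerate(toks)]

def sample_masked_modeling_pair(tokens, n_inputs, n_targets, rng=random):
    """Sample a small random subset of tokens as visible inputs and a
    disjoint random subset as prediction targets (the masked tokens)."""
    shuffled = tokens[:]
    rng.shuffle(shuffled)
    inputs = shuffled[:n_inputs]
    targets = shuffled[n_inputs:n_inputs + n_targets]
    return inputs, targets

# Toy example: two modalities, already tokenized to discrete ids
# (the ids and modality names here are made up for illustration).
modalities = {
    "rgb":     [101, 102, 103, 104],   # e.g. VQ image tokens
    "caption": [7, 8, 9],              # e.g. text tokens
}
tokens = tokenize_modalities(modalities)
inputs, targets = sample_masked_modeling_pair(tokens, n_inputs=3, n_targets=2)
# `inputs` would be fed to the encoder; the decoder learns to
# predict the token ids in `targets`.
```

Sampling both inputs and targets as small subsets, rather than encoding all tokens of all modalities, is what keeps the sequence length (and thus training cost) bounded as more modalities are added.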

Details
Type
conference paper
Web of Science ID

WOS:001202273400009

Author(s)
Mizrahi, David  
Bachmann, Roman  
Kar, Oguzhan Fatih  
Yeo, Teresa  
Gao, Mingfei
Dehghan, Afshin
Zamir, Amir  
Editors
Oh, A
•
Neumann, T
•
Globerson, A
•
Saenko, K
•
Hardt, M
•
Levine, S
Date Issued

2023-01-01

Publisher

Neural Information Processing Systems (NeurIPS)

Publisher place

La Jolla

Published in
Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
ISBN of the book

Subjects

Technology

Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
VILAB  
Event name
37th Conference on Neural Information Processing Systems (NeurIPS)

Event place
New Orleans, LA

Event date
Dec 10-16, 2023

Available on Infoscience
June 19, 2024
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/208581
  • Contact
  • infoscience@epfl.ch


Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved.