doctoral thesis

Scaling the Modalities in Multimodal Foundation Models

Kar, Oguzhan Fatih  
2025

Having a single neural network to handle a wide and varied range of tasks and modalities has been a long-standing goal. Such a model brings notable advantages, such as test-time computational efficiency, modality fusion, and reduced model size. Our goal in this thesis is to make progress towards building unified multimodal foundation models that can process diverse inputs such as images, text, 3D, semantics, and other sensory data to solve a wide variety of real-world tasks including scene understanding, generation, and retrieval. Our approach addresses three core challenges: 1) obtaining diverse and high-quality training data, 2) building a scalable training framework, and 3) evaluation and benchmarking.

The first challenge we address is the scarcity of labeled data for multimodal training. As a remedy, one can use pseudolabels obtained from existing neural networks as a scalable way to generate data for different modalities. However, this approach is hampered by the brittleness of these models in the real world. To tackle this, in the first part of the thesis, we build robustness mechanisms to develop strong pseudolabeling networks and leverage off-the-shelf pretrained models. These mechanisms aim to handle real-world distribution shifts through 1) realistic data augmentations (3D Common Corruptions), 2) consistency constraints (Cross-Task Consistency), 3) diverse ensembling using self-supervised domains (Cross-Domain Ensembles) and pretrained vision backbones (BRAVE), and 4) test-time adaptation via error feedback (Rapid Network Adaptation).
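
To make the ensembling idea concrete, the sketch below shows one way to pseudolabel unlabeled images with an ensemble of pretrained models and keep only high-agreement predictions. It is a minimal illustration under assumed placeholder models and an arbitrary confidence threshold, not the Cross-Domain Ensembles, BRAVE, or Rapid Network Adaptation implementations from the thesis.

```python
# Minimal sketch: pseudolabel an unlabeled batch with an ensemble of models,
# keeping only confident (high-agreement) predictions. The random "models"
# and the threshold below are illustrative placeholders, not the thesis code.
import torch
import torch.nn.functional as F

def ensemble_pseudolabels(models, images, threshold=0.2):
    """Average softmax outputs of several models and keep confident labels."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(images), dim=-1) for m in models]).mean(0)
    confidence, labels = probs.max(dim=-1)
    keep = confidence >= threshold          # discard ambiguous samples
    return images[keep], labels[keep]

# Toy usage: random linear classifiers stand in for pretrained vision backbones.
models = [torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(3 * 32 * 32, 10)) for _ in range(3)]
images = torch.randn(16, 3, 32, 32)
pseudo_images, pseudo_labels = ensemble_pseudolabels(models, images)
```

Filtering by ensemble confidence is one simple way to keep pseudolabel noise from propagating into downstream multimodal training.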

Building on this, in the second part of the thesis, we integrate the data obtained from the resulting pseudolabelers and strong vision encoders into a unified training framework (4M). Using a multimodal training objective based on masked modeling and an "any-to-any" model architecture, we scale the training to tens of tasks and modalities and billions of model parameters. This approach, named 4M-21, enables diverse capabilities, including strong out-of-the-box vision performance, any-conditional & steerable generation, cross-modal retrieval and multi-sensory fusion, all in a single model.
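
The masked multimodal objective can be illustrated with a small sketch: token sequences from several modalities are concatenated, a random subset is replaced with a mask token, and the model is trained to reconstruct the masked tokens. The tiny transformer, vocabulary size, and masking ratio below are illustrative assumptions, not the 4M / 4M-21 architecture or its tokenizers.

```python
# Minimal sketch of a masked multimodal modeling objective in the spirit of 4M.
import torch
import torch.nn as nn

class TinyMaskedMultimodalModel(nn.Module):
    def __init__(self, vocab_size=1024, d_model=128, n_layers=2):
        super().__init__()
        self.mask_id = vocab_size                            # extra [MASK] token
        self.embed = nn.Embedding(vocab_size + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

def masked_modeling_loss(model, tokens, mask_ratio=0.5):
    """Mask a random subset of tokens and compute cross-entropy on them only."""
    mask = torch.rand(tokens.shape) < mask_ratio
    corrupted = tokens.masked_fill(mask, model.mask_id)
    logits = model(corrupted)
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

# Toy usage: "image" and "text" token sequences concatenated into one stream.
image_tokens = torch.randint(0, 1024, (2, 64))
text_tokens = torch.randint(0, 1024, (2, 16))
tokens = torch.cat([image_tokens, text_tokens], dim=1)
loss = masked_modeling_loss(TinyMaskedMultimodalModel(), tokens)
loss.backward()
```

Because any subset of modality tokens can serve as input and any other subset as target, the same objective supports the "any-to-any" prediction pattern described above.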

Finally, we analyze the capabilities of the resulting model both qualitatively and quantitatively on a broad range of tasks, datasets, and benchmarks. Our evaluations also include a "status check" of the leading closed-weight multimodal foundation models (e.g., GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet) on several classical computer vision tasks (e.g., semantic segmentation, object detection, depth estimation) by developing prompt chaining techniques, enabling a direct comparison with specialist vision models. We find that while these models are respectable generalists, they are far from the state-of-the-art in all tasks, suggesting plenty of room for improvement in model development.
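
As a rough illustration of prompt chaining, the sketch below decomposes object detection into a chain of two prompts: the model first enumerates objects, then localizes each one with a normalized bounding box. `call_model` is a hypothetical placeholder for a real multimodal API client, and the prompts and JSON parsing are illustrative, not the exact chains developed in the thesis.

```python
# Minimal sketch of prompt chaining for object detection with a closed-weight
# multimodal model; `call_model` is a hypothetical stand-in for an API client.
import json

def call_model(prompt: str, image_path: str) -> str:
    """Hypothetical wrapper around a multimodal chat-completion API."""
    raise NotImplementedError("plug in a real API client here")

def detect_objects(image_path: str) -> dict:
    # Step 1: ask the model to enumerate objects, constrained to a JSON list.
    names = json.loads(call_model(
        "List every distinct object in this image as a JSON array of strings.",
        image_path))
    # Step 2: for each object, ask for a bounding box in normalized coordinates.
    boxes = {}
    for name in names:
        boxes[name] = json.loads(call_model(
            f"Return the bounding box of the {name} as JSON "
            "[x_min, y_min, x_max, y_max] with values in [0, 1].",
            image_path))
    return boxes
```

Constraining each step to structured output is what makes the free-form model's answers directly comparable against specialist detectors on standard benchmarks.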

Type
doctoral thesis
DOI
10.5075/epfl-thesis-10572
Author(s)
Kar, Oguzhan Fatih  
Advisors
Roshan Zamir, Amir  
Jury
Prof. Lenka Zdeborová (president); Prof. Amir Roshan Zamir (thesis director); Prof. Caglar Gulcehre, Prof. Saining Xie, Dr Josh Susskind (reviewers)
Date Issued
2025
Publisher
EPFL
Publisher place
Lausanne
Public defense date
2025-05-12
Thesis number
10572
Number of pages
199

Subjects
computer vision • deep learning • multimodality • foundation models • masked modeling • tokenization • robustness • distribution shifts • data augmentation • ensembling

EPFL units
VILAB  
Faculty
IC  
Doctoral School
EDIC  
Available on Infoscience
May 5, 2025
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/249760