doctoral thesis

Scaling the Modalities in Multimodal Foundation Models

Kar, Oguzhan Fatih  
2025

Having a single neural network to handle a wide and varied range of tasks and modalities has been a long-standing goal. Such a model brings notable advantages, such as test-time computational efficiency, modality fusion, and reduced model size. Our goal in this thesis is to make progress towards building unified multimodal foundation models that can process diverse inputs such as images, text, 3D, semantics, and other sensory data to solve a wide variety of real-world tasks including scene understanding, generation, and retrieval. Our approach addresses three core challenges: 1) obtaining diverse and high-quality training data, 2) building a scalable training framework, and 3) evaluation and benchmarking.

The first challenge we address is the scarcity of labeled data for multimodal training. As a remedy, one can use pseudolabels obtained from existing neural networks as a scalable way to generate data for different modalities. However, this approach is hampered by the brittleness of these models in the real world. To tackle this, in the first part of the thesis, we build robustness mechanisms to develop strong pseudolabeling networks and leverage off-the-shelf pretrained models. These mechanisms aim to handle real-world distribution shifts through 1) realistic data augmentations (3D Common Corruptions), 2) consistency constraints (Cross-Task Consistency), 3) diverse ensembling using self-supervised domains (Cross-Domain Ensembles) and pretrained vision backbones (BRAVE), and 4) test-time adaptation via error feedback (Rapid Network Adaptation).
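
To make the ensembling idea concrete, the sketch below shows one way to pseudolabel unlabeled images with an ensemble of pretrained models and keep only high-agreement predictions. It is a minimal illustration under assumed placeholder models and an arbitrary confidence threshold, not the Cross-Domain Ensembles, BRAVE, or Rapid Network Adaptation implementations from the thesis.

```python
# Minimal sketch: pseudolabel an unlabeled batch with an ensemble of models,
# keeping only confident (high-agreement) predictions. The random "models"
# and the threshold below are illustrative placeholders, not the thesis code.
import torch
import torch.nn.functional as F

def ensemble_pseudolabels(models, images, threshold=0.2):
    """Average softmax outputs of several models and keep confident labels."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(images), dim=-1) for m in models]).mean(0)
    confidence, labels = probs.max(dim=-1)
    keep = confidence >= threshold          # discard ambiguous samples
    return images[keep], labels[keep]

# Toy usage: random linear classifiers stand in for pretrained vision backbones.
models = [torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(3 * 32 * 32, 10)) for _ in range(3)]
images = torch.randn(16, 3, 32, 32)
pseudo_images, pseudo_labels = ensemble_pseudolabels(models, images)
```

Filtering by ensemble confidence is one simple way to keep pseudolabel noise from propagating into downstream multimodal training.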

Building on this, in the second part of the thesis, we integrate the data obtained from the resulting pseudolabelers and strong vision encoders into a unified training framework (4M). Using a multimodal training objective based on masked modeling and an "any-to-any" model architecture, we scale the training to tens of tasks and modalities and billions of model parameters. This approach, named 4M-21, enables diverse capabilities, including strong out-of-the-box vision performance, any-conditional & steerable generation, cross-modal retrieval and multi-sensory fusion, all in a single model.
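
The masked multimodal objective can be illustrated with a small sketch: token sequences from several modalities are concatenated, a random subset is replaced with a mask token, and the model is trained to reconstruct the masked tokens. The tiny transformer, vocabulary size, and masking ratio below are illustrative assumptions, not the 4M / 4M-21 architecture or its tokenizers.

```python
# Minimal sketch of a masked multimodal modeling objective in the spirit of 4M.
import torch
import torch.nn as nn

class TinyMaskedMultimodalModel(nn.Module):
    def __init__(self, vocab_size=1024, d_model=128, n_layers=2):
        super().__init__()
        self.mask_id = vocab_size                            # extra [MASK] token
        self.embed = nn.Embedding(vocab_size + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

def masked_modeling_loss(model, tokens, mask_ratio=0.5):
    """Mask a random subset of tokens and compute cross-entropy on them only."""
    mask = torch.rand(tokens.shape) < mask_ratio
    corrupted = tokens.masked_fill(mask, model.mask_id)
    logits = model(corrupted)
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

# Toy usage: "image" and "text" token sequences concatenated into one stream.
image_tokens = torch.randint(0, 1024, (2, 64))
text_tokens = torch.randint(0, 1024, (2, 16))
tokens = torch.cat([image_tokens, text_tokens], dim=1)
loss = masked_modeling_loss(TinyMaskedMultimodalModel(), tokens)
loss.backward()
```

Because any subset of modality tokens can serve as input and any other subset as target, the same objective supports the "any-to-any" prediction pattern described above.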

Finally, we analyze the capabilities of the resulting model both qualitatively and quantitatively on a broad range of tasks, datasets, and benchmarks. Our evaluations also include a "status check" of the leading closed-weight multimodal foundation models (e.g., GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet) on several classical computer vision tasks (e.g., semantic segmentation, object detection, depth estimation) by developing prompt chaining techniques, enabling a direct comparison with specialist vision models. We find that while these models are respectable generalists, they are far from the state-of-the-art in all tasks, suggesting plenty of room for improvement in model development.
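
As a rough illustration of prompt chaining, the sketch below decomposes object detection into a chain of two prompts: the model first enumerates objects, then localizes each one with a normalized bounding box. `call_model` is a hypothetical placeholder for a real multimodal API client, and the prompts and JSON parsing are illustrative, not the exact chains developed in the thesis.

```python
# Minimal sketch of prompt chaining for object detection with a closed-weight
# multimodal model; `call_model` is a hypothetical stand-in for an API client.
import json

def call_model(prompt: str, image_path: str) -> str:
    """Hypothetical wrapper around a multimodal chat-completion API."""
    raise NotImplementedError("plug in a real API client here")

def detect_objects(image_path: str) -> dict:
    # Step 1: ask the model to enumerate objects, constrained to a JSON list.
    names = json.loads(call_model(
        "List every distinct object in this image as a JSON array of strings.",
        image_path))
    # Step 2: for each object, ask for a bounding box in normalized coordinates.
    boxes = {}
    for name in names:
        boxes[name] = json.loads(call_model(
            f"Return the bounding box of the {name} as JSON "
            "[x_min, y_min, x_max, y_max] with values in [0, 1].",
            image_path))
    return boxes
```

Constraining each step to structured output is what makes the free-form model's answers directly comparable against specialist detectors on standard benchmarks.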

Type
doctoral thesis
DOI
10.5075/epfl-thesis-10572
Author(s)
Kar, Oguzhan Fatih  
Advisors
Roshan Zamir, Amir  
Jury
Prof. Lenka Zdeborová (president); Prof. Amir Roshan Zamir (thesis director); Prof. Caglar Gulcehre, Prof. Saining Xie, Dr Josh Susskind (reviewers)
Date Issued
2025
Publisher
EPFL
Publisher place
Lausanne
Public defense date
2025-05-12
Thesis number
10572
Number of pages
199

Subjects
computer vision • deep learning • multimodality • foundation models • masked modeling • tokenization • robustness • distribution shifts • data augmentation • ensembling

EPFL units
VILAB  
Faculty
IC  
Doctoral School
EDIC  
Available on Infoscience
May 5, 2025
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/249760