A single neural network that can handle a wide range of tasks and modalities has been a long-standing goal. Such a model brings notable advantages, such as test-time computational efficiency, modality fusion, and reduced model size. Our goal in this thesis is to make progress towards building unified multimodal foundation models that can process diverse inputs such as images, text, 3D, semantics, and other sensory data to solve a wide variety of real-world tasks, including scene understanding, generation, and retrieval. Our approach addresses three core challenges: 1) obtaining diverse and high-quality training data, 2) building a scalable training framework, and 3) evaluating and benchmarking the resulting models.
The first challenge we address is the scarcity of labeled data for multimodal training. One remedy is to use pseudolabels obtained from existing neural networks as a scalable way to generate data for different modalities. However, this approach is undermined by the brittleness of these models in the real world. To tackle this, in the first part of the thesis, we develop robustness mechanisms to build strong pseudolabeling networks and to leverage off-the-shelf pretrained models. These mechanisms aim to handle real-world distribution shifts through 1) realistic data augmentations (3D Common Corruptions), 2) consistency constraints across tasks (Cross-Task Consistency), 3) diverse ensembling using self-supervised domains (Cross-Domain Ensembles) and pretrained vision backbones (BRAVE), and 4) test-time adaptation via error feedback (Rapid Network Adaptation).
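To make the ensembling idea concrete, the sketch below shows one plausible way to merge predictions from several paths that reach the same target (e.g., depth) through different intermediate domains: each path also predicts its own uncertainty, and the ensemble combines the paths with inverse-variance weights so that unreliable paths contribute less under a distribution shift. The module names, shapes, and the exact merging rule are illustrative assumptions, not the implementation used in the thesis.

```python
import torch
import torch.nn as nn

class Path(nn.Module):
    """One ensemble member: a tiny stand-in for an 'RGB -> intermediate domain -> depth' path."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.mean_head = nn.Conv2d(8, 1, 3, padding=1)     # predicted depth
        self.logvar_head = nn.Conv2d(8, 1, 3, padding=1)   # predicted per-pixel log-variance

    def forward(self, x):
        h = self.backbone(x)
        return self.mean_head(h), self.logvar_head(h)

class UncertaintyWeightedEnsemble(nn.Module):
    """Merge per-path predictions with inverse-variance weights (illustrative merging rule)."""
    def __init__(self, paths):
        super().__init__()
        self.paths = nn.ModuleList(paths)

    def forward(self, image):
        means, logvars = zip(*[p(image) for p in self.paths])
        means, logvars = torch.stack(means), torch.stack(logvars)  # (P, B, 1, H, W)
        inv_var = torch.exp(-logvars)                               # 1 / sigma^2 per path and pixel
        weights = inv_var / inv_var.sum(dim=0, keepdim=True)
        return (weights * means).sum(dim=0)                         # uncertainty-weighted prediction

# Toy usage: four illustrative paths merged into a single depth prediction.
ensemble = UncertaintyWeightedEnsemble([Path() for _ in range(4)])
depth = ensemble(torch.randn(2, 3, 64, 64))                         # (2, 1, 64, 64)
```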
Building on this, in the second part of the thesis, we integrate the data obtained from the resulting pseudolabelers and strong vision encoders into a unified training framework (4M). Using a multimodal training objective based on masked modeling and an "any-to-any" model architecture, we scale the training to tens of tasks and modalities and to billions of model parameters. The resulting model, 4M-21, exhibits diverse capabilities, including strong out-of-the-box vision performance, any-conditional and steerable generation, cross-modal retrieval, and multi-sensory fusion, all within a single model.
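As a rough illustration of this objective, the sketch below tokenizes several modalities, samples a random subset of tokens as the model input and a disjoint subset as the prediction target, and trains an encoder-decoder transformer with cross-entropy on the target tokens. The tokenizers, sampling budgets, positional scheme, and model sizes are simplified assumptions for illustration and do not reproduce the actual 4M implementation.

```python
import torch
import torch.nn as nn

def sample_input_and_target(tokens: dict, n_in: int, n_tgt: int):
    """tokens: {modality: LongTensor of token ids}. Returns visible-input and target token ids
    together with their modality indices (used here as a minimal positional signal)."""
    flat_ids, flat_mod = [], []
    for m_idx, (name, ids) in enumerate(tokens.items()):
        flat_ids.append(ids)
        flat_mod.append(torch.full_like(ids, m_idx))
    flat_ids, flat_mod = torch.cat(flat_ids), torch.cat(flat_mod)
    perm = torch.randperm(flat_ids.numel())
    return (flat_ids[perm[:n_in]], flat_mod[perm[:n_in]],                         # visible input tokens
            flat_ids[perm[n_in:n_in + n_tgt]], flat_mod[perm[n_in:n_in + n_tgt]])  # masked-out targets

class AnyToAnyMaskedModel(nn.Module):
    def __init__(self, vocab=1024, n_modalities=4, dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.mod_emb = nn.Embedding(n_modalities, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, 4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, in_ids, in_mod, tgt_mod):
        # Encode only the visible tokens from all modalities jointly.
        memory = self.encoder((self.tok_emb(in_ids) + self.mod_emb(in_mod)).unsqueeze(0))
        # Decoder queries know only which modality they belong to; they must predict the token ids.
        queries = self.mod_emb(tgt_mod).unsqueeze(0)
        return self.head(self.decoder(queries, memory)).squeeze(0)  # logits over the token vocabulary

# Toy usage: RGB, depth, and caption tokens for one sample.
tokens = {"rgb": torch.randint(0, 1024, (16,)),
          "depth": torch.randint(0, 1024, (16,)),
          "caption": torch.randint(0, 1024, (8,))}
in_ids, in_mod, tgt_ids, tgt_mod = sample_input_and_target(tokens, n_in=12, n_tgt=8)
model = AnyToAnyMaskedModel(n_modalities=len(tokens))
loss = nn.functional.cross_entropy(model(in_ids, in_mod, tgt_mod), tgt_ids)
```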
Finally, we analyze the capabilities of the resulting model both qualitatively and quantitatively on a broad range of tasks, datasets, and benchmarks. Our evaluations also include a "status check" of the leading closed-weight multimodal foundation models (e.g., GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet) on several classical computer vision tasks (e.g., semantic segmentation, object detection, depth estimation): we develop prompt chaining techniques that recast these tasks into forms the models can answer, enabling a direct comparison with specialist vision models. We find that while these models are respectable generalists, they remain far from the specialist state of the art on every task we evaluate, suggesting ample room for improvement in model development.
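The snippet below gives a hypothetical flavor of such prompt chaining for semantic segmentation: the dense task is decomposed into per-region classification queries to a chat-style multimodal API, and the answers are reassembled into a label map that can be scored like the output of a specialist model. `query_model`, the label set, and the region proposals are stand-ins; the actual chains used in the thesis may differ in their decomposition and prompts.

```python
from typing import Callable, List
import numpy as np

LABELS = ["person", "car", "road", "building", "vegetation", "sky"]

def segment_via_prompt_chaining(image: np.ndarray,
                                regions: List[np.ndarray],           # boolean masks, one per region
                                query_model: Callable[[np.ndarray, str], str]) -> np.ndarray:
    """Return an integer label map of the same spatial size as `image` (-1 = unassigned)."""
    label_map = np.full(image.shape[:2], -1, dtype=np.int64)
    for region in regions:
        ys, xs = np.where(region)
        crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]   # tight crop around the region
        prompt = ("Which one of the following classes best describes this region: "
                  + ", ".join(LABELS) + "? Answer with a single class name.")
        answer = query_model(crop, prompt).strip().lower()
        label_map[region] = LABELS.index(answer) if answer in LABELS else -1
    return label_map

# Dummy stand-in to show the call pattern; a real evaluation would call a model API here.
dummy = lambda crop, prompt: "road"
mask = segment_via_prompt_chaining(np.zeros((64, 64, 3), np.uint8),
                                   [np.ones((64, 64), dtype=bool)], dummy)
```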