Exploiting Representation Similarities in Self-Supervised Learning for Vision Tasks
Recent progress in computer vision has been driven by a simple yet transformative observation: the potential of neural networks designed three decades ago can be unlocked by massively scaling both their parameters and the amount of human-labeled training data. This breakthrough was made possible by advances in computing hardware and a sharp reduction in its cost.
However, the bottleneck has recently shifted from computational resources to the availability of high-quality labeled data. Beyond the prohibitive cost of data annotation, labels are often noisy and can introduce biases into learned representations. In addition, representations learned under the supervised learning paradigm tend to be task-specific, which limits their transferability. These challenges have catalyzed the rise of self-supervised visual representation learning, which is the central focus of this thesis.
This manuscript places special emphasis on advancing self-supervised learning by leveraging the knowledge the model itself acquires during training. First, we move beyond standard self-distillation objectives, which enforce consistency only between views of the same image, by additionally encouraging similarity between different yet semantically related images. This extension leverages the model's own ability to identify valid positive pairs.
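To make the mechanism concrete, the following is a minimal sketch in PyTorch, with hypothetical names such as cross_image_distillation_loss and bank; it illustrates the general idea of using the teacher's nearest neighbour from a bank of other images as the distillation target, not the exact objective developed in the thesis.

```python
# Illustrative sketch: cross-image self-distillation with nearest-neighbour targets.
import torch
import torch.nn.functional as F

def cross_image_distillation_loss(student_out, teacher_out, bank, temp_s=0.1, temp_t=0.04):
    """student_out, teacher_out: (B, D) projections of the current batch;
    bank: (N, D) teacher embeddings, assumed to come from *other* images."""
    with torch.no_grad():
        # For each image, retrieve its most similar entry in the bank of different images.
        sims = F.normalize(teacher_out, dim=-1) @ F.normalize(bank, dim=-1).T   # (B, N)
        nn_targets = bank[sims.argmax(dim=-1)]                                   # (B, D)
        targets = F.softmax(nn_targets / temp_t, dim=-1)                         # sharpened teacher targets
    log_preds = F.log_softmax(student_out / temp_s, dim=-1)
    # Cross-entropy between the student's prediction and the retrieved cross-image target.
    return -(targets * log_preds).sum(dim=-1).mean()
```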
Next, we address self-supervised pre-training for dense downstream tasks by deriving the supervisory signal at the object level rather than at the image level. This approach exploits the model's capacity to identify semantically coherent region pairs across views.
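As an illustration, the sketch below (PyTorch; the helpers pool_regions and object_consistency_loss, and the assumption that region masks are already matched across the two views, are ours rather than the thesis') pools dense features inside each region and pulls matched regions together.

```python
# Illustrative sketch: object-level consistency between two augmented views.
import torch
import torch.nn.functional as F

def pool_regions(feats, masks):
    """feats: (B, C, H, W) dense features; masks: (B, K, H, W) binary region masks."""
    masks = masks.float()
    area = masks.sum(dim=(2, 3)).clamp(min=1.0)                       # (B, K) region sizes
    pooled = torch.einsum('bchw,bkhw->bkc', feats, masks) / area.unsqueeze(-1)
    return F.normalize(pooled, dim=-1)                                 # (B, K, C) region embeddings

def object_consistency_loss(feats_v1, feats_v2, masks_v1, masks_v2):
    """Assumes region k in view 1 corresponds to region k in view 2."""
    z1 = pool_regions(feats_v1, masks_v1)
    z2 = pool_regions(feats_v2, masks_v2)
    # Cosine-based consistency between matched region embeddings.
    return (2 - 2 * (z1 * z2).sum(dim=-1)).mean()
```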
Building on these contributions, we design a distillation framework that enforces consistency between object pairs drawn from different images, yielding a model well-suited for in-context scene understanding. This enables visual tasks such as semantic segmentation to be performed directly from labeled examples, without any fine-tuning.
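The sketch below (PyTorch; retrieve_segmentation and the labelled patch bank are illustrative assumptions) shows one way a frozen encoder can be used in context: each query patch is classified by similarity-weighted voting over its nearest neighbours in a small bank of patch features extracted from the labeled examples, with no parameter updates.

```python
# Illustrative sketch: in-context segmentation via nearest-neighbour retrieval.
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_segmentation(query_feats, bank_feats, bank_labels, k=5, temp=0.07):
    """query_feats: (P, D) patch features of a test image;
    bank_feats: (M, D) patch features from labeled examples; bank_labels: (M,) class ids."""
    q = F.normalize(query_feats, dim=-1)
    b = F.normalize(bank_feats, dim=-1)
    sims, idx = (q @ b.T).topk(k, dim=-1)               # (P, k) nearest labeled patches
    votes = bank_labels[idx]                            # (P, k) their class labels
    weights = F.softmax(sims / temp, dim=-1)            # similarity-weighted voting
    num_classes = int(bank_labels.max()) + 1
    one_hot = F.one_hot(votes, num_classes).float()     # (P, k, C)
    # Aggregate weighted votes and return one class id per patch.
    return (weights.unsqueeze(-1) * one_hot).sum(dim=1).argmax(dim=-1)
```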
Finally, we target zero-shot semantic segmentation by aligning object-level representations across vision and language modalities, leveraging only image-caption pairs.
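As a rough illustration of such cross-modal alignment, the sketch below (PyTorch; summarizing objects by a simple mean and using a symmetric InfoNCE loss are assumptions, not necessarily the method developed in the thesis) contrasts object-level image summaries against caption embeddings from paired data.

```python
# Illustrative sketch: aligning object-level visual embeddings with caption embeddings.
import torch
import torch.nn.functional as F

def image_text_alignment_loss(object_embs, caption_embs, temp=0.07):
    """object_embs: (B, K, D) object-level embeddings per image; caption_embs: (B, D)."""
    img = F.normalize(object_embs.mean(dim=1), dim=-1)   # (B, D) image summary over its objects
    txt = F.normalize(caption_embs, dim=-1)              # (B, D) caption embeddings
    logits = img @ txt.T / temp                          # (B, B) image-to-caption similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric contrastive loss over matched image-caption pairs.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```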