doctoral thesis

Exploiting Representation Similarities in Self-Supervised Learning for Vision Tasks

Stegmüller, Thomas Grégoire  
2025

Recent progress in computer vision has been driven by a simple yet transformative observation: the potential of 30-year-old neural networks can be unlocked by massively scaling their parameters and the amount of human-labeled training data. This breakthrough was made possible by advances in hardware and significant reductions in its cost.

However, the bottleneck has recently shifted from computational resources to the availability of high-quality labeled data. Beyond the prohibitive cost of data annotation, labels are often noisy and can introduce biases into learned representations. In addition, representations learned under the supervised learning paradigm tend to be task-specific, which limits their transferability. These challenges have catalyzed the rise of self-supervised visual representation learning, which is the central focus of this thesis.

This manuscript places special emphasis on advancing self-supervised learning by leveraging the knowledge the model acquires during training. First, we move beyond standard self-distillation objectives, which enforce consistency only between views of the same image, by additionally encouraging similarity between different images; this leverages the model's ability to identify valid pairs.
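
As a rough illustration of this first direction, the sketch below shows a PyTorch-style self-distillation loss in which the teacher embedding is replaced by its nearest neighbour drawn from a support set of embeddings of other images. The function names, the support set, and the plain cosine objective are illustrative assumptions, not the exact formulation used in the thesis.

    import torch
    import torch.nn.functional as F

    def cross_view_loss(student_embed, teacher_embed):
        # Standard objective: two views of the same image should agree.
        s = F.normalize(student_embed, dim=-1)
        t = F.normalize(teacher_embed, dim=-1)
        return -(s * t).sum(dim=-1).mean()

    def cross_image_loss(student_embed, teacher_embed, support_set):
        # Swap the teacher embedding for its nearest neighbour taken from a
        # support set of embeddings of *other* images, so that similarity is
        # also encouraged between different images the model deems a valid pair.
        s = F.normalize(student_embed, dim=-1)   # (B, D) student view embeddings
        t = F.normalize(teacher_embed, dim=-1)   # (B, D) teacher view embeddings
        q = F.normalize(support_set, dim=-1)     # (M, D) embeddings of other images
        nn_idx = (t @ q.T).argmax(dim=-1)        # nearest neighbour per sample
        return -(s * q[nn_idx]).sum(dim=-1).mean()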

Next, we address self-supervised pre-training for dense downstream tasks by deriving supervisory signals at object resolution rather than at image resolution. This approach exploits the model's capacity to identify semantically coherent region pairs across views.
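
A minimal sketch of object-resolution supervision, under the assumption that region masks have already been matched across the two augmented views (the matching step is not shown); names and shapes are hypothetical.

    import torch
    import torch.nn.functional as F

    def region_pool(patch_feats, region_masks):
        # Average patch features inside each region.
        # patch_feats: (N, D) patch tokens, region_masks: (R, N) binary float masks.
        weights = region_masks / region_masks.sum(dim=-1, keepdim=True).clamp(min=1)
        return weights @ patch_feats              # (R, D): one embedding per region

    def object_level_loss(patch_feats_v1, patch_feats_v2, masks_v1, masks_v2):
        # Consistency between corresponding regions of two augmented views,
        # i.e. the supervisory signal lives at object rather than image resolution.
        r1 = F.normalize(region_pool(patch_feats_v1, masks_v1), dim=-1)
        r2 = F.normalize(region_pool(patch_feats_v2, masks_v2), dim=-1)
        return -(r1 * r2).sum(dim=-1).mean()      # negative cosine similarity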

Building on these contributions, we design a distillation framework that enforces consistency between object pairs across different images, resulting in a model well-suited for in-context scene understanding. This enables visual tasks such as semantic segmentation using labeled examples, without the need for fine-tuning.
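
The in-context scene-understanding setting can be illustrated by a simple nearest-neighbour retrieval scheme over patch features, with no fine-tuning: labelled patch embeddings form a memory bank and each query patch inherits the label of its closest entries. This is a generic sketch, not the thesis's exact procedure.

    import torch
    import torch.nn.functional as F

    def retrieval_segmentation(query_patches, memory_patches, memory_labels, k=5):
        # query_patches: (Nq, D) patch features of the query image,
        # memory_patches: (Nm, D) patch features of the labelled examples,
        # memory_labels: (Nm,) integer class id of each labelled patch.
        q = F.normalize(query_patches, dim=-1)
        m = F.normalize(memory_patches, dim=-1)
        sims, idx = (q @ m.T).topk(k, dim=-1)     # k most similar labelled patches
        votes = memory_labels[idx]                # (Nq, k) candidate class ids
        num_classes = int(memory_labels.max()) + 1
        scores = torch.zeros(q.size(0), num_classes,
                             dtype=sims.dtype, device=q.device)
        scores.scatter_add_(1, votes, sims)       # similarity-weighted vote per class
        return scores.argmax(dim=-1)              # (Nq,) predicted class per patch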

Finally, we target zero-shot semantic segmentation by aligning object-level representations across vision and language modalities, leveraging only image-caption pairs.
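
A hedged sketch of the zero-shot step: assuming object-level visual embeddings and text embeddings of candidate class names have been aligned during pre-training on image-caption pairs, each region is assigned the class whose text embedding it is most similar to. All inputs and names here are placeholders.

    import torch
    import torch.nn.functional as F

    def zero_shot_segment(object_embeddings, class_text_embeddings, temperature=0.07):
        # object_embeddings: (R, D) one embedding per object/region,
        # class_text_embeddings: (C, D) one embedding per candidate class name.
        v = F.normalize(object_embeddings, dim=-1)
        t = F.normalize(class_text_embeddings, dim=-1)
        logits = v @ t.T / temperature            # (R, C) region-to-class similarity
        return logits.softmax(dim=-1).argmax(dim=-1)  # predicted class per region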
