conference paper

BRAVE: Broadening the Visual Encoding of Vision-Language Models

Kar, Oğuzhan Fatih
•
Tonioni, Alessio
•
Poklukar, Petra
•
Kulshrestha, Achin
•
Zamir, Amir
•
Tombari, Federico
2025
Computer Vision – ECCV 2024 - 18th European Conference, Proceedings
18th European Conference on Computer Vision

Vision-language models (VLMs) are typically composed of a vision encoder, e.g., CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g., “blindness” to certain image features and visual hallucination. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring a smaller number of trainable parameters than existing methods and having a more compressed representation. Our results highlight the potential of incorporating different visual biases for a broader and more contextualized visual understanding in VLMs.
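
The abstract does not say how features from several frozen encoders are consolidated. As a rough illustration only, the sketch below assumes one common pattern: project each encoder's tokens to a shared width, concatenate them, and let a small set of query tokens cross-attend over the result to produce a fixed-size, compressed sequence for the LM. The function name `consolidate`, the single-head attention, the random stand-in features, and all shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def consolidate(features, projections, queries):
    """Fuse per-encoder features into a fixed number of tokens (hypothetical sketch).

    features:    list of arrays, one per frozen encoder, shape (n_i, d_i)
    projections: list of arrays, shape (d_i, d), mapping each encoder to a shared width d
    queries:     array of shape (k, d), assumed trainable query tokens
    returns:     array of shape (k, d), a compressed representation for the LM
    """
    # Project every encoder's tokens to the shared width and stack them in one sequence.
    tokens = np.concatenate([f @ p for f, p in zip(features, projections)], axis=0)
    # Single-head cross-attention: each query attends over all encoders' tokens.
    scores = queries @ tokens.T / np.sqrt(queries.shape[1])
    return softmax(scores, axis=-1) @ tokens

rng = np.random.default_rng(0)
# Random stand-ins for the outputs of three frozen encoders (token count, width).
features = [rng.normal(size=(n, d)) for n, d in [(196, 768), (256, 1024), (49, 512)]]
d_shared, k = 64, 32
projections = [rng.normal(size=(f.shape[1], d_shared)) / np.sqrt(f.shape[1]) for f in features]
queries = rng.normal(size=(k, d_shared))

fused = consolidate(features, projections, queries)
print(fused.shape)  # (32, 64)
```

In this toy setup, 501 input tokens from three encoders are reduced to 32 tokens, and only the projections and queries would need training while the encoders and the LM stay frozen, which mirrors the abstract's claims about a compressed representation and a small number of trainable parameters.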

Type
conference paper
DOI
10.1007/978-3-031-72640-8_7
Scopus ID

2-s2.0-85209790447

Author(s)
Kar, Oğuzhan Fatih

Google Switzerland GmbH

Tonioni, Alessio

Google Switzerland GmbH

Poklukar, Petra

Google Switzerland GmbH

Kulshrestha, Achin

Google Switzerland GmbH

Zamir, Amir  

École Polytechnique Fédérale de Lausanne

Tombari, Federico

Google Switzerland GmbH

Editors
Leonardis, Aleš
•
Ricci, Elisa
•
Roth, Stefan
•
Russakovsky, Olga
•
Sattler, Torsten
•
Varol, Gül
Date Issued

2025

Publisher

Springer Science and Business Media Deutschland GmbH

Published in
Computer Vision – ECCV 2024 - 18th European Conference, Proceedings
Series title/Series vol.

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); 15074 LNCS

ISSN (of the series)

1611-3349

0302-9743

Start page

113

End page

132

Subjects

Multi-modal Learning

•

Vision-Language

•

Visual Encoding

Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
VILAB  
Event name

18th European Conference on Computer Vision

Event place

Milan, Italy

Event date

2024-09-29 - 2024-10-04

Available on Infoscience
January 26, 2025
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/244839