A Simple Framework for Open-Vocabulary Zero-Shot Segmentation
Zero-shot classification capabilities naturally arise in models trained within a vision-language contrastive framework. Despite their classification prowess, these models struggle in dense tasks such as zero-shot open-vocabulary segmentation. This deficiency is often attributed to the absence of localization cues in captions and to the intertwined nature of the learning process, which encompasses both image/text representation learning and cross-modality alignment. To tackle these issues, we propose SimZSS, a Simple framework for open-vocabulary Zero-Shot Segmentation. The method is founded on two key principles: i) leveraging frozen vision-only models that exhibit spatial awareness while exclusively aligning the text encoder, and ii) exploiting the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions. By capitalizing on the quality of the visual representations, our method requires only image-caption pairs and adapts to both small curated and large-scale noisy datasets. When trained on COCO Captions across 8 GPUs, SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes. Our code and pretrained models are publicly available at https://github.com/tileb1/simzss.
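The abstract describes the recipe only at a high level, so the following is a minimal, illustrative sketch of the two stated principles: a frozen, spatially aware vision backbone supplies patch features, only a text encoder is trained, and caption tokens belonging to a noun-phrase concept are contrastively aligned with pooled visual evidence. Every name, shape, and design detail below (FrozenVisionProjector, TextEncoder, concept_alignment_loss, the soft region pooling, the toy vocabulary size) is an assumption made for illustration and is not the released SimZSS implementation; see the repository linked above for that.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenVisionProjector(nn.Module):
    """Stand-in for a frozen, spatially aware vision backbone (principle i).

    In practice the patch tokens would come from a pretrained vision-only
    model; here a frozen linear projection over dummy patch features keeps
    the sketch self-contained.
    """
    def __init__(self, in_dim=768, dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        for p in self.parameters():
            p.requires_grad = False  # the vision side receives no gradients

    def forward(self, patch_tokens):             # (B, N, in_dim)
        return self.proj(patch_tokens)            # (B, N, dim)

class TextEncoder(nn.Module):
    """Trainable text encoder: the only module being aligned."""
    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):                 # (B, T)
        return self.enc(self.emb(token_ids))       # (B, T, dim)

def concept_alignment_loss(patch_feats, token_feats, concept_mask, tau=0.07):
    """Contrastively align one textual concept per caption with pooled visual evidence.

    concept_mask: (B, T) boolean mask over caption tokens that belong to a
    noun phrase found by an off-the-shelf parser (principle ii).
    """
    denom = concept_mask.sum(1, keepdim=True).clamp(min=1)
    txt = (token_feats * concept_mask.unsqueeze(-1)).sum(1) / denom       # (B, dim)
    # Soft region pooling: weight patches by similarity to the concept.
    sim = torch.einsum("bnd,bd->bn",
                       F.normalize(patch_feats, dim=-1),
                       F.normalize(txt, dim=-1))
    vis = torch.einsum("bn,bnd->bd", sim.softmax(dim=1), patch_feats)     # (B, dim)
    # Symmetric InfoNCE across the batch.
    logits = F.normalize(vis, dim=-1) @ F.normalize(txt, dim=-1).t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Toy usage: only the text encoder accumulates gradients.
vision, text = FrozenVisionProjector(), TextEncoder()
patches = vision(torch.randn(4, 196, 768))
tokens = text(torch.randint(0, 30522, (4, 32)))
mask = torch.zeros(4, 32, dtype=torch.bool)
mask[:, 1:4] = True                               # pretend tokens 1-3 form a noun phrase
loss = concept_alignment_loss(patches, tokens, mask)
loss.backward()
```

Because the vision parameters never receive gradients, the visual feature space and its spatial structure are preserved; training only adjusts where the captions' concepts land in that space, which is the intuition behind principle i in the abstract.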
Author affiliations: École Polytechnique Fédérale de Lausanne; KU Leuven; KU Leuven; École Polytechnique Fédérale de Lausanne; KU Leuven; École Polytechnique Fédérale de Lausanne
Year: 2025
ISBN: 9798331320850
Publication date: 2025-July
Pages: 72821-72842
Review status: REVIEWED
Institution: EPFL
| Event name | Event acronym | Event place | Event date |
| ICLR 2025 | ICLR | Singapore | 2025-04-24 – 2025-04-28 |
| Funder | Funding(s) | Grant Number | Grant URL |
| Flemish Government | | | |
| Onderzoeksprogramma Artificiële Intelligentie | | | |
| European Research Council | | | |