A Simple Framework for Open-Vocabulary Zero-Shot Segmentation
Zero-shot classification capabilities naturally arise in models trained within a vision-language contrastive framework. Despite their classification prowess, these models struggle in dense tasks such as zero-shot open-vocabulary segmentation. This deficiency is often attributed to the absence of localization cues in captions and to the intertwined nature of the learning process, which encompasses both image/text representation learning and cross-modality alignment. To tackle these issues, we propose SimZSS, a Simple framework for open-vocabulary Zero-Shot Segmentation. The method is founded on two key principles: i) leveraging frozen vision-only models that exhibit spatial awareness while exclusively aligning the text encoder, and ii) exploiting the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions. By capitalizing on the quality of the visual representations, our method requires only image-caption pairs and adapts to both small curated and large-scale noisy datasets. When trained on COCO Captions across 8 GPUs, SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes. Our code and pretrained models are publicly available at https://github.com/tileb1/simzss.
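The abstract describes the recipe only at a high level, so the following is a minimal, illustrative sketch of the two stated principles: a frozen, spatially aware vision backbone supplies patch features, only a text encoder is trained, and caption tokens belonging to a noun-phrase concept are contrastively aligned with pooled visual evidence. Every name, shape, and design detail below (FrozenVisionProjector, TextEncoder, concept_alignment_loss, the soft region pooling, the toy vocabulary size) is an assumption made for illustration and is not the released SimZSS implementation; see the repository linked above for that.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenVisionProjector(nn.Module):
    """Stand-in for a frozen, spatially aware vision backbone (principle i).

    In practice the patch tokens would come from a pretrained vision-only
    model; here a frozen linear projection over dummy patch features keeps
    the sketch self-contained.
    """
    def __init__(self, in_dim=768, dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        for p in self.parameters():
            p.requires_grad = False  # the vision side receives no gradients

    def forward(self, patch_tokens):             # (B, N, in_dim)
        return self.proj(patch_tokens)            # (B, N, dim)

class TextEncoder(nn.Module):
    """Trainable text encoder: the only module being aligned."""
    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):                 # (B, T)
        return self.enc(self.emb(token_ids))       # (B, T, dim)

def concept_alignment_loss(patch_feats, token_feats, concept_mask, tau=0.07):
    """Contrastively align one textual concept per caption with pooled visual evidence.

    concept_mask: (B, T) boolean mask over caption tokens that belong to a
    noun phrase found by an off-the-shelf parser (principle ii).
    """
    denom = concept_mask.sum(1, keepdim=True).clamp(min=1)
    txt = (token_feats * concept_mask.unsqueeze(-1)).sum(1) / denom       # (B, dim)
    # Soft region pooling: weight patches by similarity to the concept.
    sim = torch.einsum("bnd,bd->bn",
                       F.normalize(patch_feats, dim=-1),
                       F.normalize(txt, dim=-1))
    vis = torch.einsum("bn,bnd->bd", sim.softmax(dim=1), patch_feats)     # (B, dim)
    # Symmetric InfoNCE across the batch.
    logits = F.normalize(vis, dim=-1) @ F.normalize(txt, dim=-1).t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Toy usage: only the text encoder accumulates gradients.
vision, text = FrozenVisionProjector(), TextEncoder()
patches = vision(torch.randn(4, 196, 768))
tokens = text(torch.randint(0, 30522, (4, 32)))
mask = torch.zeros(4, 32, dtype=torch.bool)
mask[:, 1:4] = True                               # pretend tokens 1-3 form a noun phrase
loss = concept_alignment_loss(patches, tokens, mask)
loss.backward()
```

Because the vision parameters never receive gradients, the visual feature space and its spatial structure are preserved; training only adjusts where the captions' concepts land in that space, which is the intuition behind principle i in the abstract.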
Author affiliations: École Polytechnique Fédérale de Lausanne; KU Leuven; KU Leuven; École Polytechnique Fédérale de Lausanne; KU Leuven; École Polytechnique Fédérale de Lausanne
Year: 2025
ISBN: 9798331320850
Publication date: 2025-July
Pages: 72821-72842
Review status: REVIEWED
Institution: EPFL
| Event name | Event acronym | Event place | Event date |
| ICLR 2025 | ICLR | Singapore | 2025-04-24 – 2025-04-28 |
| Funder | Funding(s) | Grant Number | Grant URL |
| Flemish Government | | | |
| Onderzoeksprogramma Artificiële Intelligentie | | | |
| European Research Council | | | |