conference paper

VETIM: Expanding the Vocabulary of Text-to-Image Models only with Text

Everaert, Martin Nicolas • Bocchio, Marco • Arpa, Sami • Süsstrunk, Sabine • Achanta, Radhakrishna
November 2023
34th British Machine Vision Conference 2023
The 34th British Machine Vision Conference (BMVC 2023)

Text-to-image models, such as Stable Diffusion, can generate high-quality images from simple textual prompts. With methods such as Textual Inversion, it is possible to expand the vocabulary of these models with additional concepts, by learning the vocabulary embedding of new tokens. These methods have two limitations: slowness of optimisation and dependence on sample images. Slowness mainly stems from the use of the original text-to-image training loss, without considering potential auxiliary supervision terms. Relying on sample images enables learning new visual features but restricts the vocabulary expansion to concepts with pre-existing images. In response, we introduce a novel approach, named VETIM, which takes only a textual description of the concept as input. It expands the vocabulary through supervision only at the text encoder output, without accessing the image-generation part, making it faster at optimisation time. It also does not copy visual features from sample images. Our method can be used directly for applications that require a concept as a single token but do not require learning new visual features. Our approach shows that a mere textual description suffices to obtain a single token referring to a specific concept. To show the effectiveness of our method, we evaluate its performance subjectively and through objective measures. The results show that our approach is effective in expanding the vocabulary of text-to-image models without requiring images.
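
The abstract describes the core mechanism: the embedding of a new token is optimised with supervision only at the text-encoder output, using a textual description and no images or diffusion loss. The Python sketch below illustrates that general idea only and is not the authors' implementation; the model name, prompt templates, MSE objective, and hyperparameters are assumptions, and the placeholder token and description are hypothetical. See the project repository (https://github.com/IVRL/vetim) for the actual method.

# Hypothetical sketch only -- not the paper's implementation.
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModel

model_id = "openai/clip-vit-large-patch14"   # assumption: Stable Diffusion v1.x text encoder
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)
text_encoder.requires_grad_(False)           # the text encoder itself stays frozen

# Register a placeholder token for the new concept and expose its embedding row.
placeholder = "<my-concept>"                 # hypothetical token name
tokenizer.add_tokens([placeholder])
text_encoder.resize_token_embeddings(len(tokenizer))
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)
new_id = tokenizer.convert_tokens_to_ids(placeholder)

description = "a wooden rowing boat with a red stripe"   # hypothetical textual description

def encode(prompt: str) -> torch.Tensor:
    """Return the text-encoder output (last hidden states) for a prompt."""
    ids = tokenizer(prompt, padding="max_length", truncation=True,
                    max_length=tokenizer.model_max_length,
                    return_tensors="pt").input_ids
    return text_encoder(ids).last_hidden_state

# Supervision only at the text-encoder output: no diffusion model, no sample images.
target = encode(f"a photo of {description}").detach()
optimizer = torch.optim.Adam([embeddings.weight], lr=5e-3)

for step in range(500):
    loss = F.mse_loss(encode(f"a photo of {placeholder}"), target)
    optimizer.zero_grad()
    loss.backward()
    # Only the new token's embedding is learned; zero the gradients of all other rows.
    grad_mask = torch.zeros_like(embeddings.weight.grad)
    grad_mask[new_id] = 1.0
    embeddings.weight.grad.mul_(grad_mask)
    optimizer.step()

# After optimisation, "<my-concept>" can be used as a single token in prompts
# passed to the full text-to-image pipeline.

Because the objective never touches the image-generation part of the model, each optimisation step only requires a forward and backward pass through the text encoder, which is where the speed advantage over image-based approaches such as Textual Inversion comes from.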

Type
conference paper
Author(s)
Everaert, Martin Nicolas (EPFL)
Bocchio, Marco
Arpa, Sami
Süsstrunk, Sabine (EPFL)
Achanta, Radhakrishna (EPFL)
Date Issued
2023-11
Publisher
BMVA
Published in
34th British Machine Vision Conference 2023
URL
Link to the conference paper: https://proceedings.bmvc2023.org/16/
Link to the proceedings: https://proceedings.bmvc2023.org/
Link to the project: https://github.com/IVRL/vetim
Editorial or Peer reviewed
REVIEWED
Written at
EPFL
EPFL units
IVRL  
Event name: The 34th British Machine Vision Conference (BMVC 2023)
Event acronym: BMVC
Event place: Aberdeen, UK
Event date: 2023-11-20 - 2023-11-24
Available on Infoscience
December 13, 2023
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/202617

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved.