conference paper

Bootstrapping Language-Audio Pre-training for Music Captioning

Authors: Lanzendörfer, Luca A. • Pinkl, Constantin • Perraudin, Nathanaël • Wattenhofer, Roger
Editors: Rao, Bhaskar D • Trancoso, Isabel • Sharma, Gaurav • Mehta, Neelesh B.
2025
Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We introduce BLAP, a model capable of generating high-quality captions for music. BLAP leverages a fine-tuned CLAP audio encoder and a pre-trained Flan-T5 large language model. To achieve effective cross-modal alignment between music and language, BLAP utilizes a Querying Transformer, allowing us to obtain state-of-the-art performance using 6x less data compared to previous models. This is a critical consideration given the scarcity of descriptive music data and the subjective nature of music interpretation. We provide qualitative examples demonstrating BLAP's ability to produce realistic captions for music, and perform a quantitative evaluation on three datasets. BLAP achieves a relative improvement on FENSE compared to previous models of 3.5%, 6.5%, and 7.5% on the MusicCaps, Song Describer, and YouTube8m-MTC datasets, respectively. The codebase is available at https://github.com/ETH-DISCO/blap.

Type: conference paper
DOI: 10.1109/ICASSP49660.2025.10887618
Scopus ID: 2-s2.0-105003871361

Author(s)
  • Lanzendörfer, Luca A. (ETH Zürich)
  • Pinkl, Constantin (ETH Zürich)
  • Perraudin, Nathanaël (École Polytechnique Fédérale de Lausanne)
  • Wattenhofer, Roger (ETH Zürich)

Editors
  • Rao, Bhaskar D
  • Trancoso, Isabel
  • Sharma, Gaurav
  • Mehta, Neelesh B.
Date Issued: 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Publisher place: Piscataway, NJ - USA
Published in: Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI of the book: https://doi.org/10.1109/ICASSP49660.2025
ISBN of the book: 9798350368741

Subjects
  • Contrastive Language-Audio Pre-training
  • Language Models
  • Music Captioning

Editorial or Peer reviewed: REVIEWED
Written at: EPFL
EPFL units: EPFL
Event name: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Event acronym: ICASSP 2025
Event place: Hyderabad, India
Event date: 2025-04-06 - 2025-04-11

Available on Infoscience: May 12, 2025
Use this identifier to reference this record: https://infoscience.epfl.ch/handle/20.500.14299/250018