Bootstrapping Language-Audio Pre-training for Music Captioning
We introduce BLAP, a model capable of generating high-quality captions for music. BLAP leverages a fine-tuned CLAP audio encoder and a pre-trained Flan-T5 large language model. To achieve effective cross-modal alignment between music and language, BLAP uses a Querying Transformer, which allows it to reach state-of-the-art performance with 6x less data than previous models, a critical consideration given the scarcity of descriptive music data and the subjective nature of music interpretation. We provide qualitative examples demonstrating BLAP's ability to produce realistic captions for music, and perform a quantitative evaluation on three datasets. BLAP achieves relative improvements in FENSE of 3.5%, 6.5%, and 7.5% over previous models on the MusicCaps, Song Describer, and YouTube8m-MTC datasets, respectively. The codebase is available at https://github.com/ETH-DISCO/blap.
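Below is a minimal, illustrative sketch of the pipeline described in the abstract: audio embeddings from a CLAP-style encoder are compressed by a small set of learnable query tokens (a simplified stand-in for the Querying Transformer) and fed to Flan-T5 as a soft prefix for caption generation. The checkpoint names, dimensions, and the bridge architecture are assumptions for illustration, not the exact components used in the paper; see the repository linked above for the actual implementation.

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer


class QueryingBridge(nn.Module):
    """Learnable queries cross-attend to audio embeddings (Q-Former-like, simplified)."""

    def __init__(self, num_queries=32, audio_dim=512, llm_dim=768, depth=2):
        super().__init__()
        # Fixed-size set of query tokens that summarize a variable-length audio sequence.
        self.queries = nn.Parameter(torch.randn(1, num_queries, audio_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=audio_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        # Project query outputs into the language model's embedding space.
        self.to_llm = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_embeds):  # audio_embeds: (B, T, audio_dim)
        q = self.queries.expand(audio_embeds.size(0), -1, -1)
        out = self.decoder(tgt=q, memory=audio_embeds)  # queries attend to audio features
        return self.to_llm(out)  # (B, num_queries, llm_dim)


# Assumed caption-generation flow: the query outputs act as a soft prefix for T5.
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")  # assumed checkpoint
t5 = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
bridge = QueryingBridge(llm_dim=t5.config.d_model)

audio_embeds = torch.randn(1, 128, 512)  # placeholder for CLAP frame embeddings
prefix = bridge(audio_embeds)
caption_ids = t5.generate(inputs_embeds=prefix, max_new_tokens=64)
print(tokenizer.decode(caption_ids[0], skip_special_tokens=True))
```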
ETH Zürich
ETH Zürich
École Polytechnique Fédérale de Lausanne
ETH Zürich
2025
Piscataway, NJ - USA
9798350368741
REVIEWED
EPFL
| Event name | Event acronym | Event place | Event date |
| --- | --- | --- | --- |
| IEEE International Conference on Acoustics, Speech and Signal Processing | ICASSP 2025 | Hyderabad, India | 2025-04-06 - 2025-04-11 |