Abstract

WildCLIP: Scene and animal attribute retrieval from camera trap data with domain-adapted vision-language models

Authors: Valentin Gabeff, Marc Russwurm, Devis Tuia & Alexander Mathis
Affiliation: EPFL
Date: January 2024
Link to the article: https://link.springer.com/article/10.1007/s11263-024-02026-6

--------------------------------

WildCLIP is a fine-tuned CLIP model that retrieves camera-trap events from the Snapshot Serengeti dataset using natural-language queries. This project demonstrates how vision-language models may assist the annotation process of camera-trap datasets. Here we provide the processed Snapshot Serengeti data used to train and evaluate WildCLIP, along with two versions of WildCLIP (model weights). Details on how to run these models can be found in the project GitHub repository.

Provided data (images and attribute annotations):

The data consists of 380 x 380 image crops corresponding to the MegaDetector outputs on Snapshot Serengeti with a confidence threshold above 0.7. Only camera-trap images containing a single individual were considered. A description of the original data can be found on LILA (https://lila.science/datasets/snapshot-serengeti); it is released under the Community Data License Agreement (permissive variant). We warmly thank the authors of LILA for making the MegaDetector outputs publicly available, as well as for structuring the dataset and facilitating its access.

Adapted CLIP models (model weights):

- [New] WildCLIP_vitb16_t1.pth: CLIP model with the ViT-B/16 visual backbone trained on data with captions following template 1. Trained on both base and novel vocabulary (see paper for details).
- [New] WildCLIP_vitb16_t1_lwf.pth: CLIP model with the ViT-B/16 visual backbone trained on data with captions following template 1, and with the additional VR-LwF loss. Trained on both base and novel vocabulary (see paper for details).
- WildCLIP_vitb16_t1_base.pth: CLIP model with the ViT-B/16 visual backbone trained on data with captions following template 1. Model used for evaluation and trained on base vocabulary only. (Previously named WildCLIP_vitb16_t1.pth.)
- WildCLIP_vitb16_t1t7_lwf_base.pth: CLIP model with the ViT-B/16 visual backbone trained on data with captions following templates 1 to 7, and with the additional VR-LwF loss. Model used for evaluation and trained on base vocabulary only. (Previously named WildCLIP_vitb16_t1t7_lwf.pth.)

We also provide the CSV files containing the train / val / test splits. The train / test splits follow the camera-level split from LILA (https://lila.science/datasets/snapshot-serengeti). The validation split is custom, and also defined at the camera level.

- train_dataset_crops_single_animal_template_captions_T1T7_ID.csv: Train set with captions from templates 1 through 7 (column "all captions") or template 1 only (column "template 1").
- val_dataset_crops_single_animal_template_captions_T1T7_ID.csv: Validation set with captions from templates 1 through 7 (column "all captions") or template 1 only (column "template 1").
- test_dataset_crops_single_animal_template_captions_T1T8T10.csv: Test set with captions from templates 1, 8, 9 and 10 (column "all captions").

Details on how the models were trained can be found in the associated publication.
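To illustrate how the released weights can be used for text-based retrieval, below is a minimal sketch. It assumes the .pth files store a state dict compatible with an open_clip ViT-B-16 model; the checkpoint filename, example image path, and query text are placeholders, and the exact loading procedure is the one documented in the project GitHub repository.

```python
# Minimal retrieval sketch (assumption: checkpoint is a state dict loadable
# into open_clip's ViT-B-16; see the project repository for the exact format).
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Build a ViT-B/16 CLIP model and load the WildCLIP weights (illustrative path).
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained=None)
state_dict = torch.load("WildCLIP_vitb16_t1_lwf.pth", map_location=device)
model.load_state_dict(state_dict, strict=False)  # strict=False tolerates extra keys
model = model.to(device).eval()

tokenizer = open_clip.get_tokenizer("ViT-B-16")

# Encode a natural-language query and one MegaDetector crop.
text = tokenizer(["a lion lying in the grass at night"]).to(device)
image = preprocess(Image.open("example_crop.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    text_feat = model.encode_text(text)
    image_feat = model.encode_image(image)
    # L2-normalize, then use cosine similarity as the retrieval score.
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    score = (image_feat @ text_feat.T).item()

print(f"query-image similarity: {score:.3f}")
```

In practice, one would encode all crops once, cache the normalized image embeddings, and rank them by cosine similarity against each text query to retrieve matching camera-trap events.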
References:

If you use our code or weights, please cite:

@article{gabeff2024wildclip,
  title={WildCLIP: Scene and animal attribute retrieval from camera trap data with domain-adapted vision-language models},
  author={Gabeff, Valentin and Ru{\ss}wurm, Marc and Tuia, Devis and Mathis, Alexander},
  journal={International Journal of Computer Vision},
  pages={1--17},
  year={2024},
  publisher={Springer}
}

If you use the adapted Snapshot Serengeti data, please also cite their article:

@article{swanson2015snapshot,
  title={Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna},
  author={Swanson, Alexandra and Kosmala, Margaret and Lintott, Chris and Simpson, Robert and Smith, Arfon and Packer, Craig},
  journal={Scientific Data},
  volume={2},
  number={1},
  pages={1--14},
  year={2015},
  publisher={Nature Publishing Group}
}

Details