Infoscience
conference paper

PAN-RSVQA: Vision Foundation Models as Pseudo-Annotators for Remote Sensing Visual Question Answering

Chappuis, Christel  
•
Sumbul, Gencer  
•
Montariol, Syrielle  
•
Lobry, Sylvain
•
Tuia, Devis
2025
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2025)

While the quantity of Earth observation (EO) images is constantly increasing, the benefits that can be derived from them are still limited by the technical expertise required to run information extraction pipelines. Remote Sensing Visual Question Answering (RSVQA) uses natural language to break this barrier and aims to make EO images usable by a wider, general public. Traditional RSVQA methods use a visual encoder to extract generic features from images, which are then fused with the features of the questions entered by users. Thanks to their multi-task nature, vision foundation models (VFMs) make it possible to go beyond such generic visual features: they can be seen as pseudo-annotators extracting diverse sets of features from a collection of inter-related tasks (detected objects, segmentation maps, scene descriptions, etc.). In this work, we propose PAN-RSVQA, a new method combining a VFM and its pseudo-annotations with RSVQA through a transformer-based multi-modal encoder. These pseudo-annotations bring diverse, naturally interpretable visual cues, as they are aligned with how humans reason about images; therefore, PAN-RSVQA not only exploits the large-scale training of VFMs but also enables accurate and interpretable RSVQA. Experiments on two datasets show results on par with the state of the art while enabling enhanced interpretation of the model predictions, which we analyze via sample visual perturbations and ablations of the role of each pseudo-annotator. In addition, PAN-RSVQA is modular and easily extendable to new pseudo-annotators from other VFMs.
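To make the fusion idea in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: features from each pseudo-annotator (e.g. detected objects, segmentation maps, scene descriptions) and from the question are projected to a shared width, concatenated as tokens, and fused in a transformer encoder whose CLS token feeds an answer classifier. All module names, feature dimensions, and the particular set of pseudo-annotators are assumptions for illustration.

import torch
import torch.nn as nn

class PanRsvqaSketch(nn.Module):
    # Hypothetical sketch of the transformer-based multi-modal fusion
    # described in the abstract; sizes and annotator set are assumptions.
    def __init__(self, d_model=256, n_answers=100, n_heads=8, n_layers=4):
        super().__init__()
        # One projection per input stream, each pre-encoded upstream into
        # fixed-size feature vectors (e.g. by the VFM and a text encoder).
        self.proj = nn.ModuleDict({
            "objects": nn.Linear(512, d_model),
            "segmentation": nn.Linear(512, d_model),
            "description": nn.Linear(512, d_model),
            "question": nn.Linear(768, d_model),
        })
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.head = nn.Linear(d_model, n_answers)  # answer classification

    def forward(self, features):
        # features: dict mapping stream name -> (batch, tokens, feat_dim)
        tokens = [self.proj[name](x) for name, x in features.items()]
        batch = tokens[0].shape[0]
        cls = self.cls.expand(batch, -1, -1)
        fused = self.encoder(torch.cat([cls] + tokens, dim=1))
        return self.head(fused[:, 0])  # predict the answer from the CLS token

# Dummy usage: two object tokens, one segmentation token, one description
# token, and eight question tokens, for a batch of four images.
model = PanRsvqaSketch()
feats = {
    "objects": torch.randn(4, 2, 512),
    "segmentation": torch.randn(4, 1, 512),
    "description": torch.randn(4, 1, 512),
    "question": torch.randn(4, 8, 768),
}
logits = model(feats)  # shape (4, n_answers)

Because each pseudo-annotator is just another token stream behind a projection in this sketch, adding a new one amounts to registering another entry in the projection dictionary, which mirrors the modularity the abstract claims.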

Details
Type
conference paper
DOI
10.1109/CVPRW67362.2025.00283
Scopus ID

2-s2.0-105017852217

Author(s)
Chappuis, Christel  

École Polytechnique Fédérale de Lausanne

Sumbul, Gencer  

École Polytechnique Fédérale de Lausanne

Montariol, Syrielle  

École Polytechnique Fédérale de Lausanne

Lobry, Sylvain

Laboratoire d’Informatique Paris Descartes

Tuia, Devis  

École Polytechnique Fédérale de Lausanne

Date Issued

2025

Publisher

IEEE Computer Society

Published in
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2025)
DOI of the book
https://doi.org/10.1109/CVPRW67362.2025
ISBN of the book

9798331599942

Start page

2996

End page

3007

Subjects

earth observation • multi-modality • pseudo-annotations • remote sensing • RSVQA • vision foundation models

Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
ECEO  
NLP  
Event name
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops
Event acronym
CVPRW 2025
Event place
Nashville, TN, USA
Event date
2025-06-11 - 2025-06-12

Available on Infoscience
October 14, 2025
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/254936