Multi-task prompt-RSVQA to explicitly count objects on aerial images

Tartini-Chappuis, Christel; Sertic, Charlotte; Santacroce, Nicolas; Castillo Navarro, Javiera; Lobry, Sylvain; Le Saux, Bertrand; Tuia, Devis

conference paper not in proceedings

September 1, 2023

British Machine Vision Conference (BMVC) workshops

Introduced to enable a wider use of Earth Observation images using natural language, Remote Sensing Visual Question Answering (RSVQA) remains a challenging task, in particular for questions related to counting. To address this specific challenge, we propose a modular Multi-task prompt-RSVQA model based on object detection and question answering modules. By creating a semantic bottleneck describing the image and providing a visual answer, our model allows users to assess the visual grounding of the answer and better interpret the prediction. A set of ablation studies is designed to assess the contributions of the different modules, and evaluation metrics are discussed for a finer-grained assessment. Experiments demonstrate competitive results against literature baselines and a zero-shot VQA model. In particular, our proposed model predicts answers for numerical Counting questions that are consistently closer in distance to the ground truth.

Name: 6.pdf
Type: N/A
Access type: openaccess
License Condition: CC BY
Size: 2.46 MB
Format: Adobe PDF
Checksum (MD5): f7cb4ed25807e8a4e6b635af01e8de3e