Authors: Chappuis, Christel; Mendez, Vincent Alexandre; Walt, Eliot; Lobry, Sylvain; Le Saux, Bertrand; Tuia, Devis
Title: Language Transformers for Remote Sensing Visual Question Answering
Type: conference paper
Published: 2022-07
Deposited: 2023-01-30
DOI: 10.1109/IGARSS46834.2022.9884036
Handle: https://infoscience.epfl.ch/handle/20.500.14299/194538
Keywords: Remote Sensing Visual Question Answering

Abstract: Remote sensing visual question answering (RSVQA) opens new avenues to promote the use of satellite data by interfacing satellite image analysis with natural language processing. Capitalizing on the remarkable advances in natural language processing and computer vision, RSVQA aims at finding an answer to a question formulated by a human user about a remote sensing image. This is achieved by extracting representations from the image and the question, and then fusing them into a joint representation. Focusing on the language part of the architecture, this study compares and evaluates the adequacy for the RSVQA task of two language models: a traditional recurrent neural network (Skip-thoughts) and a recent attention-based Transformer (BERT). We study whether large Transformer models are beneficial to the task and whether fine-tuning is needed for these models to perform at their best. Our findings show that both models benefit from fine-tuning the language model, and that RSVQA with BERT is slightly but consistently better when properly fine-tuned.
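The abstract describes the generic RSVQA pipeline: encode the question with a language model, encode the image with a vision model, fuse the two into a joint representation, and classify the answer. Below is a minimal sketch of that pipeline, not the paper's implementation; the model names (bert-base-uncased, ResNet-18), the joint dimension, and the element-wise-product fusion are illustrative assumptions. The freeze_language_model flag mirrors the paper's question of whether fine-tuning the language model is needed.

# Minimal RSVQA sketch (illustrative assumptions, not the authors' exact
# architecture): a language model encodes the question, a CNN encodes the
# image, and the two representations are fused into a joint one that a
# classifier maps to an answer over a fixed answer vocabulary.
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import BertModel, BertTokenizer


class RSVQAModel(nn.Module):
    def __init__(self, num_answers: int, joint_dim: int = 512,
                 freeze_language_model: bool = False):
        super().__init__()
        # Visual branch: ResNet-18 backbone with the final fc removed,
        # yielding 512-d image features (backbone choice is an assumption).
        self.visual = resnet18(weights=None)
        self.visual.fc = nn.Identity()
        # Language branch: BERT. The abstract's finding is that fine-tuning
        # it (freeze_language_model=False) gives the best results.
        self.language = BertModel.from_pretrained("bert-base-uncased")
        if freeze_language_model:
            for p in self.language.parameters():
                p.requires_grad = False
        # Project both modalities to a common dimension, then fuse them by
        # point-wise multiplication (a common fusion choice; an assumption).
        self.visual_proj = nn.Linear(512, joint_dim)
        self.text_proj = nn.Linear(self.language.config.hidden_size, joint_dim)
        self.classifier = nn.Sequential(
            nn.Linear(joint_dim, joint_dim),
            nn.ReLU(),
            nn.Linear(joint_dim, num_answers),
        )

    def forward(self, image, input_ids, attention_mask):
        v = torch.tanh(self.visual_proj(self.visual(image)))
        q = self.language(input_ids=input_ids,
                          attention_mask=attention_mask).pooler_output
        q = torch.tanh(self.text_proj(q))
        joint = v * q  # joint representation fusing vision and language
        return self.classifier(joint)


# Usage: score one image/question pair over a small answer vocabulary.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = RSVQAModel(num_answers=10)
tokens = tokenizer("Is there a road in the image?", return_tensors="pt")
image = torch.randn(1, 3, 224, 224)  # stand-in for a satellite image tensor
logits = model(image, tokens["input_ids"], tokens["attention_mask"])

Swapping the language branch for a recurrent encoder such as Skip-thoughts would reproduce the comparison the study makes, with the rest of the pipeline unchanged.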