Geospatial information is embedded in both natural systems and human society, making it inherently multimodal and heterogeneous. It describes objects and events associated with geospatial locations on Earth, taking various forms such as remote sensing imagery, ground-level imagery, and textual descriptions.
With the rise of geospatial artificial intelligence (AI), multimodal learning has become key to integrating these diverse sources. However, the inherent heterogeneity across modalities poses significant challenges, constraining both the development and the generalizability of geospatial AI systems. Given the sparsity of aligned data across modalities, external knowledge offers a promising solution: by providing modality-agnostic semantics and priors that guide the multimodal learning process, it plays a critical role in bridging modality gaps.
Despite efforts to incorporate external knowledge into multimodal systems, the current understanding of how external knowledge integration influences multimodal complementarity, alignment, and representation remains limited, particularly within geospatial contexts. To address these challenges, this dissertation investigates knowledge-aware multimodal systems that integrate diverse types of knowledge, including commonsense and geospatial knowledge, across multiple modalities with a focus on geospatial vision-language and geo-localization tasks.
For geospatial vision-language applications, this dissertation investigates how external knowledge complements multimodal learning. To improve cross-modal alignment, KTIR incorporates external commonsense knowledge to bridge semantic gaps between remote sensing imagery and textual descriptions by expanding the semantic scope of the text. In addition, ConVQG integrates external knowledge into multimodal fusion through a contrastive visual question generation pipeline that produces knowledge-enriched, image-grounded questions.
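To make the alignment idea concrete, the sketch below illustrates knowledge-enriched image-text alignment in the spirit of KTIR: retrieved commonsense facts are appended to the caption before encoding, and a symmetric InfoNCE loss aligns the two modalities. This is a minimal illustration, not the dissertation's actual architecture; the function names, encoder dimensions, and temperature are illustrative assumptions.

```python
# Minimal sketch (assumed, not the exact KTIR method): expand the caption
# with retrieved knowledge, then align image and text embeddings with a
# symmetric InfoNCE contrastive loss.
import torch
import torch.nn.functional as F

def knowledge_expanded_text(caption: str, facts: list[str]) -> str:
    """Expand the caption's semantic scope with retrieved knowledge facts."""
    return caption + " " + " ".join(facts)

def symmetric_info_nce(img_emb: torch.Tensor,
                       txt_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0))         # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random features standing in for encoder outputs.
img_emb = torch.randn(8, 512)   # e.g. remote sensing image encoder output
txt_emb = torch.randn(8, 512)   # encoder output for the knowledge-expanded text
loss = symmetric_info_nce(img_emb, txt_emb)
```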
In the context of geo-localization, this dissertation develops generalizable representations for both retrieval-based and navigation-based approaches by incorporating external knowledge into the learning process. For retrieval-based geo-localization, ConGeo proposes a model-agnostic pipeline that integrates location priors into the contrastive learning objective to achieve robust cross-modal alignment between ground and aerial views. Building on this alignment, GeoExplorer addresses representation generalizability in active, navigation-based geo-localization, proposing a curiosity-driven reinforcement learning pipeline that emphasizes comprehensive environment modeling and exploration (see the sketch below).
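The following sketch shows one common way a curiosity-driven intrinsic reward can be computed, namely as the prediction error of a forward dynamics model: transitions the agent models poorly yield high reward, encouraging exploratory coverage of the environment. The networks, dimensions, and reward shaping here are assumptions for illustration and may differ from GeoExplorer's concrete design.

```python
# Minimal sketch (assumed) of a curiosity-style intrinsic reward:
# reward = prediction error of a learned forward dynamics model.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Predicts the next state embedding from the current state and action."""
    def __init__(self, state_dim: int = 128, action_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

def curiosity_reward(model: ForwardModel,
                     state: torch.Tensor,
                     action: torch.Tensor,
                     next_state: torch.Tensor) -> torch.Tensor:
    """Intrinsic reward: mean squared error of the forward prediction."""
    with torch.no_grad():
        pred = model(state, action)
    return 0.5 * (pred - next_state).pow(2).mean(dim=-1)

# Toy usage with random embeddings standing in for observation features.
fm = ForwardModel()
r_int = curiosity_reward(fm,
                         torch.randn(1, 128),  # current observation embedding
                         torch.eye(4)[[2]],    # one-hot action
                         torch.randn(1, 128))  # next observation embedding
```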
Taken together, the dissertation highlights the multifaceted complementarity of external knowledge in geospatial multimodal information and introduces knowledge-aware systems that leverage such knowledge to improve cross-modal alignment and enhance the generalizability of geospatial representations. With these contributions, the dissertation aims to foster knowledge-enriched, robust, and practically applicable geospatial AI systems that leverage the distinctiveness of geospatial information and accommodate real-world complexity.