Title: Learning Robust and Adaptive Representations: from Interactions, for Interactions
Author: Liu, Yuejiang
Advisor: Alahi, Alexandre Massoud
Date: 2023-11-06
DOI: 10.5075/epfl-thesis-9863
Handle: https://infoscience.epfl.ch/handle/20.500.14299/202046
Language: English
Type: Doctoral thesis
Keywords: Machine Learning; Robot Learning; Distribution Shifts; Self-Supervised Learning; Causal Representation Learning; Test-Time Adaptation; Multi-Agent; Motion Forecasting

Abstract:
Interactions are ubiquitous in our world, ranging from social interactions between human individuals, to physical interactions between robots and objects, to mechanistic interactions among the components of an intelligent system. Despite their prevalence, modern representation learning techniques still face difficulties in modeling complex interactions, especially in test environments that are not identical to the training environment. In this dissertation, we aim to address this challenge by developing representations that can robustly generalize or efficiently adapt to unseen interactive settings.

In the first part, we focus on learning representations of multi-agent interactions, where the state distribution may differ between training and test in factors such as agent density and behavioral style. To this end, we devise an inductive bias through the self-attention mechanism, which allows for modeling the collective influence of neighboring agents for autonomous navigation in dense spaces. We then propose a contrastive learning algorithm that incorporates prior knowledge of negative examples to alleviate covariate shift in sequential decision making, thereby enabling more robust closed-loop operation. We further extend this approach to tackle other common distribution shifts in the motion context by enforcing modular structures in latent representations. We show that the resulting representations of interactions lead to stronger generalization and faster adaptation across environments.

In the second part, we shift our focus to learning representations of visual observations, where distribution shifts may arise from unseen categories, compositions, corruptions, or other nuanced factors. For embodied agents interacting with objects, we show that representations that disentangle independent entities boost robustness in out-of-distribution visual reasoning. In more general settings, we investigate a test-time paradigm that updates pre-trained representations based on unlabeled test examples. We first introduce a sampling algorithm for generative adversarial networks that recycles the interaction between the generator and the discriminator at test time to adapt the sampled distribution. We then present a principled analysis of test-time training via self-supervised learning, which inspires our design of a new algorithm for tackling severe visual distribution shifts. We finally examine the test-time adaptation framework under broader conditions, revealing three pitfalls that can undermine its efficacy in practice.
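As an illustration of the social attention idea mentioned in the first part, below is a minimal sketch, assuming PyTorch, of attention-based pooling over neighboring agents. The module names, dimensions, and hyperparameters are hypothetical and are not taken from the thesis.

```python
# Illustrative sketch: attention-weighted pooling of neighbor states into a
# single crowd representation (hypothetical names and dimensions).
import torch
import torch.nn as nn


class NeighborAttentionPooling(nn.Module):
    """Aggregate a variable number of neighbor features into one vector."""

    def __init__(self, feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        # Scores the relative importance of each neighbor.
        self.score_net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        # Embeds each neighbor before the weighted sum.
        self.value_net = nn.Linear(feat_dim, hidden_dim)

    def forward(self, neighbor_feats: torch.Tensor) -> torch.Tensor:
        # neighbor_feats: (batch, num_neighbors, feat_dim)
        scores = self.score_net(neighbor_feats)      # (batch, N, 1)
        weights = torch.softmax(scores, dim=1)       # attention over neighbors
        values = self.value_net(neighbor_feats)      # (batch, N, hidden_dim)
        # The weighted sum captures the collective influence of the crowd and
        # is invariant to the number and ordering of neighbors.
        return (weights * values).sum(dim=1)         # (batch, hidden_dim)


if __name__ == "__main__":
    pool = NeighborAttentionPooling(feat_dim=6)
    crowd_state = pool(torch.randn(2, 5, 6))  # 2 scenes, 5 neighbors each
    print(crowd_state.shape)                  # torch.Size([2, 64])
```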
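Similarly, the test-time training paradigm discussed in the second part can be sketched as updating a shared encoder on unlabeled test inputs through a self-supervised objective. The rotation-prediction pretext task and all function names below are illustrative assumptions, not the specific algorithms studied in the thesis.

```python
# Illustrative sketch: adapting a pre-trained encoder on an unlabeled test
# batch via a self-supervised rotation-prediction head (hypothetical setup).
import torch
import torch.nn as nn
import torch.nn.functional as F


def rotate_batch(x: torch.Tensor, k: int) -> torch.Tensor:
    """Rotate a batch of images by k * 90 degrees (pretext task input)."""
    return torch.rot90(x, k, dims=(2, 3))


def test_time_adapt(encoder: nn.Module, ssl_head: nn.Module,
                    x_test: torch.Tensor, steps: int = 10, lr: float = 1e-3):
    """Update the encoder on a single unlabeled test batch, then return it."""
    optimizer = torch.optim.SGD(encoder.parameters(), lr=lr)
    for _ in range(steps):
        # Build a 4-way rotation-prediction problem from the test batch.
        rotated = torch.cat([rotate_batch(x_test, k) for k in range(4)], dim=0)
        labels = torch.arange(4).repeat_interleave(x_test.size(0))
        logits = ssl_head(encoder(rotated))
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return encoder
```

After adaptation, the updated encoder would be paired with the original task head to make predictions on the same test data; this is only a schematic of the general paradigm.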