Humans integrate multisensory information to reduce perceptual uncertainty when perceiving the world and self. Integration fails, however, if a common causality is not attributed to the sensory signals, as would occur in conditions of spatiotemporal discrepancies. In the case of passive self-motion, visual and vestibular cues are integrated according to statistical optimality, yet the extent of cue conflicts that do not compromise this optimality is currently underexplored. Here, we investigate whether human subjects can learn to integrate two arbitrary, but co-occurring, visual and vestibular cues of self-motion. Participants made size comparisons between two successive whole-body rotations using only visual, only vestibular, and both modalities together. The vestibular stimulus provided a yaw self-rotation cue, the visual a roll (Experiment 1) or pitch (Experiment 2) rotation cue. Experimentally measured thresholds in the bimodal condition were compared with theoretical predictions derived from the single-cue thresholds. Our results show that human subjects combine and optimally integrate vestibular and visual information, each signaling self-motion around a different rotation axis (yaw vs. roll and yaw vs. pitch). This finding suggests that the experience of two temporally co-occurring but spatially unrelated self-motion cues leads to inferring a common cause for these two initially unrelated sources of information about self-motion. We discuss our results in terms of specific task demands, cross-modal adaptation, and spatial compatibility. The importance of these results for the understanding of bodily illusions is also discussed.