CrossFeat: Semantic Cross-modal Attention for Pedestrian Behavior Forecasting
Forecasting pedestrian behaviors is essential for autonomous vehicles to ensure safety in urban scenarios. Previous works addressed this problem based on motion alone, omitting additional behavioral cues that help in understanding pedestrians' true intentions. We address the problem of forecasting pedestrian actions through joint reasoning about pedestrians' past behaviors and their surrounding environments. For this, we propose a Transformer-based feature fusion approach, in which multi-modal inputs about pedestrians and environments are all mapped into a common space and then jointly processed through self- and cross-attention mechanisms to take context into account. We also use a semantic segmentation map of the current input frame, rather than the full temporal visual stream, to further focus on semantic reasoning. We experimentally validate and analyze our approach on two benchmarks on pedestrian crossing and stop-and-go motion changes, which rely on several standard self-driving datasets centered around interactions with pedestrians (JAAD, PIE, TITAN), and show that our semantic joint reasoning yields state-of-the-art results.
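To make the fusion idea concrete, below is a minimal sketch, assuming a PyTorch implementation, of how multi-modal inputs might be projected into a common space and combined through self- and cross-attention. The module names, feature dimensions, and the single-logit classification head are illustrative assumptions for exposition, not the paper's actual architecture.

```python
# Minimal sketch (not the authors' code): project a pedestrian motion stream and
# semantic-map features into a common space, then apply self- and cross-attention.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, motion_dim=4, semantic_dim=256, d_model=128, num_heads=4):
        super().__init__()
        # Map each modality into a shared embedding space.
        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.semantic_proj = nn.Linear(semantic_dim, d_model)
        # Self-attention over the pedestrian's past-behavior tokens.
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Cross-attention: motion tokens query the semantic-map tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)  # e.g. crossing / not-crossing logit (assumed)

    def forward(self, motion_seq, semantic_tokens):
        # motion_seq: (B, T, motion_dim) past per-frame pedestrian features
        # semantic_tokens: (B, N, semantic_dim) flattened segmentation-map features
        m = self.motion_proj(motion_seq)
        s = self.semantic_proj(semantic_tokens)
        m, _ = self.self_attn(m, m, m)        # contextualize past behavior
        fused, _ = self.cross_attn(m, s, s)   # inject scene context into motion tokens
        return self.head(fused.mean(dim=1))   # pooled action logit


if __name__ == "__main__":
    model = CrossModalFusion()
    logits = model(torch.randn(2, 16, 4), torch.randn(2, 196, 256))
    print(logits.shape)  # torch.Size([2, 1])
```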