Temporal Human Visual Attention in Window Views: Dynamic Gaze Pattern Analysis and Deep Learning-Based Spatio-Temporal Saliency Estimation
Windows in modern architectural environments serve as essential interfaces connecting indoor spaces to dynamic outdoor views. Although traditional view assessment frameworks have provided valuable insight into static view properties, they fall short of capturing the temporal and dynamic aspects of real-world scenes. Addressing this gap, this thesis introduces a novel approach that combines eye tracking data, advanced computational analysis, and deep learning to model how occupants visually engage with window views over time.
Using the new ViewOut dataset, which pairs virtual reality-based gaze tracking with real-time video recordings of real-world scenes so that views contain genuinely moving content, this research systematically investigates the factors that influence visual attention in view-out scenarios. Our analysis reveals that:
• Primary visual features (e.g., contrast and color saturation) attract attention significantly above chance level, even when controlling for general fixation tendencies (a comparison of the kind sketched below).
• Depth cues shift attention toward distant elements, with higher fixation counts on background features.
• Human figures attract stronger attention than vehicles, and both receive significantly more fixations than the average fixation density across scenes, highlighting the particular salience of social and semantic objects.
• Dynamic objects, such as moving vehicles and pedestrians, capture and sustain attention significantly more than static elements.
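One common way to test whether a low-level feature attracts attention above chance while controlling for general fixation tendencies (such as center bias) is to compare feature values at recorded fixations against fixations borrowed from other scenes. The sketch below illustrates that kind of comparison; it is not the thesis's exact procedure, and the function names, array layouts, and toy data are assumptions.

# Hedged sketch: permutation-style check of whether a feature (e.g., contrast)
# is higher at fixated locations than expected from general viewing biases.
# Illustrative only; not the analysis pipeline used in the thesis.
import numpy as np

def feature_at_fixations(feature_map: np.ndarray, fixations: np.ndarray) -> float:
    """Mean feature value at (row, col) fixation coordinates for one scene."""
    rows, cols = fixations[:, 0], fixations[:, 1]
    return float(feature_map[rows, cols].mean())

def chance_level(feature_map: np.ndarray, other_scene_fixations: list) -> float:
    """Baseline: evaluate the same map at fixations recorded on *other* scenes,
    which preserves viewers' general spatial biases (e.g., center bias)."""
    values = [feature_at_fixations(feature_map, fx) for fx in other_scene_fixations]
    return float(np.mean(values))

# Toy usage with random data standing in for one scene's contrast map and gaze.
rng = np.random.default_rng(0)
contrast = rng.random((720, 1280))
scene_fix = rng.integers(0, [720, 1280], size=(50, 2))
other_fix = [rng.integers(0, [720, 1280], size=(50, 2)) for _ in range(20)]
print(f"observed={feature_at_fixations(contrast, scene_fix):.3f} "
      f"vs. chance={chance_level(contrast, other_fix):.3f}")

A feature is then said to attract attention above chance when the observed value reliably exceeds this shuffled baseline across scenes and observers.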
Based on these findings, this work develops the Spatio-Temporal Attentive Message Passing Graph Neural Network (STAMP-GNN), a deep learning model for saliency prediction across multiple input modalities (images or videos) and prediction tasks (global or temporal attention patterns). Key innovations of this model include:
• An attentive message passing mechanism that captures spatio-temporal relationships within videos (illustrated by the minimal sketch after this list).
• The ability to predict temporal attention patterns from both image and video inputs while also improving global saliency predictions.
• Competitive performance on diverse saliency benchmarks, including our own ViewOut dataset as well as the standard SALICON and DHF1K datasets, demonstrating the model's effectiveness across different contexts.
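The attentive message passing idea can be pictured as follows: graph nodes hold per-frame region features, edges link spatial neighbours and temporally adjacent regions, and learned attention weights decide how much each neighbour contributes to a node's update. The PyTorch code below is a minimal, assumption-laden sketch of such a layer, not the actual STAMP-GNN implementation; the class name, update rule, and graph layout are hypothetical.

# Hedged sketch of one attentive message passing step over a spatio-temporal
# graph. Nodes: region features from consecutive frames; edge_index: (2, E)
# source/target pairs covering spatial and temporal neighbours.
import torch
import torch.nn as nn

class AttentiveMessagePassing(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)  # node state update from aggregated messages

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) node features; edge_index: (2, E) integer indices.
        src, dst = edge_index
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Scaled dot-product attention logit for each edge.
        logits = (q[dst] * k[src]).sum(-1) / x.size(-1) ** 0.5
        weights = torch.exp(logits - logits.max())
        # Normalise over the incoming edges of each target node (softmax per node).
        denom = torch.zeros(x.size(0), device=x.device).index_add_(0, dst, weights)
        alpha = weights / denom[dst].clamp(min=1e-12)
        # Aggregate attention-weighted messages, then update each node's state.
        messages = torch.zeros_like(x).index_add_(0, dst, alpha.unsqueeze(-1) * v[src])
        return self.update(messages, x)

Stacking several such steps lets information propagate across both space and time before a readout head maps node states to per-frame saliency maps; that readout is omitted here.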
The results demonstrate how this interdisciplinary approach, which integrates computer vision techniques with traditional built environment analysis, can advance view quality assessment beyond static evaluations to capture dynamic visual engagement patterns. This research provides a foundation for incorporating dynamic gaze behaviors into architectural design, enabling more engaging and user-centric environments.
EPFL
2025-08-19
Lausanne