Visual Saliency Prediction for Natural Images, Comics Panels, and Comics Pages
Recent years have seen remarkable advances in saliency estimation, driven mainly by deep learning models that leverage the widespread availability of real-world images. However, saliency is profoundly shaped by the intricacies of the human visual attention system, and capturing it requires more than large-scale data and powerful models. To address this, we incorporate specific characteristics of the human visual system into deep learning approaches, with the aim of improving saliency prediction. Moreover, current saliency prediction approaches do not generalize to domains with limited data, such as cartoons, sketches, or comics; this challenge is compounded by the disparity between the photographic domain and these data-sparse domains. To bridge the gap between deep learning approaches and the human visual system, and to overcome the limitations of saliency prediction in the comics domain, we adopt a multifaceted approach: we model dissimilarities among objects within content-rich scenes to account for inter-object relationships; we consider the temporal dynamics of attention, since attention evolves over time; and we introduce a data augmentation method based on photometric alterations for saliency prediction. Together, these methods lead to a more precise and dynamic understanding of saliency in both natural images and comics.

In the first research axis, we introduce a saliency prediction model that explicitly models object dissimilarities in content-rich real-world photographic scenes. We compute the size and appearance dissimilarities of objects and fuse them with deep saliency features, and we show that incorporating these dissimilarities enhances saliency prediction in natural images.

In the second axis, we study the temporal dimension of saliency. Because observers look at different regions of an image over time, we exploit temporal information to improve saliency prediction; specifically, we learn time-specific saliency predictions. We show that the temporally evolving patterns of human attention play an important role in saliency prediction for natural images.

Saliency prediction models are also constrained by the limited diversity and quantity of labeled data. Standard data augmentation techniques such as cropping and rotation change the scene composition and hence affect saliency. We therefore introduce a novel data augmentation method for deep saliency prediction that edits contrast, brightness, and color while preserving the overall structure of the scene. This approach enables us to generate images that closely resemble the photometric characteristics of the target domains.

Lastly, we analyze these methods in the domain of comics, which features stylized elements, sequential reading, and artistic use of brightness, contrast, and color to emphasize story elements and convey emotions. We mitigate the disparities between saliency prediction in natural images and comics through our earlier contributions, which encompass object dissimilarity, temporal dynamics, and adjustments to brightness, color, and contrast.

In summary, we study visual attention, gaze behavior, and their estimation with deep neural networks in the context of natural images and comics. We advance the understanding of visual attention and saliency prediction in both domains, pushing the boundaries of saliency prediction across diverse visual domains.
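To make the object-dissimilarity idea concrete, the sketch below is a minimal, hypothetical illustration rather than the model described in the thesis: given segmented objects, it computes per-object size and appearance dissimilarity scores of the kind that could be fused with deep saliency features. The tensor shapes, the `object_dissimilarities` helper, and the choice of Euclidean feature distance are assumptions for illustration.

```python
# Minimal sketch (not the thesis implementation): per-object size and
# appearance dissimilarities that could be fused with deep saliency features.
# Shapes and the Euclidean feature distance are illustrative assumptions.
import torch

def object_dissimilarities(masks: torch.Tensor, feats: torch.Tensor):
    """masks: (N, H, W) binary object masks; feats: (N, D) appearance features.

    Returns two (N,) tensors: each object's mean size and appearance
    dissimilarity with respect to all other objects. Assumes N >= 2.
    """
    n = masks.shape[0]
    sizes = masks.flatten(1).sum(dim=1).float()          # object areas in pixels
    size_diff = (sizes[:, None] - sizes[None, :]).abs()  # pairwise size gaps
    app_diff = torch.cdist(feats, feats)                 # pairwise feature distances
    # Diagonal entries are zero, so summing each row and dividing by (n - 1)
    # averages an object's dissimilarity over the other objects.
    return size_diff.sum(dim=1) / (n - 1), app_diff.sum(dim=1) / (n - 1)
```

Similarly, the photometric augmentation can be sketched with standard image operations; the enhancement ranges below are hypothetical, not the values used in the thesis. The key property is that only brightness, contrast, and color are perturbed, so the scene layout, and therefore the ground-truth saliency map, remains valid for the augmented image.

```python
# Minimal sketch of structure-preserving photometric augmentation
# (the jitter ranges are hypothetical and would need tuning per domain).
import random
from PIL import Image, ImageEnhance

def photometric_augment(image: Image.Image,
                        brightness=(0.7, 1.3),
                        contrast=(0.7, 1.3),
                        color=(0.5, 1.5)) -> Image.Image:
    """Randomly perturb brightness, contrast, and color saturation.

    Geometry is untouched, so the original fixation map still applies,
    unlike with crops or rotations.
    """
    image = ImageEnhance.Brightness(image).enhance(random.uniform(*brightness))
    image = ImageEnhance.Contrast(image).enhance(random.uniform(*contrast))
    image = ImageEnhance.Color(image).enhance(random.uniform(*color))
    return image
```

Such edits can in principle be steered toward a target domain's photometric statistics, for example the distinctive contrast and color palettes of comics.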