SVGC-AVA: 360-Degree Video Saliency Prediction With Spherical Vector-Based Graph Convolution and Audio-Visual Attention
Viewers of 360-degree videos are provided with both a visual modality that characterizes their surrounding views and an audio modality that indicates the direction of sound. Although both modalities are important for saliency prediction, little work has jointly exploited them, mainly because of the lack of audio-visual saliency datasets and the insufficient exploitation of multi-modal information. In this article, we first construct an audio-visual saliency dataset of 57 360-degree videos watched by 63 viewers. Through an in-depth analysis of the constructed dataset, we find that human gaze can be attracted by auditory cues, resulting in a more concentrated saliency map when the sound source's location is also provided. To jointly exploit the visual and audio features and their correlation, we further design a saliency prediction network for 360-degree videos (SVGC-AVA) based on spherical vector-based graph convolution and audio-visual attention. The proposed spherical vector-based graph convolution processes visual and audio features directly in the spherical domain, thus avoiding the projection distortion incurred by traditional CNN-based predictors. In addition, the audio-visual attention scheme explores self-modal and cross-modal correlations for both modalities, which are then processed hierarchically through the multi-scale U-Net structure of SVGC-AVA. Evaluations on both our dataset and public datasets validate that SVGC-AVA achieves higher prediction accuracy, both quantitatively and qualitatively.
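To make the cross-modal part of the audio-visual attention scheme concrete, the following is a minimal illustrative sketch (in PyTorch), not the authors' implementation: one modality's features serve as queries and attend to the other modality's features, with the result fused back through a residual connection. All module names, tensor shapes, and the single-head formulation are assumptions made for illustration.

```python
# Minimal sketch of cross-modal attention between visual and audio features.
# Shapes, names, and the single-head design are illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Queries come from one modality; keys/values from the other.
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x_query: torch.Tensor, x_context: torch.Tensor) -> torch.Tensor:
        # x_query:   (B, Nq, dim) features of the attending modality
        # x_context: (B, Nc, dim) features of the modality being attended to
        attn = torch.softmax(
            self.q(x_query) @ self.k(x_context).transpose(-2, -1) * self.scale, dim=-1
        )
        # Residual fusion: keep the original features, add attended context.
        return x_query + attn @ self.v(x_context)


# Usage: fuse spherical-graph visual node features with audio features
# (642 icosphere nodes and 16 audio tokens are assumed example sizes).
B, Nv, Na, D = 2, 642, 16, 64
visual = torch.randn(B, Nv, D)
audio = torch.randn(B, Na, D)
v2a = CrossModalAttention(D)        # visual queries attend to audio
a2v = CrossModalAttention(D)        # audio queries attend to visual
visual_fused = v2a(visual, audio)   # (B, Nv, D)
audio_fused = a2v(audio, visual)    # (B, Na, D)
```

In this sketch, applying the same module with the roles of the two modalities swapped yields the self-modal variant as well (query and context drawn from the same modality); the fused features would then feed the multi-scale U-Net structure described above.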
WOS:001173355700009
Date: 2024-01-01
Volume: 26
Pages: 3061-3076
REVIEWED
| Funder | Grant Number |
| --- | --- |
| National Natural Science Foundation of China | |