Infoscience
 
research article

SVGC-AVA: 360-Degree Video Saliency Prediction With Spherical Vector-Based Graph Convolution and Audio-Visual Attention

Yang, Qin
•
Li, Yuqi
•
Li, Chenglin
January 1, 2024
IEEE Transactions on Multimedia

Viewers of 360-degree videos are provided with both a visual modality that characterizes their surrounding views and an audio modality that indicates the direction of sound. Though both modalities are important for saliency prediction, little work has jointly exploited them, mainly due to the lack of audio-visual saliency datasets and insufficient exploitation of multi-modality. In this article, we first construct an audio-visual saliency dataset of 57 360-degree videos watched by 63 viewers. Through a deep analysis of the constructed dataset, we find that human gaze can be attracted by auditory cues, resulting in a more concentrated saliency map when the sound source's location is also provided. To jointly exploit the visual and audio features and their correlation, we further design a saliency prediction network for 360-degree videos (SVGC-AVA) based on spherical vector-based graph convolution and audio-visual attention. The proposed spherical vector-based graph convolution processes visual and audio features directly in the sphere domain, thus avoiding the projection distortion incurred by traditional CNN-based predictors. In addition, the audio-visual attention scheme explores self-modal and cross-modal correlations for both modalities, which are further processed hierarchically within the multi-scale U-Net structure of SVGC-AVA. Evaluations on both our dataset and public datasets validate that SVGC-AVA achieves higher prediction accuracy, both qualitatively and subjectively.
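The cross-modal attention idea mentioned in the abstract — features of one modality attending over features of the other — can be illustrated with a minimal scaled dot-product attention sketch. This is a plain-Python toy, not the paper's implementation: the function names and toy feature vectors below are assumptions for illustration only, and the actual SVGC-AVA model operates on learned projections of graph-structured spherical features.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: one modality's features (queries)
    attend over the other modality's features (keys/values).
    Illustrative sketch; real models add learned projections and batching."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy example: 2 visual-node queries attend over 3 audio-frame keys/values.
visual_q = [[1.0, 0.0], [0.0, 1.0]]
audio_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
audio_v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused = cross_modal_attention(visual_q, audio_k, audio_v)
```

Each fused visual feature is a convex combination of the audio features, weighted by query-key similarity; swapping the roles of the two modalities gives the symmetric audio-attends-to-visual direction, and using a modality's own features as queries, keys, and values gives the self-modal case.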

Type
research article
DOI
10.1109/TMM.2023.3306596
Web of Science ID
WOS:001173355700009
Author(s)
Yang, Qin
Li, Yuqi
Li, Chenglin
Wang, Hao
Yan, Sa
Wei, Li
Dai, Wenrui
Zou, Junni
Xiong, Hongkai
Frossard, Pascal
Date Issued
2024-01-01
Publisher
IEEE-Inst Electrical Electronics Engineers Inc
Published in
IEEE Transactions on Multimedia
Volume
26
Start page
3061
End page
3076
Subjects
  • Technology
  • Visualization
  • Feature Extraction
  • Convolution
  • Streaming Media
  • Correlation
  • Position Measurement
  • Kernel
  • 360-Degree Videos
  • Saliency Prediction
  • Spherical Vector-Based Graph Convolution
  • Audio-Visual Attention
Editorial or Peer reviewed
REVIEWED
Written at
EPFL
EPFL units
LTS4
Funder
National Natural Science Foundation of China
Available on Infoscience
April 17, 2024
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/207155
Contact: infoscience@epfl.ch

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved.