Abstract

With the exponential growth of video-based social networks, video retrieval using natural language is receiving ever-increasing attention. Most existing approaches tackle this task by extracting individual frame-level spatial features to represent the whole video, while ignoring visual pattern consistencies and intrinsic temporal relationships across different frames. Furthermore, the semantic correspondence between natural language queries and person-centric actions in videos has not been fully explored. To address these problems, we propose a novel binary representation learning framework, named Semantics-aware Spatial-temporal Binaries (S²Bin), which simultaneously considers spatial-temporal context and semantic relationships for cross-modal video retrieval. By exploiting the semantic relationships between the two modalities, S²Bin can efficiently and effectively generate binary codes for both videos and texts. In addition, we adopt an iterative optimization scheme to learn deep encoding functions with attribute-guided stochastic training. We evaluate our model on three video datasets, and the experimental results demonstrate that S²Bin outperforms state-of-the-art methods on various cross-modal video retrieval tasks.
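To make the binary cross-modal retrieval setting concrete, the following is a minimal sketch of how binary codes for videos and texts can be compared via Hamming distance. It is not the paper's S²Bin method: the random projections stand in for the learned deep encoding functions, and the code length, feature dimensions, and sign-based binarization are all illustrative assumptions.

```python
# Minimal sketch of cross-modal binary retrieval, assuming sign-based
# binarization of continuous embeddings and Hamming-distance ranking.
# The random projections below are placeholders for learned video/text
# encoders; all names and dimensions are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
CODE_BITS = 64                    # assumed binary code length
VIDEO_DIM, TEXT_DIM = 2048, 768   # assumed feature dimensions

# Stand-ins for the learned encoding functions of each modality.
W_video = rng.standard_normal((VIDEO_DIM, CODE_BITS))
W_text = rng.standard_normal((TEXT_DIM, CODE_BITS))

def binarize(features, projection):
    """Project continuous features and take the sign to get {-1, +1} codes."""
    return np.sign(features @ projection)

def hamming_rank(query_code, gallery_codes):
    """Rank gallery items by Hamming distance to the query code."""
    # For {-1, +1} codes: Hamming distance = (bits - dot product) / 2.
    dists = (CODE_BITS - gallery_codes @ query_code) / 2
    return np.argsort(dists)

# Toy example: retrieve videos from a gallery given one text query.
video_feats = rng.standard_normal((100, VIDEO_DIM))
text_query = rng.standard_normal(TEXT_DIM)

video_codes = binarize(video_feats, W_video)
query_code = binarize(text_query, W_text)
print("Top-5 retrieved video indices:", hamming_rank(query_code, video_codes)[:5])
```

Because both modalities are mapped into the same Hamming space, retrieval reduces to bitwise comparisons, which is what makes binary representations attractive for large-scale video search.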
