The Vernissage Corpus: A Multimodal Human-Robot-Interaction Dataset

We introduce a new multimodal interaction dataset with extensive annotations, recorded in a conversational Human-Robot-Interaction (HRI) scenario. It has been recorded and annotated to benchmark many perceptual tasks relevant to enabling a robot to converse with multiple humans: speaker localization, keyword spotting, and speech recognition in the audio domain; tracking, pose estimation, nodding detection, and visual focus of attention estimation in the visual domain; and audio-visual tasks such as addressee detection. Some of these tasks could benefit from information sensed from several modalities and from the recorded states of the robot. In contrast to recordings made with a static camera, this corpus involves the head movement of a humanoid robot (due to gaze changes and nodding), which makes tracking challenging. In addition, the significant background noise present in a real HRI setting makes tasks in the auditory domain more challenging. From the interaction point of view, our scenario, in which the robot explains paintings in a room and then quizzes the participants, makes it possible to analyze the quality of the interaction and the behavior of the human interaction partners.
