This paper presents the MuMMER data set, a data set for human-robot interaction scenarios that is available for research purposes. It comprises 1 h 29 min of multimodal recordings of people interacting with the social robot Pepper in entertainment scenarios, such as quizzes, chat, and route guidance. In the 33 clips (1 to 4 min long) recorded from the robot's point of view, the participants interact with the robot in an unconstrained manner.
The data set exhibits interesting features and difficulties, such as people leaving the field of view, robot motion (head rotations of the camera embedded in the head), and varying illumination conditions. The data set contains color and depth videos from a Kinect v2 and an Intel D435, as well as the video from Pepper.
All visible faces and identities in the data set were manually annotated, keeping the identities consistent across time and clips. The goal of the data set is to evaluate perception algorithms in multi-party human-robot interaction, in particular re-identification after a track is lost, as this ability is crucial for maintaining the dialog history. The data set can easily be extended with other types of annotations.
We also present a benchmark on this data set that should serve as a baseline for future comparison. The baseline system, IHPER2 (Idiap Human Perception system), is available for research and is evaluated on the MuMMER data set. We show that it obtains an identity precision and recall of approximately 80% and a MOTA score above 80%.
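For reference, the MOTA score reported above follows the standard CLEAR-MOT definition: one minus the ratio of misses, false positives, and identity switches to the total number of ground-truth objects. The following is a minimal sketch of that formula (the function name and the per-frame count dictionaries are illustrative, not part of the MuMMER tools):

```python
def mota(frames):
    """Compute the CLEAR-MOT MOTA score.

    frames: list of per-frame dicts with counts:
      'fn'   - misses (ground-truth objects not matched)
      'fp'   - false positives (detections with no ground truth)
      'idsw' - identity switches
      'gt'   - number of ground-truth objects in the frame

    MOTA = 1 - (FN + FP + IDSW) / GT, accumulated over all frames.
    The score can be negative when errors outnumber ground-truth objects.
    """
    fn = sum(f["fn"] for f in frames)
    fp = sum(f["fp"] for f in frames)
    idsw = sum(f["idsw"] for f in frames)
    gt = sum(f["gt"] for f in frames)
    return 1.0 - (fn + fp + idsw) / gt


# Example: 1000 ground-truth boxes, 80 misses, 60 false positives,
# 10 identity switches over a sequence
score = mota([{"fn": 80, "fp": 60, "idsw": 10, "gt": 1000}])
# → 0.85
```

Note that MOTA penalizes identity switches only at the frame where they occur, which is why identity precision and recall are reported alongside it to capture the long-term consistency that matters for dialog history.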