Abstract

Auditory and visual cues are important sensory inputs for both biological and artificial systems, providing crucial information for navigating environments and recognizing object categories, animals, and people. How to combine these two sensory channels effectively is still an open issue. As a step towards this goal, this paper presents a comparison of three multi-modal integration strategies for audio-visual object category detection. We consider a high-level and a low-level cue integration approach, both biologically motivated, and compare them with a mid-level cue integration scheme. All three integration methods are based on the least squares support vector machine (LS-SVM) algorithm and state-of-the-art audio and visual feature representations. We conducted experiments on two audio-visual object categories, dogs and guitars, which present different visual and auditory characteristics. Results show that the high-level integration scheme consistently outperforms both the single-cue methods and the other two integration schemes. These findings agree with results from neuroscience and suggest that high-level integration is the most suitable approach for multi-modal cue integration in artificial cognitive systems.
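
As a rough illustration of the three strategies compared above, the sketch below trains LS-SVM classifiers under low-level (feature concatenation), mid-level (kernel combination), and high-level (decision-score fusion) integration. It is a minimal sketch assuming the label-regression form of the LS-SVM, a Gaussian RBF kernel, and equal fusion weights; all function names and parameters are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian RBF kernel between two sets of row vectors.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_train(K, y, gamma=1.0):
    """Fit LS-SVM dual variables from a precomputed kernel matrix K,
    regressing directly on the +/-1 labels. Solves the linear system
    [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]  # bias b, coefficients alpha

def lssvm_score(Ktest, alpha, b):
    # Decision score f(x) = sum_i alpha_i k(x, x_i) + b per test point.
    return Ktest @ alpha + b

def classify(Xa_tr, Xv_tr, y, Xa_te, Xv_te, strategy="high"):
    """Predict +/-1 labels from audio (Xa) and visual (Xv) features
    using one of the three hypothetical integration strategies."""
    if strategy == "low":
        # Low-level: concatenate features, single kernel and classifier.
        Xtr = np.hstack([Xa_tr, Xv_tr])
        Xte = np.hstack([Xa_te, Xv_te])
        b, a = lssvm_train(rbf_kernel(Xtr, Xtr), y)
        return np.sign(lssvm_score(rbf_kernel(Xte, Xtr), a, b))
    if strategy == "mid":
        # Mid-level: weighted sum of per-cue kernels, single classifier.
        w = 0.5
        K = w * rbf_kernel(Xa_tr, Xa_tr) + (1 - w) * rbf_kernel(Xv_tr, Xv_tr)
        Kt = w * rbf_kernel(Xa_te, Xa_tr) + (1 - w) * rbf_kernel(Xv_te, Xv_tr)
        b, a = lssvm_train(K, y)
        return np.sign(lssvm_score(Kt, a, b))
    # High-level: independent per-cue classifiers, fuse decision scores.
    ba, aa = lssvm_train(rbf_kernel(Xa_tr, Xa_tr), y)
    bv, av = lssvm_train(rbf_kernel(Xv_tr, Xv_tr), y)
    sa = lssvm_score(rbf_kernel(Xa_te, Xa_tr), aa, ba)
    sv = lssvm_score(rbf_kernel(Xv_te, Xv_tr), av, bv)
    return np.sign(0.5 * sa + 0.5 * sv)

if __name__ == "__main__":
    # Toy demo on synthetic "audio" and "visual" features.
    rng = np.random.default_rng(0)
    y = np.repeat([1.0, -1.0], 10)
    Xa = rng.normal(y[:, None], 1.0, (20, 5))
    Xv = rng.normal(y[:, None], 1.0, (20, 8))
    for s in ("low", "mid", "high"):
        pred = classify(Xa, Xv, y, Xa, Xv, strategy=s)
        print(s, "training accuracy:", (pred == y).mean())
```

The label-regression formulation is chosen here because it reduces LS-SVM training to a single linear solve; the relative weighting of the two cues in the mid- and high-level variants would in practice be tuned by cross-validation.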
