Report

Speech Enhancement and Recognition in Meetings with an Audio-Visual Sensor Array

We address the problem of distant speech acquisition in multi-party meetings using multiple microphones and cameras. Microphone array beamforming techniques offer a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering and directional discrimination. Beamforming techniques rely on knowledge of the speaker location. In this paper, we present an integrated approach in which an audio-visual multi-person tracker is used to track active speakers with high accuracy. Speech enhancement is then achieved using microphone array beamforming followed by a novel post-filtering stage. Finally, speech recognition is performed to evaluate the quality of the enhanced speech signal. The approach is evaluated on data recorded in a real meeting room for stationary-speaker, moving-speaker and overlapping-speech scenarios. The results show that the speech enhancement and recognition performance achieved using our approach is significantly better than that of a single table-top microphone and comparable to that of a lapel microphone in all scenarios. The results also indicate that the audio-visual system performs significantly better than an audio-only system in terms of both enhancement and recognition. This shows that the accurate speaker tracking provided by the audio-visual sensor array improves recognition performance in a microphone-array-based speech recognition system.
