Observations on Multi-Band Asynchrony in Distant Speech Recordings

Whenever a speech signal is captured by a microphone distant from the speaker, the acoustic response of the room introduces significant distortions. Signal-level methods such as dereverberation and beamforming exist to remove these distortions, and they greatly improve automatic speech recognition (ASR) performance ("what was said?"). It may seem natural to apply the same methods to speaker clustering ("who spoke when?") with distant microphones, for example when annotating a meeting recording for an enhanced browsing experience. Unfortunately, on a corpus of real meeting recordings, neither dereverberation nor beamforming gave any improvement on the speaker clustering task. The present technical report constitutes a first attempt to explain this failure, through a cross-correlation analysis between close-talking and distant microphone signals. The various frequency bands of the speech spectrum appear to become desynchronized when the speaker is 1 or 2 meters away from the microphone. Further directions of research are suggested to model this desynchronization.
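
To make the idea of a per-band cross-correlation analysis concrete, the following is a minimal sketch, not the procedure used in the report. It assumes NumPy, an ideal FFT-mask band split, a circular cross-correlation, and a synthetic white-noise "close-talking" signal whose bands are delayed by made-up amounts to mimic a desynchronized "distant" signal. All function names, band edges, and delay values are hypothetical.

```python
import numpy as np


def band_limit(x, fs, f_lo, f_hi):
    """Keep only the [f_lo, f_hi) Hz components of x using an ideal FFT mask."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < f_lo) | (freqs >= f_hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))


def best_lag(ref, obs, max_lag):
    """Lag in samples maximizing the circular cross-correlation of ref and obs.

    A positive result means obs is delayed with respect to ref.
    """
    lags = np.arange(-max_lag, max_lag + 1)
    scores = [np.dot(ref, np.roll(obs, -lag)) for lag in lags]
    return int(lags[int(np.argmax(scores))])


def per_band_lags(close, distant, fs, band_edges, max_lag):
    """Estimate one lag per frequency band between the two signals."""
    return [
        best_lag(band_limit(close, fs, lo, hi),
                 band_limit(distant, fs, lo, hi),
                 max_lag)
        for lo, hi in zip(band_edges[:-1], band_edges[1:])
    ]


# Synthetic illustration only: these values are assumptions, not measurements.
fs = 16000                                # sampling rate in Hz
band_edges = [0, 1000, 2000, 4000, 8000]  # hypothetical band split in Hz
true_delays = [3, 7, 12, 20]              # hypothetical per-band delays in samples

rng = np.random.default_rng(0)
close = rng.standard_normal(4096)         # stand-in for the close-talking signal

# Build the "distant" signal by delaying each band of the close signal differently.
distant = np.zeros_like(close)
for (lo, hi), delay in zip(zip(band_edges[:-1], band_edges[1:]), true_delays):
    distant += np.roll(band_limit(close, fs, lo, hi), delay)

estimated = per_band_lags(close, distant, fs, band_edges, max_lag=32)
print("true delays     :", true_delays)
print("estimated delays:", estimated)
```

If the bands of a real distant recording were synchronized, the estimated lags would be roughly equal across bands (a single propagation delay). Lags that differ from band to band would be one symptom of the desynchronization described above. Real recordings would additionally call for proper filter banks and short-time analysis, which this sketch leaves out.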
