Extraction of Audio Features Specific to Speech using Information Theory and Differential Evolution

We present a method that uses an information-theoretic framework to extract optimized audio features guided by video information. A simple measure of mutual information (MI) between the resulting audio features and the video features makes it possible to detect the active speaker among several candidates. Our method involves the optimization of an MI-based objective function. No approximation is introduced in solving this optimization problem, either in the estimation of the probability density functions (pdfs) of the features or in the cost function itself. The pdfs are estimated from the samples using a non-parametric approach. For the optimization itself, three methods (one local and two global) are compared in this paper. The Differential Evolution algorithm is shown to perform outstandingly well on our problem and is therefore retained. Two information-theoretic optimization criteria are compared, and their ability to extract audio features specific to speech is discussed. As a result, our method achieves a speaker detection rate of 100% on our test sequences and 95% on a state-of-the-art sequence.
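
The detection step described above, choosing the candidate whose video features share the most MI with the audio features, can be illustrated with a minimal sketch. It assumes a plain histogram (plug-in) estimator as one possible non-parametric pdf estimate, which may differ from the estimator used in the paper. The function names, the bin count, and the synthetic data are hypothetical. The audio feature is taken as given here, so the MI-based optimization of that feature (for example by Differential Evolution) is not shown.

```python
import math
import random
from collections import Counter


def _bin_indices(values, bins):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant signal: every sample falls in bin 0
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=8):
    """Histogram estimate of the MI (in bits) between two 1-D sequences.

    Assumption: a plug-in histogram estimator stands in for the
    non-parametric pdf estimation mentioned in the abstract.
    """
    bx, by = _bin_indices(x, bins), _bin_indices(y, bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    cx, cy = Counter(bx), Counter(by)
    # p(i,j) * log2( p(i,j) / (p(i) p(j)) ), written with raw counts
    return sum(
        (c / n) * math.log2(c * n / (cx[i] * cy[j]))
        for (i, j), c in joint.items()
    )


def detect_active_speaker(audio_feature, video_features):
    """Return the index of the candidate with the highest audio-video MI."""
    scores = [mutual_information(audio_feature, v) for v in video_features]
    return scores.index(max(scores)), scores


# Synthetic illustration (hypothetical data, not the paper's sequences):
# candidate 1's video feature depends on the audio, candidate 0's does not.
rng = random.Random(0)
n = 500
audio = [rng.gauss(0, 1) for _ in range(n)]
speaker0 = [rng.gauss(0, 1) for _ in range(n)]
speaker1 = [a + 0.3 * rng.gauss(0, 1) for a in audio]

active, scores = detect_active_speaker(audio, [speaker0, speaker1])
```

In this toy setting the dependent candidate receives the larger MI score. In the method summarized above, the audio feature itself would first be optimized against an MI-based criterion before this comparison is made.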
