Infoscience

Thesis

A multimodal pattern recognition framework for speaker detection

Speaker detection is an important component of a speech-based user interface. Audiovisual speaker detection, speech and speaker recognition, and speech synthesis, for example, have many applications in human-computer interaction, multimedia content indexing, biometrics, etc. Generally speaking, any interface that relies on speech for communication requires an estimate of the user's speaking state (i.e. whether or not he/she is speaking to the system) in order to function reliably. One therefore needs to identify the speaker and to discriminate him or her from other users and from background noise. A human observer would perform such a task very easily, although this decision results from a complex cognitive process referred to as decision-making. This process starts with the acquisition by the human being of information about the environment through each of the five senses. The brain then integrates these multiple sources of information. A remarkable property of this multi-sensory integration by the brain, as pointed out by the cognitive sciences, is that stimuli of different modalities are perceived as originating from a single source, provided they are synchronized in space and time. A speaker is a bimodal source jointly emitting an auditory signal and a visual signal (the motion of the articulators during speech production). The two signals co-occur in space and time. This property allows us, as human observers, to discriminate between a speaking mouth and a mouth whose motion is unrelated to the auditory signal. This dissertation deals with the modelling of such a complex decision-making process, using a pattern recognition procedure. A pattern recognition process comprises all the stages of an investigation, from data acquisition to classification and assessment of the results. In the audiovisual speaker detection problem, tackled more specifically in this thesis, the data are acquired using a single microphone and a single camera.
The pattern recognizer integrates and combines these two modalities to perform its task and is therefore termed "multimodal". This multimodal approach is expected to increase the performance of the system, but it also raises many questions, such as what should be fused, when in the decision process this fusion should take place, and how it is to be achieved. This thesis answers each of these questions by proposing detailed solutions for each step of the classification process. The basic principle is to evaluate the synchrony between the audio and video features extracted from potentially speaking mouths, in order to classify each mouth as speaking or not. This synchrony is evaluated through a mutual-information-based function. A key to success is the extraction of suitable features. The audiovisual data, once acquired and represented in a tractable way, are therefore processed through an information-theoretic feature extraction framework. This feature extraction framework uses the two modalities jointly in a feature-level fusion scheme. In this way, the information originating from the common source is recovered while the independent noise is discarded. This approach is shown to minimize the probability of committing an error on the source estimate. These optimal features are fed as inputs to the classifier, which is defined through a hypothesis testing approach. Using the two modalities jointly, it outputs a single decision about the class label of each candidate mouth region ("speaker" or "non-speaker"). The acoustic and visual information is therefore combined at both the feature and the decision levels, so the method can be described as a hybrid fusion method. The hypothesis testing approach provides a means of evaluating the performance not only of the classifier itself but also of the whole pattern recognition system. In particular, the added value offered by the feature extraction step can be assessed.
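The synchrony principle described above can be illustrated with a minimal sketch. It is not the method of the thesis: the histogram (plug-in) estimator, the number of bins, the fixed threshold that stands in for the hypothesis test, and the names `quantize`, `mutual_information` and `classify_mouths` are all assumptions made for illustration. Each candidate mouth is represented by a one-dimensional video feature sequence aligned frame by frame with a one-dimensional audio feature sequence.

```python
import math
from collections import Counter


def quantize(signal, n_bins=8):
    """Map a sequence of floats to integer bin indices in [0, n_bins)."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        # A constant signal carries no information: use a single bin.
        return [0] * len(signal)
    width = (hi - lo) / n_bins
    # Clamp so that the maximum value falls in the last bin.
    return [min(int((x - lo) / width), n_bins - 1) for x in signal]


def mutual_information(audio, video, n_bins=8):
    """Histogram (plug-in) estimate of I(A; V) in bits.

    Assumption: `audio` and `video` are aligned 1-D feature sequences
    of equal length (one value per frame).
    """
    if len(audio) != len(video):
        raise ValueError("audio and video features must be aligned")
    a = quantize(audio, n_bins)
    v = quantize(video, n_bins)
    n = len(a)
    joint = Counter(zip(a, v))   # counts of (audio bin, video bin) pairs
    count_a = Counter(a)         # marginal counts for audio bins
    count_v = Counter(v)         # marginal counts for video bins
    mi = 0.0
    for (i, j), c in joint.items():
        # p(i, j) * log2( p(i, j) / (p(i) * p(j)) ), written with counts
        mi += (c / n) * math.log2(c * n / (count_a[i] * count_v[j]))
    return mi


def classify_mouths(audio, mouths, threshold=0.5, n_bins=8):
    """Label each candidate mouth "speaker" or "non-speaker".

    Assumption: a fixed threshold (in bits) replaces the hypothesis
    test of the thesis; its value here is arbitrary.
    """
    labels = {}
    for name, video in mouths.items():
        score = mutual_information(audio, video, n_bins)
        labels[name] = "speaker" if score > threshold else "non-speaker"
    return labels
```

Calling `classify_mouths(audio, {"left": left_features, "right": right_features})` returns one label per candidate mouth. The plug-in estimator is biased upwards for short sequences, so in practice the threshold would have to be chosen with the sequence length and the number of bins in mind.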
The framework is first applied with a particular emphasis on the audio modality: the information-theoretic feature extraction addresses the optimization of the audio features, using the video information jointly. As a result, audio features specific to speech production are produced. The system evaluation framework establishes that feeding these features to the classifier increases its discriminative power with respect to equivalent non-optimized features. The enhancement of the video content is then addressed more specifically. Mouth motion is the natural video representation for a task such as speaker detection. However, only an estimate of this motion, the optical flow, can be obtained. This estimation relies on the intensity gradient of the image sequence. Graph theory is used to establish a probabilistic model of the relationships between the audio, the motion and the image intensity gradient, in the particular case of a speaking mouth. The interpretation of this model leads back to the optimization function defined for the information-theoretic feature extraction. As a result, a scale-space approach is proposed for estimating the optical flow, in which the strength of the smoothness constraint is controlled via a mutual-information-based criterion involving both the audio and the video information. Initial results are promising, although more extensive tests remain to be carried out, in particular in noisy conditions. In conclusion, this thesis proposes a complete pattern recognition framework that is dedicated to audiovisual speaker detection and that minimizes the probability of misclassifying a mouth as "speaker" or "non-speaker". The importance of fusing the audio and video content as early as the feature level is demonstrated through the system evaluation stage included in the pattern recognition process.
