Visual feature analysis for audio-visual speech recognition

Humans perceive their surrounding environment in a multimodal manner, combining multi-sensory inputs in a coordinated way. Various studies in psychology and cognitive science indicate the multimodal nature of human speech production and perception. Such findings have motivated the development of the new and fast-growing research field of audio-visual speech processing. Extensive research to date shows that systems using both visual and acoustic data perform better than their conventional "blind" counterparts. The choice of suitable visual cues is of paramount importance in such systems. The aim of this thesis is to analyze visual features and evaluate their relevance from an automatic speech recognition point of view.
Speech-pertinent information is mainly conveyed in the mouth area, and a fundamental requirement is to obtain its representation in a parameterized form. The first objective of this work is therefore to address the problem of visual front-end design. This is a challenging issue that involves the complex computer vision tasks of face detection and facial feature tracking. After the speaker's face has been found, the mouth area must be accurately detected and subsequently tracked throughout a video sequence. In this thesis, a method for nose and mouth region-of-interest tracking has been developed for an application scenario restricted to audio-visual speech recognition. Nose tracking is accomplished by employing a template matching technique; the mouth region is then localized using knowledge of the geometric properties of the human face. The next objective of this work arises from the need for robust and accurate extraction of key lip points of interest. To accomplish this task, two novel algorithms are developed. The first finds four key lip points that enable normalization of mouth images. The second provides additional mouth shape descriptors and the basis for fitting labial contours.
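To illustrate the template matching step mentioned above, the following is a minimal sketch of tracking by sum-of-squared-differences matching in a local search window. The function name, the window size, and the grayscale-array representation are assumptions for illustration, not details taken from the thesis.

```python
import numpy as np

def track_template(frame, template, prev_pos, search_radius=8):
    """Locate `template` in grayscale `frame` near `prev_pos` by minimizing
    the sum of squared differences (SSD) over a local search window.
    `prev_pos` is the (row, col) of the template's top-left corner in the
    previous frame. All names and parameters here are illustrative."""
    th, tw = template.shape
    r0, c0 = prev_pos
    best_score, best_pos = np.inf, prev_pos
    for dr in range(-search_radius, search_radius + 1):
        for dc in range(-search_radius, search_radius + 1):
            r, c = r0 + dr, c0 + dc
            # Skip candidate positions that fall outside the frame.
            if r < 0 or c < 0 or r + th > frame.shape[0] or c + tw > frame.shape[1]:
                continue
            patch = frame[r:r + th, c:c + tw]
            score = np.sum((patch.astype(float) - template.astype(float)) ** 2)
            if score < best_score:
                best_score, best_pos = score, (r, c)
    return best_pos
```

In a real tracker one would typically use a normalized correlation measure (or a library routine such as OpenCV's template matching) for robustness to lighting changes; the SSD version above only conveys the basic idea of searching around the previous position.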
The emphasis of this study is on audio-visual speech recognition, which requires visual features suitable for classification. A question that immediately arises is which features are the most speech-salient. To date, various methods have been proposed that mainly rely on a priori defined rules: area-based features are selected using criteria inherited from image compression, whereas geometry-based methods use a set of cues deemed meaningful to human observers. There is no guarantee that such representations are optimal for automatic recognition of the uttered speech. The study in this thesis approaches this problem from a different angle by introducing basic information-theoretic concepts into visual feature analysis. The selection criterion employs mutual information as a measure of how informative visual cues are. The concept of mutual information eigenlips is introduced to reflect the most informative eigenfeatures. Visual features obtained using this information-theoretic framework are then used in audio-visual speech recognition experiments, and the results indicate improved system performance when such visual cues are applied. In conclusion, the proposed information-theoretic framework for visual feature selection can be applied in other audio-visual speech processing areas, including audio-visual speaker detection and speaker localization.
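The mutual-information criterion described above can be sketched as follows: estimate I(X; Y) between a quantized visual feature X (e.g. an eigenlip coefficient) and the speech class label Y, then rank features by this score. This is a generic histogram-based estimator for illustration; the binning scheme and function name are assumptions, not the thesis's exact procedure.

```python
import numpy as np

def mutual_information(feature, labels, n_bins=8):
    """Estimate I(X;Y) in bits between a continuous feature X, quantized
    into n_bins equal-width bins, and discrete class labels Y.
    A simple plug-in histogram estimator; illustrative only."""
    # Interior bin edges; np.digitize then maps values to bins 0..n_bins-1.
    edges = np.linspace(feature.min(), feature.max(), n_bins + 1)[1:-1]
    x = np.digitize(feature, edges)
    classes = {c: i for i, c in enumerate(sorted(set(labels)))}
    joint = np.zeros((n_bins, len(classes)))
    for xi, yi in zip(x, labels):
        joint[xi, classes[yi]] += 1
    joint /= joint.sum()                      # joint distribution p(x, y)
    px = joint.sum(axis=1, keepdims=True)     # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)     # marginal p(y)
    nz = joint > 0                            # avoid log(0)
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))
```

Given a matrix of candidate features, one would compute this score per column and keep the highest-scoring ones; a feature carrying no information about the class labels scores near zero, while a feature that determines the label scores close to the label entropy H(Y).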


Related material