Abstract

This paper describes a complete system for audio-visual recognition of continuous speech, including robust lip tracking, visual feature extraction, noise-robust acoustic feature extraction, and sensor integration. An appearance-based model of the articulators, which represents linguistically important features, is learned from example images and is used to locate, track, and recover visual speech information. We tackle the problem of joint temporal modeling of the acoustic and visual speech signals by applying multi-stream hidden Markov models. This approach allows the definition of different temporal topologies and levels of stream integration and hence enables temporal dependencies to be modeled more accurately than with traditional approaches. We present two different methods to learn the asynchrony between the two modalities and show how to incorporate it into the multi-stream models. The superior performance of the proposed system is demonstrated on a large multi-speaker database of continuously spoken digits.
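As a rough illustration of the multi-stream idea, the sketch below combines per-state log-likelihoods of an acoustic and a visual stream with fixed stream weights. It assumes single-Gaussian stream models and hypothetical weight values; the paper's actual formulation (state topologies, mixture models, and learned asynchrony) is more involved.

```python
import numpy as np
from scipy.stats import multivariate_normal

def multistream_log_likelihood(o_audio, o_visual, audio_gauss, visual_gauss,
                               w_audio=0.7, w_visual=0.3):
    """Weighted combination of per-stream log-likelihoods for one HMM state.

    Each stream is modeled here by a single Gaussian (mean, cov); real
    systems typically use Gaussian mixtures per state, and the stream
    weights would be tuned (e.g. to the acoustic noise level) rather
    than fixed as assumed here.
    """
    ll_a = multivariate_normal.logpdf(o_audio, *audio_gauss)
    ll_v = multivariate_normal.logpdf(o_visual, *visual_gauss)
    return w_audio * ll_a + w_visual * ll_v

# Toy example: 13-dim acoustic features, 10-dim visual features.
rng = np.random.default_rng(0)
audio_model = (np.zeros(13), np.eye(13))
visual_model = (np.zeros(10), np.eye(10))
score = multistream_log_likelihood(rng.normal(size=13), rng.normal(size=10),
                                   audio_model, visual_model)
print(score)
```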

Details