Audio-Visual Speech Modelling for Continuous Speech Recognition
This paper describes a complete system for audio-visual recognition of continuous speech, including robust lip tracking, visual feature extraction, noise-robust acoustic feature extraction, and sensor integration. An appearance-based model of the articulators, which represents linguistically important features, is learned from example images and is used to locate, track, and recover visual speech information. We tackle the problem of joint temporal modelling of the acoustic and visual speech signals by applying multi-stream hidden Markov models. This approach allows the definition of different temporal topologies and levels of stream integration, and hence makes it possible to model temporal dependencies more accurately than traditional approaches. We present two methods for learning the asynchrony between the two modalities and for incorporating it into the multi-stream models. The superior performance of the proposed system is demonstrated on a large multi-speaker database of continuously spoken digits.
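To illustrate the stream-integration idea behind multi-stream HMMs, the following is a minimal sketch of how a state's emission score can combine audio and visual streams via exponent weights, i.e. log b_j(o) = Σ_s λ_s · log b_js(o_s). It assumes simple one-dimensional Gaussian emissions per stream; all names and parameter values are illustrative and not taken from the paper.

```python
import math

def gaussian_loglik(x, mean, var):
    """Log density of a 1-D Gaussian emission."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def combined_loglik(obs, stream_params, stream_weights):
    """Multi-stream emission score: weighted sum of per-stream
    log-likelihoods, log b_j(o) = sum_s lambda_s * log b_js(o_s)."""
    return sum(w * gaussian_loglik(o, m, v)
               for o, (m, v), w in zip(obs, stream_params, stream_weights))

# Illustrative use: audio stream weighted more heavily than video,
# as one might do under clean acoustic conditions.
audio_video_obs = [0.1, -0.2]          # one feature per stream
state_params = [(0.0, 1.0), (0.0, 1.0)]  # (mean, var) per stream
score = combined_loglik(audio_video_obs, state_params, [0.7, 0.3])
```

Varying the weights trades off the reliability of each modality; with weight 1.0 on one stream and 0.0 on the other, the score reduces to a single-stream HMM emission.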