How does a dictation machine recognize speech?

There is magic (or is it witchcraft?) in a speech recognizer that transcribes continuous radio speech into text with a word accuracy of even not more than 50%. The extreme difficulty of this task, tough, is usually not perceived by the general public. This is because we are almost deaf to the infinite acoustic variations that accompany the production of vocal sounds, which arise from physiological constraints (co-articulation), but also from the acoustic environment (additive or convolutional noise, Lombard effect), or from the emotional state of the speaker (voice quality, speaking rate, hesitations, etc.)46. Our consciousness of speech is indeed not stimulated until after it has been processed by our brain to make it appear as a sequence of meaningful units: phonemes and words. In this Chapter we will see how statistical pattern recognition and statistical sequence recognition techniques are currently used for trying to mimic this extraordinary faculty of our mind (4.1). We will follow, in Section 4.2, with a MATLAB-based proof of concept of word-based automatic speech recognition (ASR) based on Hidden Markov Models (HMM), using a bigram model for modeling (syntactic-semantic) language constraints.

Related material