IDIAP HMM/HMM2 System: Theoretical Basis and Software Specifications

State-of-the-art Automatic Speech Recognition (ASR) systems make extensive use of Hidden Markov Models (HMMs), which offer flexible statistical modeling, powerful optimization (training) techniques and efficient recognition algorithms. When the software implementation allows it, this flexibility can also be fully exploited in research, by testing various topologies, acoustic units, parameterization schemes, etc. Unfortunately, HMM systems remain overly sensitive to the variability generally observed in real acoustic environments, including speaker, channel and noise characteristics. In an attempt to tackle this problem, IDIAP recently introduced a new form of HMM, referred to as HMM2, with numerous potential advantages that could improve the robustness of current speech recognition systems. HMM2 can be described as a mixture of HMMs in which the HMM emission probabilities (usually estimated by Gaussian mixtures or a neural network) are themselves estimated by state-specific HMMs working along the acoustic feature vector. Among other properties, it is believed that such an HMM2 approach could better model the time/frequency structure of speech, including the correlation between features. After a brief review of HMM theory, this report first introduces the theoretical basis of HMM2, including its parameterization schemes and the estimation of its parameters through a generalized form of the Expectation-Maximization (EM) training algorithm. The report then describes the functionality and specifications of a new software package able to handle, in a flexible way, different forms of HMM and HMM2 topologies and training schemes.
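As a rough illustration of the central idea, the emission log-likelihood of one acoustic feature vector under a state-specific (inner) HMM can be sketched with the standard forward algorithm run along the feature index rather than along time. This is a minimal sketch, not the IDIAP software: the function names, the argument layout, and the choice of one scalar Gaussian per inner state are assumptions made only for this example.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_gauss(x, mean, var):
    """Log-density of a scalar Gaussian (assumed inner-state emission model)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def inner_hmm_log_likelihood(feature_vector, log_init, log_trans, means, variances):
    """Forward algorithm along the components of a single feature vector.

    Hypothetical layout: log_init[s] is the log initial probability of inner
    state s, log_trans[r][s] the log transition probability from r to s, and
    means[s] / variances[s] the scalar Gaussian parameters of inner state s.
    The return value plays the role of log p(x_t | outer state).
    """
    n_states = len(log_init)
    # Initialise with the first feature component.
    alpha = [
        log_init[s] + log_gauss(feature_vector[0], means[s], variances[s])
        for s in range(n_states)
    ]
    # Recurse over the remaining components (feature index, not time).
    for x in feature_vector[1:]:
        alpha = [
            log_sum_exp([alpha[r] + log_trans[r][s] for r in range(n_states)])
            + log_gauss(x, means[s], variances[s])
            for s in range(n_states)
        ]
    # Sum over all final inner states.
    return log_sum_exp(alpha)
```

In a full HMM2, each state of the outer (temporal) HMM would own one such inner model, and the value returned here would replace the Gaussian-mixture or neural-network emission score in the usual temporal forward or Viterbi recursion.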
