Improving Speech Recognition Using a Data-Driven Approach

In this paper, we investigate the possibility of enhancing state-of-the-art HMM-based speech recognition systems using data-driven techniques, where whole set of training utterances is used as reference models and recognition is then performed through the well-known template matching technique, DTW. This approach allows us to better capture the temporal dynamics of the speech signal while avoiding some of the HMM assumptions such as the piecewise stationarity. Potentially, such data-driven techniques also allow us to better exploit meta-data and environmental information, such as speaker, gender, accent and noise conditions. However, we cannot entirely abandon HMMs, which are very powerful and scalable models. Thus, we investigate one way to combine and take advantage of both the approaches, combining scores of HMMs and reference templates. Experiments on the Numbers95 database showed that this combination yields 22% relative improvement in word error rate over the baseline HMM performance. Applying K-means clustering to the acoustic vectors speeds up the decoding, while still retaining a significant improvement in the recognition accuracy.

Related material