Abstract

In this work, we propose different strategies for efficiently integrating an automated speech recognition module into a dialogue-based vocal system. The aim is to study different ways of improving the quality and robustness of recognition.

We first concentrate on the choice of the type of acoustic models to be used for speech recognition. Our goal is to evaluate the hypothesis that hybrid acoustic models, in which frame-based phoneme probabilities are estimated by artificial neural networks, provide performance similar to that of "classical" hidden Markov models using multi-Gaussian estimation, while generalizing more robustly across tasks. We show experimentally that, due to the size of the parameter space to be explored, it is not always practically possible to achieve performance comparable to that of multi-Gaussian models, and that hybrid models in fact often lead to worse recognition performance.

In a second part, we focus on one of the main limitations of state-of-the-art speech recognition: the frequent inability of the one-best approach to yield a hypothesis corresponding to the correct transcription. To address this, we explore the solution consisting in producing, during acoustic decoding, a word lattice containing a very large number of hypotheses, which is then filtered by a syntactic analyzer relying on more sophisticated syntactic models, such as stochastic context-free grammars. The goal of this approach is to yield syntactically correct hypotheses for further processing. More precisely, we study the approach consisting in dynamically tuning the relative importance of the acoustic and language models, which increases the lexical and syntactic variability in the word lattice. We identify and experimentally quantify two important drawbacks of this approach: its high computational cost and the impossibility of guaranteeing that, in practice, the correct solution is indeed present in the lattice.

Finally, we study the problem of the inadequacy of generic linguistic resources (language models and phonetic lexica) for producing robust and efficient recognition results. In this context, we explore the solution consisting in the integration of dynamic phonetic and language models controlled by an associated dialogue model. In this approach, restricted lexica and language models that depend on the dialogue context are used in place of the complete ones. We first verify experimentally that this approach indeed yields a significant increase in speech recognition performance, and we then focus on the problem of producing, for a given application, an adequate dialogue model that can efficiently integrate the speech recognition module. In this perspective, we propose an enhancement of the dialogue model prototyping methodology by integrating speech recognition error simulation within the Wizard-of-Oz dialogue simulation. We show that such an approach enables more complete prototyping of the dialogue model and ensures a better fit between the resulting dialogue model and the targeted vocal application.
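The tuning of the relative importance of acoustic and language models mentioned above is conventionally realized as a log-linear combination of the two scores. As an illustration only (the abstract does not specify an implementation, and all identifiers below are hypothetical), the following Python sketch shows how a tunable language model scale factor controls this trade-off when re-ranking hypotheses extracted from a word lattice: lowering the scale factor weakens the language model's influence and lets more lexically and syntactically varied hypotheses survive near the top.

    from dataclasses import dataclass

    @dataclass
    class Hypothesis:
        words: list[str]
        acoustic_logprob: float  # log P(X | W), from the acoustic models
        lm_logprob: float        # log P(W), from the language model

    def rescore(hyps, lm_scale=10.0, word_penalty=0.0):
        """Rank hypotheses by acoustic score + lm_scale * LM score
        + word_penalty * hypothesis length. A smaller lm_scale reduces
        the language model's weight, increasing the variability of the
        top-ranked hypotheses."""
        def score(h):
            return (h.acoustic_logprob
                    + lm_scale * h.lm_logprob
                    + word_penalty * len(h.words))
        return sorted(hyps, key=score, reverse=True)

    if __name__ == "__main__":
        hyps = [
            Hypothesis(["book", "a", "flight"], -120.0, -8.0),
            Hypothesis(["book", "the", "flight"], -122.0, -6.5),
        ]
        for h in rescore(hyps, lm_scale=10.0):
            print(" ".join(h.words))

In practice the selected hypotheses would then be passed to the syntactic analyzer described above, which keeps only those accepted by the stochastic context-free grammar.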
