Towards using slide information to enhance speech transcription of meetings

In this paper we investigate whether the speech recognition performance on meeting recordings can be improved by using slides captured during the recording process. The key hypothesis exploited in this work is that slides and speech carry correlated contextual and semantic information. We therefore propose an approach that uses information extracted from the slides to reduce the speech recognition word error rate. The N-best lists output by the recogniser are rescored with Information Retrieval techniques so as to maximise the similarity between the speech and slide transcripts. Results obtained on three meeting recordings (about 90 minutes in total) show no statistically significant change in the word error rate. Additional analyses of language properties and of word-distribution statistics in the two sources provide further insight into this result.
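The rescoring idea can be illustrated with a minimal sketch. It assumes TF-IDF weighting with cosine similarity as the Information Retrieval measure, and a linear interpolation between a recogniser score (assumed already normalised to [0, 1]) and the slide similarity. The function names and the `weight` parameter are hypothetical illustrations. This is not the authors' implementation.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build TF-IDF weight dictionaries for a list of tokenised documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Smoothed IDF: terms that appear in every document still get a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rescore_nbest(nbest, slide_text, weight=0.5):
    """Re-rank (hypothesis, asr_score) pairs using similarity to the slide text.

    Assumption: asr_score is normalised to [0, 1] so that it can be linearly
    interpolated with the cosine similarity; `weight` is a hypothetical knob.
    """
    hyps = [h.lower().split() for h, _ in nbest]
    slide = slide_text.lower().split()
    # IDF statistics are computed over the N-best hypotheses plus the slide.
    vecs = tf_idf_vectors(hyps + [slide])
    slide_vec = vecs[-1]
    rescored = []
    for (hyp, asr_score), vec in zip(nbest, vecs[:-1]):
        sim = cosine(vec, slide_vec)
        rescored.append((hyp, (1 - weight) * asr_score + weight * sim))
    # Highest combined score first.
    return sorted(rescored, key=lambda p: p[1], reverse=True)


nbest = [("the speech wreck ignition system", 0.55),
         ("the speech recognition system", 0.45)]
ranked = rescore_nbest(nbest, "Speech recognition system architecture")
print(ranked[0][0])
```

In this toy example the slide vocabulary lifts "the speech recognition system" above the recogniser's original top hypothesis. With `weight=0` the original recogniser ranking is kept unchanged.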
