Abstract

We study subword segmentation approaches, which are widely used to address the open-vocabulary problem, in the context of end-to-end automatic speech recognition (ASR). For morphologically rich languages such as German, which has many rare words mainly due to compounding, there is increasing interest in subword-level word representations based on, e.g., byte-pair encoding and the unigram language model. However, we are not aware of any systematic comparative analysis of these approaches. To this end, we propose a framework that estimates a difficulty score for each test utterance with respect to the ASR model, based on an out-of-vocabulary metric. Using this framework, we run experiments on several subword segmentation approaches and obtain a comparative analysis of their strengths and weaknesses. For the ASR model, we employ a fully convolutional sequence-to-sequence encoder architecture using time-depth separable convolution blocks and lexicon-free beam-search decoding with an n-gram subword language model. Additionally, we leverage multiple models with different word representations to investigate their impact on ASR performance.
