XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models
Self-supervised pretrained models exhibit competitive performance in automatic speech recognition (ASR) upon fine-tuning, even with limited in-domain supervised data. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce the XLSR-Transducer, where the XLSR-53 model is used as the encoder in a transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves a 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer trained from scratch. To enable streaming capabilities, we investigate different attention masking patterns in the self-attention computation of the transformer layers within the XLSR-53 model. We validate the XLSR-Transducer on AMI and five languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we reduce the left context by half while achieving a relative 12% improvement in WER.
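The streaming setup described above can be illustrated with a small sketch: a chunked self-attention mask in which each frame sees its own chunk, a limited number of left-context chunks, and a few initial "attention sink" frames that stay visible regardless of the left context. This is a minimal, hypothetical illustration of the masking pattern the abstract refers to, not the paper's actual implementation; the function name and parameters are assumptions.

```python
import numpy as np

def streaming_attention_mask(num_frames, chunk_size, left_chunks, num_sinks=0):
    """Boolean mask: mask[i, j] is True when query frame i may attend to key frame j.

    Each frame attends to all frames in its own chunk, to `left_chunks`
    preceding chunks, and optionally to the first `num_sinks` frames
    ("attention sinks"). Hypothetical helper, for illustration only.
    """
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for i in range(num_frames):
        chunk = i // chunk_size
        # Left edge of the visible window: `left_chunks` chunks back.
        start = max(0, (chunk - left_chunks) * chunk_size)
        # Right edge: end of the current chunk (no future context).
        end = min((chunk + 1) * chunk_size, num_frames)
        mask[i, start:end] = True
        # Sink frames at the very start remain visible to every frame.
        mask[i, :num_sinks] = True
    return mask

# Example: 8 frames, chunks of 2, one left-context chunk, one sink frame.
m = streaming_attention_mask(8, chunk_size=2, left_chunks=1, num_sinks=1)
print(m.astype(int))
```

With sinks in place, the left context can be shortened (here to a single chunk) while the first frames remain attendable from anywhere in the utterance, which is the trade-off the abstract's final result exploits.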
École Polytechnique Fédérale de Lausanne
Institut Dalle Molle d'Intelligence Artificielle Perceptive
École Polytechnique Fédérale de Lausanne
Institut Dalle Molle d'Intelligence Artificielle Perceptive
Institut Dalle Molle d'Intelligence Artificielle Perceptive
Institut Dalle Molle d'Intelligence Artificielle Perceptive
Uniphore
Uniphore
2025
Piscataway, NJ USA
9798350368741
REVIEWED
EPFL
Event name | Event acronym | Event place | Event date
ICASSP 2025 | ICASSP | Hyderabad, India | 2025-04-06 - 2025-04-11