Title: XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models
Authors: Kumar, Shashi; Madikeri, Srikanth; Zuluaga Gomez, Juan Pablo; Villatoro-Tello, Esaú; Thorbecke, Iuliia; Motlicek, Petr; Manjunath, K. E.; Ganapathiraju, Aravind
Dates: 2025-05-12; 2025-05-12; 2025-05-09; 2025
DOI: 10.1109/ICASSP49660.2025.10888110
Scopus ID: 2-s2.0-105003883075
URL: https://infoscience.epfl.ch/handle/20.500.14299/250009
Language: en
Keywords: self-supervised learning; streaming ASR; transformer transducer; XLSR
Type: text::conference output::conference proceedings::conference paper

Abstract: Self-supervised pretrained models achieve competitive performance in automatic speech recognition (ASR) after fine-tuning, even with limited in-domain supervised data. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce the XLSR-Transducer, in which the XLSR-53 model is used as the encoder in a transducer setup. Our experiments on the AMI dataset show that the XLSR-Transducer achieves a 4% absolute WER improvement over Whisper large-v2 and an 8% improvement over a Zipformer transducer trained from scratch. To enable streaming, we investigate different attention masking patterns in the self-attention computation of the transformer layers within the XLSR-53 model. We validate the XLSR-Transducer on AMI and on five languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we reduce the left context by half while achieving a 12% relative improvement in WER.
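The two streaming ingredients named in the abstract, attention masking patterns and attention sinks, can be illustrated with a short sketch. The following is a minimal, hypothetical reconstruction in PyTorch: the function name streaming_attention_mask and its parameters (chunk_size, left_chunks, num_sink_frames) are illustrative assumptions, not the authors' released code; the paper only states that such masking patterns and attention sinks were explored.

```python
# Illustrative sketch only: a chunked streaming attention mask with optional
# attention-sink frames, as one plausible instance of the masking patterns
# the abstract mentions. Not the paper's exact implementation.
import torch

def streaming_attention_mask(num_frames: int,
                             chunk_size: int,
                             left_chunks: int,
                             num_sink_frames: int = 0) -> torch.Tensor:
    """Boolean mask of shape (num_frames, num_frames); True = may attend.

    Each frame attends to all frames within its own chunk, to up to
    `left_chunks` preceding chunks, and, optionally, to the first
    `num_sink_frames` frames ("attention sinks"), which allows shrinking
    the left context without losing the initial anchor positions.
    """
    idx = torch.arange(num_frames)
    chunk = idx // chunk_size            # chunk index of each frame
    q = chunk.unsqueeze(1)               # query chunk index, column vector
    k = chunk.unsqueeze(0)               # key chunk index, row vector
    # Attend within the current chunk and up to `left_chunks` chunks back.
    mask = (k <= q) & (k >= q - left_chunks)
    # Attention sinks: every query may also attend to the first few frames.
    if num_sink_frames > 0:
        mask[:, :num_sink_frames] = True
    return mask

# Example: 12 frames, chunks of 4, one left chunk, 2 sink frames.
m = streaming_attention_mask(12, chunk_size=4, left_chunks=1, num_sink_frames=2)
print(m.int())
```

Under this sketch, the abstract's final result corresponds to lowering left_chunks (halving the left context) while keeping a small num_sink_frames, which the paper reports yields a 12% relative WER improvement; the code only shows the mask shape such a configuration would produce, applied inside the self-attention of the XLSR-53 transformer layers.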