Structured pruning for efficient systolic array accelerated cascade Speech-to-Text Translation
We present in this paper a simple method for pruning tiles of weights to obtain sparse matrices, which does not require fine-tuning or retraining. The method is applied here to the feed-forward layers of transformers. In a first experiment, we assess the impact of such pruning on the performance of speech recognition, machine translation, and cascaded speech-to-text translation on the MuST-C dataset, for the English-to-French direction. Depending on the size of the pruned tiles (from 4x4 to 32x32), we observe that pruning rates of 15 to 40% for speech recognition and of 40 to 70% for machine translation are feasible at the cost of a 10% performance degradation. Applying this pruning method to the systolic array accelerated version of the cascade speech-to-text translation system yields speedups of up to 74x compared to the non-accelerated system. Energy consumption also benefits from structured pruning, with a reduction of up to 35%.
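To make the tile-pruning idea concrete, here is a minimal sketch of how structured pruning of a feed-forward weight matrix could look. It assumes magnitude-based tile selection (ranking tiles by their L1 norm and zeroing the lowest-scoring fraction); the abstract does not specify the selection criterion, so the `prune_tiles` function, its parameters, and the scoring rule are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def prune_tiles(weight: np.ndarray, tile: int = 16, prune_rate: float = 0.4) -> np.ndarray:
    """Zero out the lowest-magnitude (tile x tile) blocks of a weight matrix.

    Hypothetical illustration: tiles are ranked by L1 norm and roughly the
    `prune_rate` fraction with the smallest norms is set to zero.
    """
    rows, cols = weight.shape
    assert rows % tile == 0 and cols % tile == 0, "matrix must divide into tiles"

    # View the matrix as a grid of (tile x tile) blocks and score each block.
    blocks = weight.reshape(rows // tile, tile, cols // tile, tile)
    scores = np.abs(blocks).sum(axis=(1, 3))  # L1 norm per tile

    # Threshold chosen so that about `prune_rate` of the tiles fall below it.
    k = int(prune_rate * scores.size)
    threshold = np.partition(scores.ravel(), k)[k] if k > 0 else -np.inf
    mask = (scores >= threshold).astype(weight.dtype)  # 1 = keep, 0 = prune

    pruned = blocks * mask[:, None, :, None]
    return pruned.reshape(rows, cols)

# Example: prune ~40% of 16x16 tiles in a feed-forward projection matrix.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((1024, 4096)).astype(np.float32)
    W_pruned = prune_tiles(W, tile=16, prune_rate=0.4)
    print("fraction of zero weights:", float((W_pruned == 0).mean()))
```

Because whole tiles are zeroed rather than individual weights, the resulting sparsity pattern maps naturally onto tile-wise computation such as a systolic array, where entire zero tiles can be skipped.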