Systolic Arrays and Structured Pruning Co-design for Efficient Transformers in Edge Systems

Palacios Almendros, Pedro; Medina Morillas, Rafael; Rouas, Jean-Luc; Ansaloni, Giovanni; Atienza Alonso, David

doi:10.1145/3716368.3735158

conference poster

2025

GLSVLSI '25: Proceedings of the Great Lakes Symposium on VLSI 2025

35th Great Lakes Symposium on VLSI

Efficient deployment of resource-intensive transformers on edge devices requires cross-stack optimization. We therefore study the interplay between structured pruning and systolic acceleration, matching the size of pruned blocks to the systolic array dimensions. In this setting, computations on pruned weight blocks can be skipped, reducing run time and energy consumption but potentially degrading quality of service (QoS). To evaluate the trade-offs between systolic array size and sparsity opportunities, we present a novel co-design framework that integrates algorithmic optimization, system simulation, and hardware design. Using transformers for speech recognition and machine translation as a case study, we analyze how configuration choices across the stack affect performance metrics. Results demonstrate that structured pruning on systems featuring systolic array acceleration can effectively increase performance while maintaining high QoS levels. System-wide speedups of up to 44% due to structured pruning and quantization were measured, with only 1.4% word error rate degradation on the standard LibriSpeech dataset.
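
The block-skipping idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' framework: the function names (`prune_blocks`, `tiled_matvec`), the L1-norm pruning criterion, the `keep_ratio` parameter, and the toy 4x4 matrix are all assumptions made for illustration. The sketch only shows why aligning pruned blocks with the array tile size lets whole tile computations be skipped.

```python
def prune_blocks(W, tile, keep_ratio):
    """Zero the tile x tile blocks of W with the smallest L1 norm.

    The magnitude-based criterion is an assumption for this sketch;
    the abstract does not state which criterion the authors use.
    """
    rows, cols = len(W), len(W[0])
    blocks = []
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            norm = sum(abs(W[i][j])
                       for i in range(r, min(r + tile, rows))
                       for j in range(c, min(c + tile, cols)))
            blocks.append((norm, r, c))
    blocks.sort()
    n_prune = len(blocks) - int(round(keep_ratio * len(blocks)))
    pruned = [row[:] for row in W]
    for _, r, c in blocks[:n_prune]:
        for i in range(r, min(r + tile, rows)):
            for j in range(c, min(c + tile, cols)):
                pruned[i][j] = 0.0
    return pruned


def tiled_matvec(W, x, tile):
    """Compute y = W @ x tile by tile, skipping all-zero weight tiles.

    Each non-zero tile stands in for one pass through a hypothetical
    tile x tile systolic array; skipped tiles cost no array time.
    """
    rows, cols = len(W), len(W[0])
    y = [0.0] * rows
    executed = skipped = 0
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            r_end, c_end = min(r + tile, rows), min(c + tile, cols)
            if all(W[i][j] == 0.0
                   for i in range(r, r_end) for j in range(c, c_end)):
                skipped += 1
                continue
            executed += 1
            for i in range(r, r_end):
                y[i] += sum(W[i][j] * x[j] for j in range(c, c_end))
    return y, executed, skipped


# Toy weights: two dominant 2x2 blocks on the diagonal, small values elsewhere.
W = [[1.0, 2.0, 0.1, 0.1],
     [3.0, 4.0, 0.1, 0.1],
     [0.2, 0.1, 5.0, 6.0],
     [0.1, 0.1, 7.0, 8.0]]
x = [1.0, 1.0, 1.0, 1.0]

pruned = prune_blocks(W, tile=2, keep_ratio=0.5)
y, executed, skipped = tiled_matvec(pruned, x, tile=2)
print(y, executed, skipped)
```

In this toy case half of the tiles are skipped, and the output differs slightly from the dense result because the small off-diagonal weights were removed. That mirrors, in miniature, the speedup versus QoS trade-off the abstract describes.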