Files

Résumé

Transformer models have achieved impressive results in various AI scenarios, ranging from vision to natural language processing. However, their computational complexity and their vast number of parameters hinder their implementations on resource-constrained platforms. Furthermore, while loosely-coupled hardware accelerators have been proposed in the literature, data transfer costs limit their speed-up potential. We address this challenge along two axes. First, we introduce tightly-coupled, small-scale systolic arrays (TiC-SATs), governed by dedicated ISA extensions, as dedicated functional units to speed up execution. Then, thanks to the tightly-coupled architecture, we employ software optimizations to maximize data reuse, thus lowering miss rates across cache hierarchies. Full system simulations across various BERT and VisionTransformer models are employed to validate our strategy, resulting in substantial application-wide speed-ups (e.g., up to 89.5X for BERT-large). TiC-SAT is available as an open-source framework.

Détails

PDF