An Associativity-Agnostic in-Cache Computing Architecture Optimized for Multiplication

Rios, Marco Antonio; Simon, William Andrew; Levisse, Alexandre Sébastien Julien; Zapater Sancho, Marina; Atienza Alonso, David

doi:10.1109/VLSI-SoC.2019.8920317

conference paper

October 9, 2019

2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC)

With the spread of cloud services and the Internet of Things, analytics based on machine learning and artificial intelligence are becoming part of everyday life. However, deploying these data-intensive services efficiently requires performing computations closer to the edge. In this context, in-cache computing based on bitline computing is a promising way to execute data-intensive algorithms energy-efficiently, by reducing data movement in the cache hierarchy and exploiting data parallelism. Nevertheless, previous in-cache computing architectures suffer from serious circuit-level deficiencies (low bitcell density, data corruption risks, and limited performance) and therefore report high latency for multiplication, a key operation in machine learning and deep learning. Moreover, no previous work addresses the issue of way misalignment, which strongly constrains data placement if performance gains are to be preserved. In this work, we drastically improve the previously proposed BLADE architecture for in-cache computing so that it efficiently supports multiplication: we enhance the local bitline circuitry, enabling associativity-agnostic operations as well as in-place shifting inside local bitline groups. We implemented and simulated the proposed architecture in TSMC 28 nm bulk CMOS technology, validating its functionality and extracting its performance, area, and energy per operation. We then designed a behavioral model of the proposed architecture to assess its performance with respect to the latest BLADE architecture. We show a 17.5% area reduction and a 22% energy reduction thanks to the proposed LG optimization. Finally, for 16-bit multiplication, we demonstrate gains of 44% in cycle count, 47% in energy, and 41% in performance versus BLADE, and show that 4 embedded shifts offer the best trade-off between energy, area, and performance.

Name: VLSI-SoC19 Rios .pdf

Type: Preprint

Version: http://purl.org/coar/version/c_71e4c1898caa6e32

Access type: openaccess

Size: 4.96 MB

Format: Adobe PDF

Checksum (MD5): db695385e1de7fd09fab664f79a34ea4