Infoscience

conference paper

An Associativity-Agnostic in-Cache Computing Architecture Optimized for Multiplication

Rios, Marco Antonio • Simon, William Andrew • Levisse, Alexandre Sébastien Julien • et al.
October 9, 2019
2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC)

With the spread of cloud services and the Internet of Things, analytics based on machine learning and artificial intelligence are becoming commonplace in everyday life. However, deploying these data-intensive services efficiently requires performing computations closer to the edge. In this context, in-cache computing based on bitline computing is a promising way to execute data-intensive algorithms energy-efficiently, because it reduces data movement in the cache hierarchy and exploits data parallelism. Nevertheless, previous in-cache computing architectures suffer from serious circuit-level deficiencies (namely low bitcell density, data corruption risks, and limited performance) and therefore report high latency for multiplication, a key operation in machine learning and deep learning. Moreover, no previous work addresses way misalignment, which strongly constrains data placement if performance gains are to be preserved. In this work, we drastically improve the previously proposed BLADE architecture for in-cache computing so that it supports multiplication efficiently: we enhance the local bitline circuitry, enabling associativity-agnostic operations as well as in-place shifting inside local bitline groups. We implemented and simulated the proposed architecture in TSMC 28 nm bulk CMOS technology, validating its functionality and extracting its performance, area, and energy per operation. We then designed a behavioral model of the proposed architecture to assess its performance against the latest BLADE architecture. We show a 17.5% area reduction and a 22% energy reduction thanks to the proposed local group (LG) optimization. Finally, for 16-bit multiplication, we demonstrate gains of 44% in cycle count, 47% in energy, and 41% in performance versus BLADE, and show that four embedded shifts give the best trade-off between energy, area, and performance.
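The abstract does not spell out the multiplication algorithm itself. As a rough illustration of why in-place shifting matters for multiplication, the sketch below shows the textbook shift-and-add scheme, which builds a product from additions and one-bit left shifts only. It is a software illustration under stated assumptions, not the paper's circuit or its cycle model: the function name `shift_add_multiply` and the operation tallies it returns are hypothetical and introduced here for illustration.

```python
from typing import Tuple


def shift_add_multiply(a: int, b: int, width: int = 16) -> Tuple[int, int, int]:
    """Multiply two unsigned `width`-bit integers using only shifts and adds.

    Returns (product, add_count, shift_count). The counts are plain
    operation tallies for illustration; they are NOT the cycle counts
    reported in the paper.
    """
    limit = 1 << width
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must fit in `width` unsigned bits")

    product = 0
    adds = 0
    shifts = 0
    multiplicand = a
    for i in range(width):
        # Add the (shifted) multiplicand whenever bit i of the multiplier is set.
        if (b >> i) & 1:
            product += multiplicand
            adds += 1
        # Shift the multiplicand left by one position for the next bit.
        multiplicand <<= 1
        shifts += 1
    return product, adds, shifts


if __name__ == "__main__":
    print(shift_add_multiply(3, 5))
```

Every iteration needs a shift, so a 16-bit multiplication performs 16 shifts regardless of the operands. That is why the cost of each shift, and where it is carried out, weighs heavily on multiplication latency.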

Files

Name: VLSI-SoC19 Rios .pdf
Type: Preprint
Version: http://purl.org/coar/version/c_71e4c1898caa6e32
Access type: openaccess
Size: 4.96 MB
Format: Adobe PDF
Checksum (MD5): db695385e1de7fd09fab664f79a34ea4

Name: VLSI-SoC19 Rios final.pdf
Type: Publisher's Version
Version: http://purl.org/coar/version/c_970fb48d4fbd8a85
Access type: openaccess
Size: 4.96 MB
Format: Adobe PDF
Checksum (MD5): db695385e1de7fd09fab664f79a34ea4

Contact: infoscience@epfl.ch

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved