In-Memory Hardware and Architectural Extensions for Workloads Acceleration
The use of edge devices has exploded in the last decade, with use cases such as wearable devices, autonomous driving, and smart homes. As their ubiquity grows, so do expectations of their capabilities. At the same time, their form factor and use cases limit power availability. Thus, improving performance while limiting area and power consumption is paramount.
In this vein, in-SRAM Computing (iSC) moves computation from the CPU into the SRAM memory hierarchy. This has multiple benefits. First, reduced data movement lowers power consumption and latency. Second, the entire memory array can be utilized to perform hundreds of concurrent operations. This thesis exploits iSC while addressing the aforementioned challenges via a BitLine Accelerator for Devices on the Edge (BLADE). BLADE can be implemented in any SRAM system and utilizes local wordline groups to perform computations at a frequency 2.8x higher than state-of-the-art iSC architectures. BLADE is thoroughly simulated, fabricated, and benchmarked at the transistor, architecture, and software abstraction levels. Experimental results demonstrate performance/energy gains over an equivalent NEON-accelerated processor for a variety of edge device workloads, namely cryptography (4x performance gain/6x energy reduction), video encoding (6x/2x), and convolutional neural networks (3x/1.5x), while maintaining the highest frequency/energy ratio (up to 2.2 GHz at 1 V) of any conventional iSC architecture and a low area overhead of less than 8%.
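To make the idea of array-wide concurrent operations concrete, the following minimal sketch models the commonly described behavior of bitline computing in general: when two wordlines are activated together, each column's bitline pair yields the AND and the NOR of the two stored bits, so one access operates on every column at once. This is a software illustration only, not BLADE's circuit or interface; the function name `bitline_compute` and the 8-bit row width are assumptions chosen for the example.

```python
def bitline_compute(row_a, row_b, width):
    """Hypothetical model of activating two wordlines at once.

    Each bit position stands for one SRAM column; all columns are
    evaluated in the same step, which is the source of the parallelism.
    """
    mask = (1 << width) - 1
    bl = row_a & row_b             # bitline stays high only if both cells store 1
    blb = ~(row_a | row_b) & mask  # complementary bitline: high only if both store 0
    return bl, blb


WIDTH = 8  # assumed row width for illustration; real arrays are far wider
a, b = 0b11001010, 0b10100110

and_bits, nor_bits = bitline_compute(a, b, WIDTH)

# XOR can be derived from the two sensed values: a column differs exactly
# when it is neither "both 1" nor "both 0".
xor_bits = ~(and_bits | nor_bits) & ((1 << WIDTH) - 1)
```

In this model, widening `WIDTH` to the number of columns in an array is what turns a single access into hundreds of bitwise operations.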
With BLADE implemented, the possibilities for enhancement are manifold, with one such example being approximate computing. To this end, a CArryless Partial Product InExact Multiplier (CAPPIEM) halves multiplication latency while incurring negligible area overhead. As a standalone multiplier, CAPPIEM reduces area and power-delay product by 73% and 43%, respectively. Further, CAPPIEM has the unique property of computing exact results when one input is a Fibonacci-encoded value. This property is exploited via a retraining strategy that quantizes neural network weights to Fibonacci values, ensuring exact computation during inference. Benchmarking on SqueezeNet 1.0, DenseNet-121, and ResNet-18 demonstrates accuracy degradations of only 0.4%, 1.1%, and 1.7%, respectively, while improving training time by up to 300x.
A second BLADE enhancement is the use of Hybrid Caches (HCs) consisting of both SRAM and eNVRAM bitcells. HCs increase capacity and power savings via eNVRAM's small area footprint and low leakage energy. However, eNVRAMs also incur long write latency and limited endurance. To mitigate these drawbacks, this thesis presents SHyCache, an HC architecture and supporting programming model. By explicitly allocating variables with high read/write access ratios to the eNVRAM array, SHyCache reduces access time, power consumption, and area overhead, while maintaining maximal utilization efficiency and ease of programming. Benchmarks on a range of cache hierarchy variations using three deep neural networks reveal a design space that can be exploited to optimize performance, power consumption, or endurance, with maximum performance gains of 1.7x/1.4x/1.3x and power consumption reductions of 5.1x/5.2x/5.4x.
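The allocation principle described above, placing read-dominated variables in eNVRAM and write-heavy ones in SRAM, can be sketched as follows. This is a minimal illustration under stated assumptions, not SHyCache's actual programming model: the `VarProfile` type, the `place_variables` helper, the access counts, and the threshold of 10 reads per write are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VarProfile:
    """Hypothetical per-variable access profile (e.g. from a profiling run)."""
    name: str
    reads: int
    writes: int


def place_variables(profiles, min_ratio=10.0):
    """Assign each variable to 'envram' or 'sram' by its read/write ratio.

    min_ratio is an assumed tuning knob: a higher value sends fewer
    variables to eNVRAM, trading capacity and leakage savings for fewer
    slow, endurance-limited writes.
    """
    placement = {}
    for p in profiles:
        # Treat never-written variables as written once to avoid dividing by zero.
        ratio = p.reads / max(p.writes, 1)
        placement[p.name] = "envram" if ratio >= min_ratio else "sram"
    return placement


# Illustrative numbers only, not measurements from the thesis.
profiles = [
    VarProfile("weights", reads=1_000_000, writes=1),
    VarProfile("activations", reads=50_000, writes=50_000),
    VarProfile("bias", reads=10_000, writes=0),
]
placement = place_variables(profiles)
```

With these made-up profiles, the read-mostly `weights` and `bias` land in eNVRAM, while the frequently rewritten `activations` stay in SRAM.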