A Flexible In-Memory Computing Architecture for Heterogeneously Quantized CNNs
Inference using Convolutional Neural Networks (CNNs) is resource- and energy-intensive. Therefore, their execution on highly constrained edge devices demands careful co-optimization of algorithms and hardware. Addressing this challenge, in this paper we present a flexible In-Memory Computing (IMC) architecture and circuit, able to scale data representations to varying bitwidths at run-time, while ensuring a high level of parallelism and requiring low area. Moreover, we introduce a novel optimization heuristic, which tailors the quantization level of each CNN layer according to workload and robustness considerations. We investigate the performance, accuracy, and energy requirements of our co-design approach on CNNs of varying sizes, obtaining up to 76.2% increases in efficiency and up to 75.6% reductions in run-time with respect to fixed-bitwidth alternatives, with negligible accuracy degradation.
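To make the idea of heterogeneous per-layer quantization concrete, the following is a minimal sketch, not the paper's actual heuristic: it applies uniform symmetric quantization at a layer-specific bitwidth. The layer names and bitwidth assignments are illustrative assumptions, not values from the paper.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Uniform symmetric quantization of a tensor to the given bitwidth."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits
    scale = np.max(np.abs(x)) / qmax    # per-tensor scale factor
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                    # dequantized ("fake-quantized") values

# Hypothetical per-layer bitwidth assignment: the idea is that some layers
# tolerate lower precision than others before accuracy degrades.
layer_weights = {
    "conv1": np.random.randn(64, 3, 3, 3),
    "conv2": np.random.randn(128, 64, 3, 3),
}
bitwidths = {"conv1": 8, "conv2": 4}    # assumed values, for illustration only

quantized = {name: quantize_uniform(w, bitwidths[name])
             for name, w in layer_weights.items()}
```

In a mixed-precision setting like the one described, the per-layer bitwidths would be chosen by an optimization procedure trading off workload cost against accuracy robustness, rather than fixed by hand as above.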