The rapid advancement of modern neural networks has introduced considerable challenges for model deployment, particularly in edge artificial intelligence (AI) scenarios, where computational resources and energy budgets are severely constrained. Quantization mitigates this issue by approximating weights and/or activations with lower-bit representations, effectively reducing model size and power consumption. However, most existing hardware architectures support only uniform quantization, which assigns the same precision to every layer and neglects the layers' differing tolerance to quantization noise, thereby failing to fully exploit algorithmic robustness or to achieve optimal efficiency. Techniques such as heterogeneous (mixed-precision) quantization have been proposed, allowing fine-grained and structure-aware precision assignments (e.g., per-layer, per-channel, or per-block) that better balance accuracy and efficiency. While these methods exhibit strong algorithmic potential, they require hardware capable of variable-precision computation, including flexible control over data formats and arithmetic operations.
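To make the contrast concrete, the sketch below illustrates uniform versus per-layer mixed-precision quantization of weight tensors. It is a minimal illustration only: the layer names, tensor shapes, and bitwidth assignments are hypothetical and do not correspond to the networks or quantization schemes evaluated in this thesis.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantization of a weight tensor to the given bitwidth."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits
    scale = np.max(np.abs(w)) / qmax        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int32), scale        # integer codes + dequantization scale

# Uniform quantization: every layer receives the same precision.
layers = {"conv1": np.random.randn(64, 3, 3, 3),
          "conv2": np.random.randn(128, 64, 3, 3)}
uniform = {name: quantize_uniform(w, 8) for name, w in layers.items()}

# Mixed precision: per-layer bitwidths matched to each layer's tolerance
# to quantization noise (the assignment here is purely illustrative).
per_layer_bits = {"conv1": 8, "conv2": 4}
mixed = {name: quantize_uniform(w, per_layer_bits[name]) for name, w in layers.items()}
```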
To fill these gaps, this thesis proposes a variable-precision parallel computing paradigm and comprehensively investigates and realizes it across circuit design, microarchitecture exploration, system integration, and benchmark validation. The proposed approach fully leverages the robustness of quantized algorithms and their intrinsic data-level parallelism, thereby providing edge AI with a highly efficient and adaptable computing scheme.
This work first introduces the software-defined single instruction multiple data (Soft SIMD) methodology and its corresponding circuit implementation. The technique enables dynamic subword partitioning at runtime through instructions and control signals, effectively balancing operand precision and data-level parallelism within a fixed datapath width. Fundamental arithmetic units are then developed, supporting vector addition/subtraction/shift and vector-by-scalar multiplication through shift-add iterations, which are crucial for the dominant linear computations in machine learning (ML) workloads such as convolution and general matrix multiplication (GEMM). A two-stage pipeline microarchitecture is further presented and thoroughly explored, comprising an arithmetic computation stage and a data repacking stage that seamlessly manage and bridge operations across different precision levels. The hardware design is initially implemented in fixed-point data format and subsequently extended to the block floating point (BFP) format, which combines the dynamic range of floating-point representation with the efficiency and simplicity of fixed-point arithmetic. Experimental evaluations using heterogeneously quantized convolutional neural network (CNN) inference demonstrate that the proposed design achieves high computational efficiency while maintaining performance comparable to baseline approaches. Furthermore, this thesis identifies and leverages the scaling opportunity between operand bitwidth and operating voltage/frequency, highlighting an effective way to fully exploit the benefits of variable-precision arithmetic. Finally, the proposed arithmetic methodology is integrated into a computing-near-memory (CNM) solution at the DRAM bank level for comprehensive system performance evaluation, and it has been taped out on an open-source RISC-V platform for silicon validation.
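As a rough software model of the shift-add vector-by-scalar multiplication described above (a behavioral sketch, not the thesis RTL or microarchitecture), the following code packs several narrow unsigned operands into one wide word and multiplies them all by a shared scalar with one shift-add iteration per scalar bit. The lane widths, the pre-widening of the lanes in place of hardware guard bits, and the unsigned-only handling are simplifying assumptions.

```python
def pack(values, lane_bits):
    """Pack a list of unsigned subwords into one wide integer (lane 0 at the LSB)."""
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << lane_bits)
        word |= v << (i * lane_bits)
    return word

def unpack(word, lane_bits, lanes):
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask for i in range(lanes)]

def softsimd_vec_scalar_mul(values, scalar, in_bits, scalar_bits):
    """Multiply every packed subword by the same scalar via shift-add iterations.

    Inputs are first repacked into wider lanes (in_bits + scalar_bits) so the
    shifted partial sums never carry into a neighbouring lane, mimicking the
    guard bits introduced by a repacking stage."""
    out_bits = in_bits + scalar_bits
    word = pack(values, out_bits)            # repack into wider output lanes
    acc = 0
    for b in range(scalar_bits):             # one shift-add iteration per scalar bit
        if (scalar >> b) & 1:
            acc += word << b                 # a single wide add covers all lanes at once
    return unpack(acc, out_bits, len(values))

# Example: four 4-bit operands multiplied by one 4-bit scalar in parallel.
print(softsimd_vec_scalar_mul([3, 7, 12, 5], 11, in_bits=4, scalar_bits=4))
# -> [33, 77, 132, 55]
```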
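The block floating point idea can likewise be illustrated with a small numeric sketch: a block of values shares one exponent while each value keeps only an integer mantissa, so arithmetic within the block reduces to fixed-point operations. The mantissa width, rounding, and exponent choice below are illustrative assumptions, not the format parameters used in the thesis.

```python
import numpy as np

def to_bfp(block, mant_bits=8):
    """Quantize a block of floats to one shared exponent plus integer mantissas."""
    shared_exp = int(np.ceil(np.log2(np.max(np.abs(block)))))  # covers the largest value
    scale = 2.0 ** (shared_exp - (mant_bits - 1))
    mantissas = np.clip(np.round(block / scale),
                        -(1 << (mant_bits - 1)), (1 << (mant_bits - 1)) - 1).astype(np.int32)
    return mantissas, shared_exp

def from_bfp(mantissas, shared_exp, mant_bits=8):
    return mantissas * 2.0 ** (shared_exp - (mant_bits - 1))

block = np.array([0.75, -1.5, 0.02, 3.1])
m, e = to_bfp(block)
print(from_bfp(m, e))   # close to the original block; the smallest values lose precision
```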