The advent of the quantum computer makes current public-key infrastructure insecure. Cryptography community is addressing this problem by designing, efficiently implementing, and evaluating novel public-key algorithms capable of withstanding quantum computational power. Governmental agencies, such as NIST, are promoting standardization of quantum-resistant algorithms that is expected to run for 7 years. Several modern applications must maintain permanent data secrecy; therefore, they ultimately require the use of quantum-resistant algorithms. Because algorithms are still under scrutiny for eventual standardization, the deployment of the hardware implementation of quantum-resistant algorithms is still in early stages.
In this article, we propose a methodology to design programmable hardware accelerators for lattice-based algorithms, and we use the proposed methodology to implement flexible and energy efficient post-quantum cache-based accelerators for NewHope, Kyber, Dilithium, Key Consensus from Lattice (KCL), and R.EMBLEM submissions to the NIST standardization contest.
To the best of our knowledge, we propose the first efficient domain-specific, programmable cache-based accelerators for lattice-based algorithms. We design a single accelerator for a common kernel among various schemes with different kernel sizes, i.e., loop count, and data types. This is in contrast to the traditional approach of designing one special purpose accelerators for each scheme.
We validate our methodology by integrating our accelerators into an HLS-based SoC infrastructure based on the X86 processor and evaluate overall performance. Our experiments demonstrate the suitability of the approach and allow us to collect insightful information about the performance bottlenecks and the energy efficiency of the explored algorithms. Our results provide guidelines for hardware designers, highlighting the optimization points to address for achieving the highest energy minimization and performance increase. At the same time, our proposed design allows us to specify and execute new variants of lattice-based schemes with superior energy efficiency compared to the main application processor without changing the hardware acceleration platform. For example, we manage to reduce the energy consumption up to 2.1x and energy-delay product (EDP) up to 5.2x and improve the speedup up to 2.5x.