Software optimization for a RISC-V accelerator: A case study
Writing high-performance software today is a challenging task. In the past, CPU performance scaled steadily with advances in device technology, leaving developers to focus mainly on algorithmic optimization and overhead reduction. Since then, CPU performance scaling has slowed significantly, and the hardware world has compensated with increasingly application-specific hardware, often known as accelerators. Despite years of effort in automation, the burden still largely falls on the developer to write programs in a manner that takes advantage of this hardware. In compute-bound workloads, the penalty for not doing so is severe: for matrix multiplication, a nearly 1300x speedup was observed when a plain C program was optimized to properly use the cache, SIMD instructions, multiple cores, and so on. Worse, this work is often repeated for each custom hardware target with little reuse, creating a massive effort for the developer. This issue serves as the focal point of this report. We seek to understand the challenges associated with software development for accelerators and some of the solutions for automation proposed in the literature. This report is not intended to provide a comprehensive literature survey. Instead, we investigate the question through a practical case study on developing software for a simple dense matrix multiplication accelerator, focusing on solutions that address the most salient challenges encountered. Given the prominence of this type of hardware in accelerating popular applications such as deep learning [1, 2, 5, 6, 8], we hope the findings of this case study offer broader insight into the problem.
Open access. License: CC BY.