Software optimization for a RISC-V accelerator: A case study

Pit-Claudel, Clément; De Castelnau, Julien François
Dates: 2024-10-08; 2024-10-08; 2024-09-30; 2024-09-24
https://infoscience.epfl.ch/handle/20.500.14299/241491
Language: en
Type: student work (semester or other student projects)

Abstract: Writing high-performance software today is a challenging task. In the past, CPU performance scaled steadily with advances in device technology, leaving much of the work to algorithmic optimization and overhead reduction. Since then, CPU performance scaling has waned significantly, and the hardware world has compensated with increasingly application-specific hardware, often known as accelerators. Despite years of effort in automation, the burden by and large still falls on the developer to write programs in a manner that can take advantage of this hardware. In compute-bound workloads, the penalty for not doing so is severe: in the example of matrix multiplication, a nearly 1300x speedup was observed when optimizing a plain C program to properly exploit the cache, SIMD, multiple cores, and so on. Worst of all, the work required to write such programs is often repeated with little reuse for each custom hardware target, creating a massive effort for the developer. This issue serves as the focal point of this report. We seek to understand the challenges associated with software development for accelerators and some of the solutions for automation proposed in the literature. This report is not intended to be a comprehensive literature survey, however; instead, we investigate the question through a practical case study on developing software for a simple dense matrix-multiplication accelerator, focusing on solutions that address the most salient challenges encountered. Given the prominence of this type of hardware for accelerating popular applications such as deep learning [1, 2, 5, 6, 8], we hope the findings of this case study shed broader insight on the problem.
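To make the kind of optimization the abstract alludes to concrete, the sketch below contrasts a naive C matrix multiplication with a cache-blocked variant. It is a minimal illustrative example, not code from the report: the tile size BLK, the row-major double-precision layout, and the optional OpenMP pragma are all assumptions chosen for illustration, and further SIMD- and target-specific tuning would be needed to approach the speedups cited.

```c
#include <stddef.h>

/* Naive dense matrix multiplication: C = A * B, all n x n, row-major.
 * The innermost loop walks B column-wise, which is cache-unfriendly. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

/* Hypothetical tile size; in practice it is tuned to the target's caches. */
#define BLK 64

/* Cache-blocked (tiled) variant: works on BLK x BLK tiles so each tile's
 * working set stays in cache. Inside a tile, the i-k-j order gives
 * unit-stride access to B and C, which compilers can auto-vectorize (SIMD).
 * The caller must zero-initialize C. Compile with -fopenmp to also use
 * multiple cores; each (ii, jj) tile of C is owned by one thread. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for collapse(2)
    for (size_t ii = 0; ii < n; ii += BLK)
        for (size_t jj = 0; jj < n; jj += BLK)
            for (size_t kk = 0; kk < n; kk += BLK)
                for (size_t i = ii; i < ii + BLK && i < n; i++)
                    for (size_t k = kk; k < kk + BLK && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + BLK && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Even this single transformation (tiling plus a vectorization-friendly loop order and thread-level parallelism) illustrates why hand-optimizing such kernels for every cache size, vector width, and accelerator interface quickly becomes a large, repetitive effort.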