A Low-latency On-chip Cache Hierarchy for Load-to-use Stall Reduction in GPUs
The memory hierarchy in Graphics Processing Units (GPUs) is conventionally designed to provide high bandwidth rather than low latency. In particular, because of GPUs' high tolerance to load-to-use latency (i.e., the time that warps wait for data fetched by memory loads), GPU L1D caches are optimized for density, capacity, and low power, with latencies that are often orders of magnitude longer than those of conventional CPU caches. However, many important classes of data-parallel applications (e.g., graph, tree, and priority-queue processing, and sparse deep learning) suffer from inherent divergence and low effective Thread-Level Parallelism (TLP), and therefore benefit from lower load-to-use latency than modern GPUs offer. This paper introduces an innovative on-chip cache hierarchy that incorporates a low-latency decoupled L1D cache, LoTUS, together with its management scheme. LoTUS is a minimally sized, fully associative cache placed in each GPU subcore that captures the primary working set of data-parallel applications. It exploits conventional high-performance, low-density SRAM cells and dramatically reduces load-to-use latency. We also propose an intelligent extension of LoTUS, called LoTUSage, which employs a lightweight learning-based model to predict the utility of caching requests in LoTUS. Evaluation results show that LoTUS and LoTUSage improve the average performance by 23.9% and 35.4% and reduce the average energy consumption by 27.8% and 38.5%, respectively, for applications suffering from high load-to-use stalls, with negligible area and power overheads.
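To make the two mechanisms concrete, the sketch below models them in simulator-style C++ under stated assumptions: a tiny fully associative, LRU-managed cache standing in for the per-subcore LoTUS structure, and a PC-indexed table of saturating weights standing in for the LoTUSage utility predictor. The abstract does not specify the actual predictor features, capacity, or training rule, so every name here (`LotusCache`, `UtilityPredictor`, the 256-entry table, the 16-line capacity) is a hypothetical illustration, not the paper's published design.

```cpp
// Purely illustrative sketch; all structures, sizes, and the training
// policy are assumptions, not the paper's actual microarchitecture.
#include <cstdint>
#include <cstdio>
#include <list>
#include <unordered_map>

// Hypothetical lightweight learned filter: a small table of saturating
// weights indexed by a hash of the load PC. A non-negative weight predicts
// that caching the request in LoTUS will be useful.
class UtilityPredictor {
    static constexpr int kEntries = 256;  // assumed table size
    int8_t weights_[kEntries] = {0};
    static int index(uint64_t pc) { return static_cast<int>((pc >> 2) % kEntries); }
public:
    bool predict_useful(uint64_t pc) const { return weights_[index(pc)] >= 0; }
    void train(uint64_t pc, bool was_useful) {
        int8_t& w = weights_[index(pc)];
        if (was_useful && w < 31)   ++w;  // reinforce useful insertions
        if (!was_useful && w > -32) --w;  // punish lines evicted unused
    }
};

// Minimal fully associative, LRU-managed cache standing in for the
// per-subcore LoTUS structure; misses consult the predictor, so
// requests predicted to have low utility bypass the small cache.
class LotusCache {
    struct Line {
        std::list<uint64_t>::iterator lru_it;
        uint64_t insert_pc;  // load PC that allocated the line
        bool reused;         // hit again after insertion?
    };
    size_t capacity_;
    std::list<uint64_t> lru_;  // front = most recently used
    std::unordered_map<uint64_t, Line> lines_;
    UtilityPredictor predictor_;
public:
    explicit LotusCache(size_t capacity) : capacity_(capacity) {}

    bool access(uint64_t line_addr, uint64_t pc) {
        auto it = lines_.find(line_addr);
        if (it != lines_.end()) {  // hit: promote and record reuse
            lru_.splice(lru_.begin(), lru_, it->second.lru_it);
            it->second.reused = true;
            predictor_.train(it->second.insert_pc, true);
            return true;
        }
        if (!predictor_.predict_useful(pc)) return false;  // bypass LoTUS
        if (lines_.size() == capacity_) {  // evict LRU victim, learn from outcome
            const Line& victim = lines_.at(lru_.back());
            predictor_.train(victim.insert_pc, victim.reused);
            lines_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(line_addr);
        lines_[line_addr] = {lru_.begin(), pc, false};
        return false;  // miss; line now allocated
    }
};

int main() {
    LotusCache lotus(16);  // 16-line capacity is an assumption
    const uint64_t pcs[] = {0x400a10, 0x400b20};  // hypothetical load PCs
    for (int i = 0; i < 64; ++i) {
        uint64_t hot = (i % 8) * 64;  // small working set: high reuse
        bool hit = lotus.access(hot, pcs[0]);
        lotus.access(0x100000 + i * 64, pcs[1]);  // streaming: never reused
        if (i % 16 == 0) std::printf("iter %d hot-line hit=%d\n", i, hit);
    }
}
```

In this toy run, the predictor learns to keep inserting the hot, reused lines while driving the streaming loads' weight negative so they stop displacing useful data, which is the intuition behind filtering insertions into a minimally sized structure.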
École Polytechnique Fédérale de Lausanne
2025-08-18