A Low-latency On-chip Cache Hierarchy for Load-to-use Stall Reduction in GPUs
The memory hierarchy in Graphics Processing Units (GPUs) is conventionally designed to provide high bandwidth rather than low latency. In particular, because of GPUs' high tolerance to load-to-use latency (i.e., the time that warps wait for data fetched by memory loads), GPU L1D caches are optimized for density, capacity, and low power, with latencies that are often orders of magnitude longer than those of conventional CPU caches. However, many important classes of data-parallel applications (e.g., graph, tree, and priority-queue processing, and sparse deep learning) suffer from inherent divergence and low effective Thread-Level Parallelism (TLP), and therefore benefit from lower load-to-use latency than modern GPUs offer. This paper introduces an innovative on-chip cache hierarchy that incorporates a low-latency decoupled L1D cache, LoTUS, together with its management scheme. LoTUS is a minimally sized, fully associative cache placed in each GPU subcore that captures the primary working set of data-parallel applications. It exploits conventional high-performance, low-density SRAM cells and dramatically reduces load-to-use latency. We also propose an intelligent extension of LoTUS, called LoTUSage, which employs a lightweight learning-based model to predict the utility of caching requests in LoTUS. Evaluation results show that LoTUS and LoTUSage improve the average performance by 23.9% and 35.4% and reduce the average energy consumption by 27.8% and 38.5%, respectively, for applications suffering from high load-to-use stalls, with negligible area and power overheads.
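To make the two mechanisms concrete, the sketch below models them in simulator-style C++ under stated assumptions: a tiny fully associative, LRU-managed cache standing in for the per-subcore LoTUS structure, and a PC-indexed table of saturating weights standing in for the LoTUSage utility predictor. The abstract does not specify the actual predictor features, capacity, or training rule, so every name here (`LotusCache`, `UtilityPredictor`, the 256-entry table, the 16-line capacity) is a hypothetical illustration, not the paper's published design.

```cpp
// Purely illustrative sketch; all structures, sizes, and the training
// policy are assumptions, not the paper's actual microarchitecture.
#include <cstdint>
#include <cstdio>
#include <list>
#include <unordered_map>

// Hypothetical lightweight learned filter: a small table of saturating
// weights indexed by a hash of the load PC. A non-negative weight predicts
// that caching the request in LoTUS will be useful.
class UtilityPredictor {
    static constexpr int kEntries = 256;  // assumed table size
    int8_t weights_[kEntries] = {0};
    static int index(uint64_t pc) { return static_cast<int>((pc >> 2) % kEntries); }
public:
    bool predict_useful(uint64_t pc) const { return weights_[index(pc)] >= 0; }
    void train(uint64_t pc, bool was_useful) {
        int8_t& w = weights_[index(pc)];
        if (was_useful && w < 31)   ++w;  // reinforce useful insertions
        if (!was_useful && w > -32) --w;  // punish lines evicted unused
    }
};

// Minimal fully associative, LRU-managed cache standing in for the
// per-subcore LoTUS structure; misses consult the predictor, so
// requests predicted to have low utility bypass the small cache.
class LotusCache {
    struct Line {
        std::list<uint64_t>::iterator lru_it;
        uint64_t insert_pc;  // load PC that allocated the line
        bool reused;         // hit again after insertion?
    };
    size_t capacity_;
    std::list<uint64_t> lru_;  // front = most recently used
    std::unordered_map<uint64_t, Line> lines_;
    UtilityPredictor predictor_;
public:
    explicit LotusCache(size_t capacity) : capacity_(capacity) {}

    bool access(uint64_t line_addr, uint64_t pc) {
        auto it = lines_.find(line_addr);
        if (it != lines_.end()) {  // hit: promote and record reuse
            lru_.splice(lru_.begin(), lru_, it->second.lru_it);
            it->second.reused = true;
            predictor_.train(it->second.insert_pc, true);
            return true;
        }
        if (!predictor_.predict_useful(pc)) return false;  // bypass LoTUS
        if (lines_.size() == capacity_) {  // evict LRU victim, learn from outcome
            const Line& victim = lines_.at(lru_.back());
            predictor_.train(victim.insert_pc, victim.reused);
            lines_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(line_addr);
        lines_[line_addr] = {lru_.begin(), pc, false};
        return false;  // miss; line now allocated
    }
};

int main() {
    LotusCache lotus(16);  // 16-line capacity is an assumption
    const uint64_t pcs[] = {0x400a10, 0x400b20};  // hypothetical load PCs
    for (int i = 0; i < 64; ++i) {
        uint64_t hot = (i % 8) * 64;  // small working set: high reuse
        bool hit = lotus.access(hot, pcs[0]);
        lotus.access(0x100000 + i * 64, pcs[1]);  // streaming: never reused
        if (i % 16 == 0) std::printf("iter %d hot-line hit=%d\n", i, hit);
    }
}
```

In this toy run, the predictor learns to keep inserting the hot, reused lines while driving the streaming loads' weight negative so they stop displacing useful data, which is the intuition behind filtering insertions into a minimally sized structure.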
École Polytechnique Fédérale de Lausanne
2025-08-18