Infoscience
EPFL, École polytechnique fédérale de Lausanne
research article

A Low-latency On-chip Cache Hierarchy for Load-to-use Stall Reduction in GPUs

(Nematollahi zadeh) Mahani, Negin (Sadat) • Falahati, Hajar • Darabi, Sina • Javadi-Nezhad, Ahmad • Oh, Yunho • Sadrosadati, Mohammad • Sarbazi-Azad, Hamid • Falsafi, Babak
August 18, 2025
ACM Transactions on Architecture and Code Optimization

The memory hierarchy in Graphics Processing Units (GPUs) is conventionally designed to provide high bandwidth rather than low latency. In particular, because of the high tolerance to load-to-use latency (i.e., the time that warps wait for data fetched by memory loads), GPU L1D caches are optimized for density, capacity, and low power, with latencies that are often orders of magnitude longer than those of conventional CPU caches. However, there are many important classes of data-parallel applications (e.g., graph, tree, priority queue processing, and sparse deep learning applications) that benefit from lower load-to-use latency than that offered by modern GPUs, due to their inherent divergence and low effective Thread-Level Parallelism (TLP). This paper introduces an innovative on-chip cache hierarchy that incorporates a reduced-latency decoupled L1D cache (LoTUS) and its management scheme. LoTUS is a minimally sized, fully associative cache placed in each GPU subcore that captures the primary working set of data-parallel applications. It exploits conventional high-performance, low-density SRAM cells and dramatically reduces load-to-use latency. We also propose an intelligent extension of LoTUS, called LoTUSage, which employs a lightweight learning-based model to predict the utility of caching requests in LoTUS. Evaluation results show that LoTUS and LoTUSage improve average performance by 23.9% and 35.4% and reduce average energy consumption by 27.8% and 38.5%, respectively, for applications suffering from high load-to-use stalls, with negligible area and power overheads.
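The abstract describes LoTUS as a small, fully associative per-subcore cache consulted ahead of the conventional L1D. As a rough illustration of why such a structure can cut average load-to-use latency for reuse-heavy, low-TLP workloads, here is a minimal Python sketch of a two-level lookup; the cache size, line size, replacement policy (LRU), and latency values (1 cycle for the small cache, 28 for L1D) are illustrative assumptions, not figures from the paper.

```python
from collections import OrderedDict

class FullyAssocCache:
    """Tiny fully associative cache with LRU replacement (illustrative)."""
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # tag -> None, ordered by recency

    def access(self, tag):
        """Return True on hit; insert or promote the tag either way."""
        hit = tag in self.lines
        if hit:
            self.lines.move_to_end(tag)       # mark as most recently used
        else:
            if len(self.lines) >= self.num_lines:
                self.lines.popitem(last=False)  # evict the LRU line
            self.lines[tag] = None
        return hit

# Hypothetical latencies (cycles); real values depend on the design.
LOTUS_LAT, L1D_LAT = 1, 28

def avg_load_latency(trace, lotus_lines=8):
    """Average load-to-use latency with a small cache in front of L1D."""
    lotus = FullyAssocCache(lotus_lines)
    total = 0
    for addr in trace:
        tag = addr // 128  # assume 128-byte cache lines
        if lotus.access(tag):
            total += LOTUS_LAT   # fast hit in the small per-subcore cache
        else:
            total += L1D_LAT     # fall back to the slower conventional L1D
    return total / len(trace)

# A reuse-heavy trace (e.g., repeated traversal of a small working set):
# nearly every load hits the small cache, so the average latency
# approaches LOTUS_LAT rather than L1D_LAT.
trace = [128 * (i % 4) for i in range(1000)]
print(avg_load_latency(trace))
```

The sketch captures only the latency argument, not LoTUSage's learned bypass decision, which would additionally filter which misses are allowed to allocate into the small cache.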

Details
Type
research article
DOI
10.1145/3760782
Author(s)
(Nematollahi zadeh) Mahani, Negin (Sadat)
Falahati, Hajar
Darabi, Sina
Javadi-Nezhad, Ahmad
Oh, Yunho
Sadrosadati, Mohammad
Sarbazi-Azad, Hamid
Falsafi, Babak
École Polytechnique Fédérale de Lausanne
Date Issued
2025-08-18
Publisher
Association for Computing Machinery (ACM)
Published in
ACM Transactions on Architecture and Code Optimization
Article Number
3760782
Editorial or Peer reviewed
REVIEWED
Written at
EPFL
EPFL units
PARSA
Available on Infoscience
August 20, 2025
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/253230
Contact: infoscience@epfl.ch

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved.