Infoscience

research article

Predictive Reliability and Fault Management in Exascale Systems

Canal, Ramon; Hernandez, Carles; Tornero, Rafa; et al.
December 1, 2020
ACM Computing Surveys

In future Exascale systems, performance and power constraints come together with Complementary Metal Oxide Semiconductor (CMOS) technology scaling. Technology scaling makes each individual transistor more prone to faults and, because the number of devices per chip increases exponentially, leads to higher system fault rates. Consequently, High-Performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from the electronics level to the system level, and analyzes their strengths and limitations. Finally, it identifies promising paths to meet the reliability levels required by Exascale systems.

Name: CSUR2020-postprint.pdf
Type: Postprint
Version: Accepted version
Access type: Open access
License condition: CC BY
Size: 647.49 KB
Format: Adobe PDF
Checksum (MD5): 52f5bec44f01bcc82cd765f12e4c0ef3
