Predictive Reliability and Fault Management in Exascale Systems

Canal, Ramon; Hernandez, Carles; Tornero, Rafa; Cilardo, Alessandro; Massari, Giuseppe; Reghenzani, Federico; Fornaciari, William; Zapater, Marina; Atienza, David; Oleksiak, Ariel; Piątek, Wojciech; Abella, Jaume

doi:10.1145/3403956

2020

Abstract

Performance and power constraints come together with Complementary Metal Oxide Semiconductor technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, because the number of devices per chip grows exponentially, raises system fault rates. Consequently, High-performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from electronics to system level. We analyze their strengths and limitations. Finally, we identify the promising paths to meet the reliability levels of Exascale systems.

Details

Title Predictive Reliability and Fault Management in Exascale Systems

Author(s) Canal, Ramon ; Hernandez, Carles ; Tornero, Rafa ; Cilardo, Alessandro ; Massari, Giuseppe ; Reghenzani, Federico ; Fornaciari, William ; Zapater, Marina ; Atienza, David ; Oleksiak, Ariel ; Piątek, Wojciech ; Abella, Jaume

Published in ACM Computing Surveys

Volume 53

Issue 5

Pages 1-32

Date 2020-12-01

Keywords

Computer Systems Organization; Grid Computing; Reliability; Fault Management; Exascale

DOI https://doi.org/10.1145/3403956

Laboratories ESL

Record Appears in Scientific production and competences > STI - School of Engineering > IEM - Institut d'Electricité et de Microtechnique > ESL - Embedded Systems Laboratory
Peer-reviewed publications
Work produced at EPFL
Journal Articles
Published

Record creation date 2020-10-08
