Infoscience

conference paper

SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing

Ropars, Thomas • Martsinkevich, Tatiana • Guermouche, Amina • Schiper, André • Cappello, Franck
2013
SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
International Conference for High Performance Computing, Networking, Storage and Analysis (SC'13)

The high failure rate expected for future supercomputers requires the design of new fault-tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: we identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation between the events of such applications, called the always-happens-before relation. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, hierarchically combines coordinated checkpointing and message logging. It is the first protocol that provides failure containment without reliably logging any information apart from process checkpoints, and it does so without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate good performance of our protocol during both failure-free execution and recovery.

Type
conference paper
DOI
10.1145/2503210.2503271
Author(s)
Ropars, Thomas  
Martsinkevich, Tatiana
Guermouche, Amina
Schiper, André  
Cappello, Franck
Date Issued

2013

Published in
SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Subjects

HPC • MPI • Fault tolerance • Checkpointing • Channel determinism

Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
LSR-IC  
Event name
International Conference for High Performance Computing, Networking, Storage and Analysis (SC'13)

Event place
Denver, Colorado, USA

Event date
November, 2013

Available on Infoscience
October 10, 2013
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/96178