conference paper

HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications

Guermouche, Amina
•
Ropars, Thomas  
•
Snir, Marc  
•
Cappello, Franck
2012
2012 IEEE 26th International Parallel and Distributed Processing Symposium
26th IEEE International Parallel & Distributed Processing Symposium

High performance computing will probably reach exascale in this decade. At this scale, the mean time between failures is expected to be a few hours. Existing fault-tolerant protocols for message-passing applications will no longer be efficient, since they either require a global restart after a failure (checkpointing protocols) or result in huge memory occupation (message logging). Hybrid fault-tolerant protocols overcome these limits by dividing application processes into clusters and applying a different protocol within and between clusters. Combining coordinated checkpointing inside the clusters with message logging for inter-cluster messages confines the consequences of a failure to a single cluster while logging only a subset of the messages. However, in existing hybrid protocols, event logging is required for all application messages to ensure a correct execution after a failure. This can significantly impair failure-free performance. In this paper, we propose HydEE, a hybrid rollback-recovery protocol for send-deterministic message-passing applications that provides failure containment without logging any event, while logging only a subset of the application messages. We prove that HydEE can handle multiple concurrent failures by relying on the send-deterministic execution model. Experimental evaluations of our implementation of HydEE in the MPICH2 library show that it introduces almost no overhead on failure-free execution.
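
The clustering idea described in the abstract can be illustrated with a small sketch. This is not HydEE's implementation and does not model send-determinism or the recovery protocol itself; it only shows, under assumed inputs, which messages a hybrid protocol would log (those crossing a cluster boundary) and which processes would roll back after a failure (those in the failed process's cluster). The names `messages_to_log`, `ranks_to_roll_back`, and `cluster_of`, and the four-rank layout, are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A message is modeled as (sender rank, receiver rank). Hypothetical model.
Message = Tuple[int, int]


def messages_to_log(messages: List[Message],
                    cluster_of: Dict[int, int]) -> List[Message]:
    """Return only the messages whose sender and receiver are in different clusters."""
    return [(s, r) for (s, r) in messages if cluster_of[s] != cluster_of[r]]


def ranks_to_roll_back(failed_rank: int,
                       cluster_of: Dict[int, int]) -> Set[int]:
    """Return every rank that shares a cluster with the failed rank."""
    target = cluster_of[failed_rank]
    return {rank for rank, cluster in cluster_of.items() if cluster == target}


# Assumed layout: ranks 0 and 1 form cluster 0, ranks 2 and 3 form cluster 1.
cluster_of = {0: 0, 1: 0, 2: 1, 3: 1}
messages = [(0, 1), (1, 2), (2, 3), (3, 0)]

logged = messages_to_log(messages, cluster_of)
rolled_back = ranks_to_roll_back(2, cluster_of)

print(logged)               # [(1, 2), (3, 0)]
print(sorted(rolled_back))  # [2, 3]
```

In this toy layout only two of the four messages are logged, and a failure of rank 2 rolls back ranks 2 and 3 while ranks 0 and 1 keep running.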

Type
conference paper
DOI
10.1109/IPDPS.2012.111
Author(s)
Guermouche, Amina
Ropars, Thomas  
Snir, Marc  
Cappello, Franck
Date Issued

2012

Published in
2012 IEEE 26th International Parallel and Distributed Processing Symposium
Start page

1216

End page

1227

Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
LSR-IC  
Event name
26th IEEE International Parallel & Distributed Processing Symposium

Event place
Shanghai, China

Event date
May 21-25, 2012

Available on Infoscience
May 17, 2012
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/80518