Replication for Send-Deterministic MPI HPC Applications

Replication has recently gained attention in the context of fault tolerance for large scale MPI HPC applications. Existing implementations try to cover all MPI codes and to be independent from the underlying library. In this paper, we evaluate the advantages of adopting a different approach. First, we try to take advantage of a communication property common to many MPI HPC application, namely send-determinism. Second, we choose to implement replication inside the MPI library. The main advantage of our approach is simplicity. While being only a small patch to the Open MPI library, our solution called SDR-MPI supports most main features of the MPI standard including all collectives and group operations. SDR-MPI additionally achieves good performance: Experiments run with HPC benchmarks and applications show that its overhead remains below 5%.

Presented at:
3rd Workshop on Fault-Tolerance for HPC at Extreme Scale, New-York City, USA, June, 2013

 Record created 2013-10-10, last modified 2018-01-28

External link:
Download fulltext
Rate this document:

Rate this document:
(Not yet reviewed)