SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing

Ropars, Thomas; Martsinkevich, Tatiana; Guermouche, Amina; Schiper, André; Cappello, Franck

doi:10.1145/2503210.2503271

conference paper

2013

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

International Conference for High Performance Computing, Networking, Storage and Analysis (SC'13)

The high failure rate expected for future supercomputers requires the design of new fault-tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: we identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation between the events of such applications, called the always-happens-before relation. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, combines coordinated checkpointing and message logging in a hierarchical way. It is the first protocol that provides failure containment without reliably logging any information apart from process checkpoints, and it does so without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate that our protocol performs well during both failure-free execution and recovery.

Name: sc2013.pdf
Type: Preprint
Access type: open access
Size: 433.12 KB
Format: Adobe PDF
Checksum (MD5): 7cce7083aaa58574aa30854b2ca1697b