Measured Performance of Consistent Checkpointing

Consistent checkpointing provides transparent fault tolerance for long-running distributed applications. In this paper we describe performance measurements of an implementation of consistent checkpointing. Our measurements show that consistent checkpointing performs remarkably well. We executed eight compute-intensive distributed applications on a network of 16 diskless Sun-3/60 workstations, comparing the performance without checkpointing to the performance with consistent checkpoints taken at 2-minute intervals. For six of the eight applications, the running time increased by less than 1% as a result of the checkpointing. The highest overhead measured for any of the applications was 5.8%. Incremental checkpointing and copy-on-write checkpointing were the most effective techniques in lowering the running time overhead. These techniques reduce the amount of data written to stable storage and allow the checkpoint to proceed concurrently with the execution of the processes. The overhead of synchronizing the individual process checkpoints to form a consistent global checkpoint was much smaller. We argue that these measurements show that consistent checkpointing is an efficient way to provide fault tolerance for long-running distributed applications.
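
As an illustration of why incremental checkpointing reduces the data written to stable storage, the following is a minimal Python sketch, not the implementation measured in the paper. It assumes a toy model in which process memory is a byte array divided into fixed-size pages; the class name `IncrementalCheckpointer`, the `PAGE_SIZE` value, and the in-memory dictionary standing in for stable storage are all hypothetical. Only pages modified since the previous checkpoint are written.

```python
PAGE_SIZE = 4096  # assumed page size, chosen only for illustration


class IncrementalCheckpointer:
    """Toy model: memory is a bytearray split into fixed-size pages."""

    def __init__(self, memory_size):
        self.memory = bytearray(memory_size)
        # Every page starts dirty, so the first checkpoint is a full one.
        self.dirty = set(range(self.num_pages()))
        # Hypothetical stand-in for stable storage: page index -> bytes.
        self.stable_storage = {}

    def num_pages(self):
        return (len(self.memory) + PAGE_SIZE - 1) // PAGE_SIZE

    def write(self, offset, data):
        """Modify memory and record which pages were touched."""
        if not data:
            return
        if offset < 0 or offset + len(data) > len(self.memory):
            raise ValueError("write outside of memory")
        self.memory[offset:offset + len(data)] = data
        first = offset // PAGE_SIZE
        last = (offset + len(data) - 1) // PAGE_SIZE
        self.dirty.update(range(first, last + 1))

    def checkpoint(self):
        """Save only the dirty pages; return how many pages were written."""
        written = 0
        for page in sorted(self.dirty):
            start = page * PAGE_SIZE
            self.stable_storage[page] = bytes(self.memory[start:start + PAGE_SIZE])
            written += 1
        self.dirty.clear()
        return written
```

In this sketch, a process that touches only a few pages between checkpoints writes only those pages at the next checkpoint. Copy-on-write checkpointing, which lets the process keep running while the pages are being saved, is not modeled here.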

Presented at:
Proceedings of the Eleventh Symposium on Reliable Distributed Systems, October 1992

 Record created 2005-10-20, last modified 2018-03-17
