Files

Abstract

Consistent checkpointing provides transparent fault tolerance for long-running distributed applications. In this paper we describe performance measurements of an implementation of consistent checkpointing. Our measurements show that consistent checkpointing performs remarkably well. We executed eight compute-intensive distributed applications on a network of 16 diskless Sun-3/60 workstations, comparing the performance without checkpointing to the performance with consistent checkpoints taken at 2-minute intervals. For six of the eight applications, the running time increased by less than 1% as a result of the checkpointing. The highest overhead measured for any of the applications was 5.8%. Incremental checkpointing and copy-on-write checkpointing were the most effective techniques in lowering the running time overhead. These techniques reduce the amount of data written to stable storage and allow the checkpoint to proceed concurrently with the execution of the processes. The overhead of synchronizing the individual process checkpoints to form a consistent global checkpoint was much smaller. We argue that these measurements show that consistent checkpointing is an efficient way to provide fault tolerance for long-running distributed applications.
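To illustrate why incremental checkpointing reduces the amount of data written to stable storage, the following is a minimal sketch, not the implementation measured here. It assumes a toy memory model: the class `IncrementalCheckpointer`, its `dirty` set, and the dictionary standing in for stable storage are all hypothetical names, and a real system would likely rely on page-protection hardware rather than an explicit `write` method to detect modified pages.

```python
class IncrementalCheckpointer:
    """Toy model of incremental checkpointing (illustrative assumption only).

    Memory is a list of fixed-size pages. A set of dirty page indices stands
    in for the page-protection mechanism a real system would use.
    """

    def __init__(self, num_pages, page_size=4):
        self.pages = [bytes(page_size) for _ in range(num_pages)]
        # Every page is dirty at first, so the first checkpoint is a full one.
        self.dirty = set(range(num_pages))
        # Stand-in for stable storage: page index -> last saved contents.
        self.stable = {}

    def write(self, index, data):
        """Modify a page and record that it changed since the last checkpoint."""
        self.pages[index] = data
        self.dirty.add(index)

    def checkpoint(self):
        """Save only the pages modified since the previous checkpoint.

        Returns the sorted list of page indices that were written.
        """
        written = sorted(self.dirty)
        for i in written:
            self.stable[i] = self.pages[i]
        self.dirty.clear()
        return written

    def restore(self):
        """Rebuild the memory image as of the most recent checkpoint."""
        return [self.stable[i] for i in range(len(self.pages))]


cp = IncrementalCheckpointer(num_pages=8)
first = cp.checkpoint()      # full checkpoint: all 8 pages are written
cp.write(2, b"abcd")
cp.write(5, b"wxyz")
second = cp.checkpoint()     # incremental: only pages 2 and 5 are written
```

In this sketch the second checkpoint writes two pages instead of eight. Copy-on-write checkpointing addresses a different cost: it lets the process keep running while the saved pages are being written out, and it is not modeled here.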

Details