Measured Performance of Consistent Checkpointing
Consistent checkpointing provides transparent fault tolerance for long-running distributed applications. In this paper we describe performance measurements of an implementation of consistent checkpointing. Our measurements show that consistent checkpointing performs remarkably well. We executed eight compute-intensive distributed applications on a network of 16 diskless Sun-3/60 workstations, comparing the performance without checkpointing to the performance with consistent checkpoints taken at 2-minute intervals. For six of the eight applications, the running time increased by less than 1% as a result of the checkpointing. The highest overhead measured for any of the applications was 58%. Incremental checkpointing and copy-on-write checkpointing were the most effective techniques in lowering the running time overhead. These techniques reduce the amount of data written to stable storage and allow the checkpoint to proceed concurrently with the execution of the processes. The overhead of synchronizing the individual process checkpoints to form a consistent global checkpoint was much smaller. We argue that these measurements show that consistent checkpointing is an efficient way to provide fault tolerance for long-running distributed applications.
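To illustrate why incremental checkpointing reduces the amount of data written to stable storage, the following is a minimal sketch and not the implementation measured in the paper. It assumes a toy model in which process memory is a list of fixed-size pages and a set records which pages were modified since the last checkpoint. The class name, the page size, and the in-memory dictionary standing in for stable storage are all hypothetical.

```python
PAGE_SIZE = 4096  # hypothetical page size in bytes


class IncrementalCheckpointer:
    """Toy model: only pages written since the last checkpoint are saved."""

    def __init__(self, num_pages):
        self.pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        # Every page starts dirty, so the first checkpoint is a full one.
        self.dirty = set(range(num_pages))
        # Stand-in for stable storage: page index -> last saved contents.
        self.stable_storage = {}

    def write(self, index, data):
        """Application write: replace a page and mark it dirty."""
        self.pages[index] = data
        self.dirty.add(index)

    def checkpoint(self):
        """Save only the dirty pages and return the number of bytes written."""
        written = 0
        for index in sorted(self.dirty):
            self.stable_storage[index] = self.pages[index]
            written += len(self.pages[index])
        self.dirty.clear()
        return written

    def restore(self):
        """Rebuild the full memory image from the saved pages."""
        return [self.stable_storage[i] for i in range(len(self.pages))]


ckpt = IncrementalCheckpointer(num_pages=8)
full_bytes = ckpt.checkpoint()         # first checkpoint saves all 8 pages
ckpt.write(3, b"x" * PAGE_SIZE)        # the application modifies one page
incremental_bytes = ckpt.checkpoint()  # second checkpoint saves only that page
```

In this sketch the second checkpoint writes one page instead of eight. A real system would typically detect modified pages through virtual-memory protection rather than an explicit write call, and copy-on-write would additionally let the process keep running while the saved pages are written out.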