Measured Performance of Consistent Checkpointing
Consistent checkpointing provides transparent fault tolerance for long-running distributed applications. In this paper we describe performance measurements of an implementation of consistent checkpointing. Our measurements show that consistent checkpointing performs remarkably well. We executed eight compute-intensive distributed applications on a network of 16 diskless Sun-3/60 workstations, comparing the performance without checkpointing to the performance with consistent checkpoints taken at 2-minute intervals. For six of the eight applications, the running time increased by less than 1% as a result of the checkpointing. The highest overhead measured for any of the applications was 5.8%. Incremental checkpointing and copy-on-write checkpointing were the most effective techniques in lowering the running time overhead. These techniques reduce the amount of data written to stable storage and allow the checkpoint to proceed concurrently with the execution of the processes. The overhead of synchronizing the individual process checkpoints to form a consistent global checkpoint was much smaller. We argue that these measurements show that consistent checkpointing is an efficient way to provide fault tolerance for long-running distributed applications.
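As an illustration of why incremental checkpointing reduces the amount of data written to stable storage, the following is a minimal Python sketch. It is not the implementation measured in the paper. The names PagedMemory, incremental_checkpoint, and PAGE_SIZE are hypothetical, and an explicit dirty set stands in for whatever mechanism a real system would use to detect modified pages.

```python
PAGE_SIZE = 4  # hypothetical, tiny page size so the example is easy to follow


class PagedMemory:
    """Toy address space that records which pages were written (hypothetical)."""

    def __init__(self, num_pages):
        self.pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        # Before the first checkpoint every page counts as modified.
        self.dirty = set(range(num_pages))

    def write(self, page_no, data):
        self.pages[page_no] = data
        self.dirty.add(page_no)


def incremental_checkpoint(memory, stable_storage):
    """Copy only the pages modified since the previous checkpoint.

    stable_storage is a plain dict standing in for stable storage.
    Returns the number of pages written.
    """
    written = 0
    for page_no in sorted(memory.dirty):
        stable_storage[page_no] = memory.pages[page_no]
        written += 1
    memory.dirty.clear()
    return written


mem = PagedMemory(num_pages=8)
storage = {}
first = incremental_checkpoint(mem, storage)   # first checkpoint writes all pages
mem.write(3, b"abcd")
mem.write(5, b"wxyz")
second = incremental_checkpoint(mem, storage)  # later checkpoint writes only modified pages
```

In this toy run the first checkpoint writes every page, while the second writes only the two pages modified in between. Copy-on-write checkpointing, which lets the process keep running while the checkpoint is written, is not modeled here.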