Failure Detectors: Implementation issues and impact on consensus performance

Nicole Sergent, Xavier Defago, and Andre Schiper

ABSTRACT

Due to their nature, distributed systems are vulnerable to failures of some of their parts. Conversely, distribution also provides a way to increase the fault tolerance of the overall system. However, achieving fault tolerance is not simple and requires complex techniques. An agreement problem known as consensus is at the heart of most problems encountered in the design of a fault-tolerant system. This problem is, however, not solvable in the asynchronous system model unless the model is augmented with adequate failure detectors. The resulting system model is time-free, since all timing issues are abstracted by the characteristics of the failure detectors.

It is sometimes claimed that time-based system models are more realistic than time-free models for solving distributed agreement problems. The goal of this paper is to show that solving consensus in the asynchronous system model augmented with failure detectors does not preclude considering timing issues. We consider a consensus algorithm combined with various implementations of failure detectors and analyse their impact on its termination time. This study shows that the design of fault-tolerant distributed algorithms in the asynchronous system model augmented with failure detectors is _orthogonal_ to the issue of implementing the actual failure detectors. This cleanly decouples logical issues (proofs of safety and liveness of an algorithm) from engineering issues (e.g., performance and timing constraints).


    • LSR-REPORT-1999-008

    Record created on 2005-05-20, modified on 2016-08-08
