Solving consensus from fair-lossy channels to crash-recovery of processes

Distributed systems are the basis of widespread computing facilities enabling many of our daily life activities. Telebanking, electronic commerce, online booking-reservation, and telecommunication are examples of common services that rely on distributed systems in order to achieve their desirable ubiquity. The success of these services increasingly depends on fault-tolerant systems to preserve their availability and reliability despite partial failures. Fault tolerance can be achieved by introducing redundancy into the system. This leads to the need of coordinating replicated components which has been one of the biggest trends in software technologies. This thesis is about process coordination in fault-tolerant distributed systems. In particular, it focuses on the problem of reaching agreement among processes considering the well-known Consensus problem. Consensus is regarded as an abstract problem whose solutions embody the complexity inherent to most agreement problems. This thesis studies the solvability of consensus in asynchronous network systems augmented with a failure detector oracle. Both, the processes and the network, are subject to benign failures. The thesis' overall contribution is a consensus algorithm that tolerates process crash and network omission failures, and exploits the recovery of crashed process. A contribution of this thesis is the definition of a novel type of weakly reliable communication channels, called Stubborn. The semantics of Stubborn channels reflect the reliability on communication required to solve consensus. Furthermore, it is shown that an algorithm designed with Stubborn in-stead of reliable channels can offer the same number of messages exchanged in favorable runs and avoid communication steps in disadvantageous runs. Stubborn channels can be easily implemented on top of existing network layers and make reasonable requirements on message buffering space. To handle the crash and recovery of processes it is proposed a model which encompasses the different patterns of crashes and recoveries that can be exhibited by the processes. The approach considers two different problems concerning crash and recovery: (1) processes that can be disruptive because they are always crashing and recovering, and (2) amnesia failures caused by crashes. The solution for the first problem consists in the use of a new class of failure detectors distinguished by its recurrent strong completeness property. To cope with the crash and recovery of any process involved, process amnesia cannot be arbitrary and thus processes are equipped with a stable storage device. The proposed algorithm judiciously specifies accesses to stable storage, and by exploiting the recovery of processes it can solve consensus in scenarios where other algorithms would otherwise block.

Schiper, André
Lausanne, EPFL

Note: The status of this file is: EPFL only

 Record created 2005-03-16, last modified 2018-01-27

Rate this document:

Rate this document:
(Not yet reviewed)