The Overhead of Indulgent Failure Recovery
Many reliable distributed systems are consensus-based and typically operate under two modes: a fast normal mode in failure-free synchronous periods, and a slower recovery mode following asynchrony and failures. A lot of work has been devoted to optimize the normal mode, but little has focused on optimizing the recovery mode. This paper seeks to understand whether the recovery mode is inherently slower than the normal mode. In particular, we consider consensus algorithms in the round-based eventually synchronous model of [DLS88], where t out of n processes may fail by crashing, messages may be lost, and the system may be asynchronous for arbitrarily long, but eventually the system becomes synchronous and no new failure occurs (we say that the system becomes stable). For t >= n/3, we prove a lower bound of three rounds for achieving a global decision whenever the system becomes stable, and we contrast this with a bound of two rounds when t < n/3. We then give matching algorithms for both t >= n/3 and t < n/3.
Record created on 2006-10-05, modified on 2016-08-08