The Overhead of Consensus Failure Recovery
Many reliable distributed systems are consensus-based and typically operate under two modes: a fast normal mode in failure-free periods, and a slower recovery mode following failures. A lot of work has been devoted to optimizing the normal mode, but little has been devoted to optimizing the recovery mode. This work seeks to understand whether the recovery mode is inherently slower than the normal mode. We focus on consensus algorithms that tolerate arbitrary periods of asynchrony as well as the crash of a minority of processes, and study the time-complexity of their recovery mode in periods of synchrony. We show that recovery induces an inherent overhead of one communication round. More precisely, we show that consensus algorithms that can tolerate arbitrary periods of asynchrony, require three rounds even in synchronous runs where all crashes are initial. This is in contrast to the well-known tight bound of two rounds on failure-free synchronous runs of such algorithms. We also give for the first time a consensus algorithm that is optimal in both the normal mode and the recovery mode of synchronous runs. Moreover, our algorithm recovers from arbitrary periods of asynchrony significantly faster than any other consensus algorithm we are familiar with. The algorithm uses the weakest failure detector for consensus, Omega.