State Machine Replication: from Analytical Evaluation to High-Performance Paxos
Since their invention more than half a century ago, computers have gone from being just an handful of expensive machines each filling an entire room, to being an integral part of almost every aspect of modern life. Nowadays computers are everywhere: in our planes, in our cars, on our desks, in our home appliances, and even in our pockets. This widespread adoption had a profound impact in our world and in our lives, so much that now we rely on them for many important aspects of everyday life, including work, communication, travel, entertainment, and even managing our money. Given our increased reliance on computers, their continuous and correct operation has become essential for modern society. However, individual computers can fail due to a variety of causes and, if nothing is done about it, these failures can easily lead to a disruption of the service provided by computer system. The field of fault tolerance studies this problem, more precisely, it studies how to enable a computer system to continue operation in spite of the failure of individual components. One of the most popular techniques of achieving fault tolerance is software replication, where a service is replicated on an ensemble of machines (replicas) such that if some of these machines fail, the others will continue providing the service. Software replication is widely used because of its generality (can be applied to most services) and its low cost (can use off-the-shelf hardware). This thesis studies a form of software replication, namely, state machine replication, where the service is modeled as a deterministic state machine whose state transitions consist of the execution of client requests. Although state machine replication was first proposed almost 30 years ago, the proliferation of online services during the last years has led to a renewed interest. Online services must be highly available and for that they frequently rely on state machine replication as part of their fault tolerance mechanisms. However, the unprecedented scale of these services, which frequently have hundreds of thousands or even millions of users, leads to a new set performance requirements on state machine replication. This thesis is organized in two parts. The goal of the first part is to study from a theoretical perspective the performance characteristics of the algorithms behind state machine replication and to propose improved variants of such algorithms. The second part looks at the problem from a practical perspective, proposing new techniques to achieve high-throughput and scalability. In the first part, we start with an analytical analysis of the performance of two consensus algorithms, one leader-free (an adaptation of the fast round of Fast Paxos) and another leader-based (an adaptation of classical Paxos). We express these algorithms in the Heard-Of round model and show that using this model it is fairly easy to determine analytically several interesting performance metrics. We then study the performance of round models in general. Round models are perceived as inefficient because in their typical implementation, the real-time duration of rounds is proportional to the (pessimistic) timeouts used on the underlying system. This contrasts with the failure detector or the partial synchronous system models, where algorithms usually progress at the speed of message reception. We show that there is no inherent gap in performance between the models, by proposing a round implementation that during stable periods advances at the speed of message reception. We conclude the first part by presenting a new leader election algorithm that chooses as leader a well-connected process, that is, a process whose time needed to perform a one-to-majority communication round is among the lowest in the system. This is useful mainly in systems where the latency between processes is not homogeneous, because the performance of leader-based algorithms is particularly sensitive to the performance and connectivity of the process acting as a leader. The second part of the thesis studies different approaches to achieve high-throughput with state machine replication. To support the experimental work done in this part, we have developed JPaxos, a fully-featured implementation of Paxos in Java. We start by looking at how to tune the batching and pipelining optimizations of Paxos; using an analytical model of the performance of Paxos we show how to derive good values for the bounds on the batch size and number of parallel instances. We then propose an architecture for implementing replicated state machines that is capable of leveraging multi-core CPUs to achieve very high-levels of performance. The final contribution of this thesis is based on the observation that most implementations of state machine replication have an unbalanced division of work among threads, with one replica, the leader, having a significantly higher workload than the other replicas. Naturally, the leader becomes the bottleneck of the system, while other replicas are only lightly loaded. We propose and evaluate S-Paxos, which evenly balances the workload among all replicas, and thus overcomes the leader bottleneck. The benefits are two-fold: S-Paxos achieves a higher throughput for a given number of replicas and its performance increases with the number of replicas (up to a reasonable number).
Keywords: Distributed algorithms ; fault tolerance ; consensus problem ; partial synchrony ; heard-of model ; round model ; round-based algorithms ; quantitative analysis ; swift algorithms ; state machine replication ; Paxos ; leader election ; decentralized algorithms ; multi-core ; high-throughput ; scalability ; Algorithmes répartis ; tolérance aux fautes ; problème de consensus ; système partiellement synchrone ; modèle "heard-of" ; modèle en rondes ; algorithmes en rondes ; analyse quantitative ; algorithmes "swift" ; réplication de machines d'état ; Paxos ; élection du leader ; algorithmes décentralisés ; multi-coeurs ; haute performance ; passage à l'échelleThèse École polytechnique fédérale de Lausanne EPFL, n° 5410 (2012)
Programme doctoral Informatique, Communications et Information
Faculté informatique et communications
Institut d'informatique fondamentale
Laboratoire de systèmes répartis
Record created on 2012-08-15, modified on 2016-12-12