System Support for Efficient Replication in Distributed Systems

Current online applications, such as search engines, social networks, or file sharing services, execute across a distributed network of machines. They provide non-stop services to their users despite failures in the underlying network. To achieve such a high level of reliability, these applications rely on a simple technique: replication. Briefly, there are copies of the application on multiple machines, such that no specific machine represents a single point of failure. Under this apparent simplicity, however, reliability comes at a steep price. This is because replication entails certain side-effects (e.g., overheads or costs, inconsistencies, or tradeoffs) which often lead to inefficient designs and want for better performance. The high-level problem we study in this dissertation is that of obtaining good performance in replicated systems. We seek to reduce--€”and when irreducible, hide--€”the effects of replication. We approach this problem from three fronts, as follows. First, we look at State Machine Replication (SMR). This is a classic technique for achieving strong consistency in replicated systems. Similar to other techniques for strong consistency, SMR brings a substantial performance overhead. Additionally, this overhead gets worse with growing system size: There is a performance decay as an SMR system gets larger. We shed light on this issue by deploying five SMR systems on up to 100 replicas and reporting on their performance decay. Towards mitigating this issue, we introduce Carousel, an SMR system based on a ring overlay network which can alleviate performance decay. Second, we discuss a technique for hiding the cost of strong consistency. Given the high overhead of accessing data with strong consistency, we propose that applications combine multiple consistency models. We introduce programming support for doing so, through an abstraction called Correctables. In conjunction with a speculation-based technique, we show how Correctables can lower the latency for strongly-consistent operations. Third, we propose a technique to bypass SMR (and avoid its costs) in a concrete replicated application. We focus on the problem of implementing token transfer applications (e.g., online payments). We introduce the abstraction of exclusive token accounts, or Exa, supporting asynchronous transfers. We also design and build Astro, a system implementing the Exa abstraction. This system departs from classic consensus-based solutions (i.e., SMR), and relies instead on a broadcast primitive, a more efficient and tractable building block.

Guerraoui, Rachid
Lausanne, EPFL

Note: The status of this file is: Anyone

 Record created 2019-09-05, last modified 2020-10-24

Download fulltext

Rate this document:

Rate this document:
(Not yet reviewed)