HovercRaft: Achieving Scalability and Fault-tolerance for microsecond-scale Datacenter Services

Kogias, Marios; Bugnion, Edouard

doi:10.1145/3342195.3387545

conference paper not in proceedings

2020

EuroSys 2020

Cloud platform services must simultaneously be scalable, meet low tail latency service-level objectives, and be resilient to a combination of software, hardware, and network failures. Replication plays a fundamental role in meeting both the scalability and the fault-tolerance requirements, but is subject to opposing requirements: (1) scalability is typically achieved by relaxing consistency; (2) fault-tolerance is typically achieved through the consistent replication of state machines. Adding nodes to a system can therefore either increase performance at the expense of consistency, or increase resiliency at the expense of performance. We propose HovercRaft, a new approach by which adding nodes increases both the resilience and the performance of general-purpose state-machine replication. We achieve this through an extension of the Raft protocol that carefully eliminates CPU and I/O bottlenecks and load balances requests. Our implementation uses state-of-the-art kernel-bypass techniques, datacenter transport protocols, and in-network programmability to deliver up to 1 million operations/second for clusters of up to 9 nodes, linear speedup over an unreplicated configuration for selected workloads, and a 4× speedup for the YCSB-E benchmark running on Redis over an unreplicated deployment.

Name: paper.pdf
Type: Publisher's Version
Version: Published version
Access type: openaccess
License Condition: Copyright
Size: 749.63 KB
Format: Adobe PDF
Checksum (MD5): ba31af60c2f5b2823b747145bd1a53c8