Building Strongly-Consistent Systems Resilient to Failures, Partitions, and Slowdowns
Distributed systems designers typically strive to improve performance and preserve availability despite failures or attacks, but when strong consistency is also needed, they encounter fundamental limitations. The bottleneck is replica coordination, which is affected by partitions and slowdowns that can occur anywhere. We believe the present ecosystem fails to recognize that not all failures and partitions are equal, at least from a user-centric performance and availability standpoint. Failures distant from a user should intuitively be less likely to affect that user. Today's ecosystem fails this test, however, despite high-availability best practices. For example, correlated and cascading failures, caused by misconfiguration, bugs, and network partitions, often invalidate the cloud's assumptions of failure independence. Likewise, large-scale or accurately targeted routing or denial-of-service attacks can slow or halt a distributed ledger or compromise its security.
We believe that distributed systems designers and practitioners can and should build reliable, responsive systems by making Lamport exposure and asynchrony central design considerations. We propose that distributed services need not and should not expose local activities to distant failures or partitions, no matter how severe. Limix is the first exposure-limiting metadata configuration service that addresses this problem. Limix insulates global strongly-consistent data-plane services and objects from remote failures and partitions by ensuring that the definitive, strongly-consistent metadata for every object is always confined to the same zone as the object itself. Nyle is a trust-but-verify distributed ledger architecture that limits transaction exposure. While employing design principles similar to those of Limix, Nyle additionally supports an environment with Byzantine nodes and potentially compromised regions with an elevated presence of attackers, and enables asymmetric user trust preferences. Both Limix and Nyle outperform related work in terms of availability, at a manageable overhead. We also demonstrate, through the design of QSC, that consensus protocols can deal with network asynchrony without relying on common coins, which has the potential to make consensus more responsive and more practical.
EPFL_TH8595.pdf