## The Complexity of a Reliable Distributed System

Studying the complexity of distributed algorithms typically boils down to evaluating how the number of messages exchanged (resp. communication steps performed or shared memory operations executed) by nodes to reliably achieve some common task evolves with the number $n$ of these nodes. But what about the complexity of building the distributed system itself? How does the number of physical network components (e.g., channels and intermediary nodes acting as routers) needed to build a system of $n$ nodes ensuring some global reliable connectivity property evolve with $n$? Addressing this question lies at the heart of achieving the dream of \emph{elasticity} in so-called \emph{cloud computing}. In this paper, we show for the first time how to construct a distributed system in which any two of the $n$ nodes, for any $n$, remain connected (i.e., able to communicate) with probability at least $\mu$, despite the fact that (a) every other node or channel has an independent probability $\lambda$ of failing, and (b) the number of channels connected to every node is physically bounded by a constant. We show, however, that if we also require any two of the $n$ nodes to maintain a balanced message throughput with constant probability, then $O(n \log^{1+\epsilon} n)$ additional intermediary nodes are necessary and sufficient, where $\epsilon$ is an arbitrarily small constant. Our distributed system constructions, based on the composition of fractal and tree-like graphs, are not claimed to be simple and cost-effective enough to constitute the architectural blueprints of the next generation of cloud data centers with millions of computers. Yet, they might constitute their theoretical backbone.
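The failure model in the abstract — every node or channel fails independently with probability $\lambda$, and each node has a constant number of channels — can be illustrated with a small Monte Carlo sketch. The circulant topology, the `pair_connected` and `estimate_mu` helpers, and all parameters below are illustrative assumptions of ours, not the paper's fractal/tree-based construction:

```python
import random
from collections import deque

def pair_connected(adj, s, t, lam, rng):
    """One trial of the failure model: every node other than s and t,
    and every channel, fails independently with probability lam.
    Returns True iff s can still reach t over surviving components."""
    alive = {v for v in adj if v in (s, t) or rng.random() >= lam}
    # Sample the fate of each undirected channel exactly once.
    channels = {(u, v) for u in adj for v in adj[u] if u < v}
    live = {e for e in channels
            if e[0] in alive and e[1] in alive and rng.random() >= lam}
    # BFS from s over the surviving channels.
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            e = (u, v) if u < v else (v, u)
            if v not in seen and e in live:
                seen.add(v)
                queue.append(v)
    return t in seen

def estimate_mu(adj, s, t, lam, trials=2000, seed=0):
    """Monte Carlo estimate of the probability mu that nodes s and t
    remain connected under per-component failure probability lam."""
    rng = random.Random(seed)
    return sum(pair_connected(adj, s, t, lam, rng) for _ in range(trials)) / trials

# A degree-4 circulant ring: node i links to i +/- 1 and i +/- 2 (mod n),
# so the channel count per node is bounded by the constant 4.
n = 16
adj = {i: sorted({(i + d) % n for d in (1, -1, 2, -2)}) for i in range(n)}
print(estimate_mu(adj, 0, 8, lam=0.05))
```

Running the estimator while scaling $n$ on a fixed-degree family like this one shows the pairwise connectivity probability decaying, which is precisely why the paper's constructions need carefully composed topologies (and, for balanced throughput, $O(n \log^{1+\epsilon} n)$ extra intermediary nodes) rather than a simple ring.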

Year:
2017
