Tail-tolerance as a Systems Principle not a Metric

Bugnion, Edouard

doi:10.1145/3411029.3411032

conference paper

Kogias, Marios • Bugnion, Edouard

January 1, 2020

Proceedings of 2020 4th Asia-Pacific Workshop on Networking, APNet 2020

4th Asia-Pacific Workshop on Networking (APNet)

Tail-latency tolerance (or simply tail-tolerance) is the ability of a system to deliver a response with low latency nearly all the time. It is typically expressed as a system metric (e.g., the 99th or 99.99th percentile latency) or as a service-level objective (SLO; e.g., the maximum throughput at which the tail latency stays below a desired threshold). We advocate instead that modern datacenter systems should incorporate tail-tolerance as a core systems design principle rather than a metric to be observed, and that tail-tolerant systems can be built out of large and complex applications whose individual components may suffer from latency deviations. This is analogous to fault-tolerance, where a fault-tolerant system can be built out of unreliable components. The general solution is for the system to control the applied load and keep it below the level at which the latency SLO would be violated. We propose to augment RPC semantics with an architectural layer that measures the observed tail latency and probabilistically rejects RPC requests, keeping throughput below the level at which the SLO would be violated. Our design is application-independent and makes no assumptions about the request service time distribution. We implemented a proof of concept of such a tail-tolerant layer, called SVEN, using programmable switches. We demonstrate that the approach is suitable even for microsecond-scale RPCs with variable service times. Moreover, our approach does not induce measurable overheads, and can maintain the maximum achieved throughput very close to the load level that would violate the SLO without SVEN.
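
The idea of measuring tail latency and probabilistically rejecting requests can be illustrated with a minimal host-side sketch. This is not SVEN's algorithm, which the abstract describes only at a high level and which runs on programmable switches. The class name `TailTolerantAdmission`, the sliding window, and the fixed-step adjustment of the rejection probability are all assumptions chosen for illustration.

```python
import random
from collections import deque


class TailTolerantAdmission:
    """Hypothetical admission layer; an illustration, not SVEN's actual logic."""

    def __init__(self, slo_us, percentile=0.99, window=1000, step=0.01, seed=None):
        self.slo_us = slo_us                  # latency SLO in microseconds
        self.percentile = percentile          # which tail to track (0.99 = p99)
        self.samples = deque(maxlen=window)   # sliding window of recent latencies
        self.step = step                      # assumed adjustment step per sample
        self.reject_prob = 0.0                # current rejection probability
        self.rng = random.Random(seed)

    def observed_tail(self):
        """Return the empirical tail latency over the window, or None if empty."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        index = min(len(ordered) - 1, int(self.percentile * len(ordered)))
        return ordered[index]

    def record(self, latency_us):
        """Record one completed RPC and nudge the rejection probability."""
        self.samples.append(latency_us)
        if self.observed_tail() > self.slo_us:
            # Tail above the SLO: shed more load.
            self.reject_prob = min(1.0, self.reject_prob + self.step)
        else:
            # Tail within the SLO: admit more load.
            self.reject_prob = max(0.0, self.reject_prob - self.step)

    def admit(self):
        """Probabilistically decide whether to forward a new request."""
        return self.rng.random() >= self.reject_prob


# Usage: call admit() before forwarding a request, record() on each response.
controller = TailTolerantAdmission(slo_us=100, window=100, step=0.1, seed=0)
for _ in range(50):
    controller.record(50)     # responses well within the SLO
for _ in range(20):
    controller.record(500)    # a burst of slow responses
print(controller.observed_tail(), controller.reject_prob)
```

Like the design described above, the sketch makes no assumption about the service time distribution because it reacts only to measured latencies. The step-based update is the simplest possible control rule, and a real deployment would need to tune or replace it.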