Hybrid, Job-Aware, and Preemptive Datacenter Scheduling
Scheduling in datacenters is an important, yet challenging problem. Datacenters are composed of a large number, typically tens of thousands, of commodity computers running a variety of data-parallel jobs. The role of the scheduler is to assign cluster resources to jobs, which is not trivial due to the large scale of the cluster, as well as the high scheduling load (tens of thousands of scheduling decisions per second).
Additionally to scalability, modern datacenters face increasingly heterogeneous workloads composed of long batch jobs, e.g., data analytics, and latency-sensitive short jobs, e.g., operations of user-facing services. In such workloads, and especially if the cluster is highly utilized, it is challenging to avoid short running jobs getting stuck behind long running jobs, i.e. head-of-line blocking.
Schedulers have evolved from being centralized (one single scheduler for the entire cluster) to distributed (many schedulers that take scheduling decisions in parallel). Although distributed schedulers can handle the large-scale nature of datacenters, they trade scheduling latency for accuracy.
The complexity of scheduling in datacenters is exacerbated by the data-parallel nature of the jobs. That is, a job is composed of multiple tasks and the job completes only when all of its tasks complete. A scheduler that takes into account this fact, i.e. job-aware, could use this information to provide better scheduling decisions.
Furthermore, to improve the quality of their scheduling decisions, most of datacenter schedulers use job runtime estimates. Obtaining accurate runtime estimates is, however, far from trivial, and erroneous estimates may lead to sub-optimal scheduling decisions.
Considering these challenges, in this dissertation we argue the following:
(i) that a hybrid centralized/distributed design can get the best of both worlds by scheduling long jobs in a centralized way and short jobs in a distributed way;
(ii) such a hybrid scheduler can avoid head-of-line blocking and provide job-awareness by dynamically partitioning the cluster for short and long jobs and by executing a job to completion once it started;
(iii) a scheduler can dispense with runtime estimates by sharing the resources of a node with preemption, and load balancing jobs among the nodes.
EPFL_TH8892.pdf
openaccess
2.09 MB
Adobe PDF
438eeb4d9725c12abf2111370ffccfd2