Abstract

Deep neural network inference accelerators are deployed at scale to serve online services, but their average load is low because service demand varies over time, leading to poor resource utilization. Unfortunately, reclaiming inference idle cycles is difficult, as no other workload can execute on such custom accelerators. DNN training services offer an opportunity to reclaim inference accelerator idle cycles. However, the inference services' tight latency constraints and the training algorithms' dependence on floating-point arithmetic limit the opportunities for piggybacking training services on inference accelerators. In this thesis, we tackle the challenges that prevent inference DNN accelerators from exposing their idle cycles to training services. We first develop an efficient numeric representation that enables DNN training with accuracy similar to single-precision floating point and energy efficiency similar to 8-bit fixed point. Then, we explore the inference accelerator design space to show that, unlike in current latency-optimal platforms, relaxing latency constraints and using batching-optimized ALU arrays achieves near-optimal throughput for a given area and power envelope. High-throughput inference accelerators maximize the opportunities to piggyback training. Finally, we present Equinox, a family of inference accelerators designed to piggyback training. Equinox employs a uniform encoding and a priority hardware scheduler that processes training requests during inference idle cycles without affecting inference tail latency. Overall, we show that exposing accelerator idle cycles to training services uncovers significant computing power for training at a small overhead to inference, improving overall datacenter efficiency.
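The numeric representation mentioned above combines fixed-point-like arithmetic cost with floating-point-like dynamic range. A minimal sketch of one way such a shared-exponent, block-based encoding can work is shown below; the function names, block size, and mantissa width are illustrative assumptions for exposition, not the thesis's actual design or parameters.

```python
# Hypothetical sketch of a block floating-point style encoding: each block of
# values shares one exponent, and each value keeps only a narrow fixed-point
# mantissa. Block size and mantissa width are illustrative choices.
import numpy as np

def bfp_quantize(x, block_size=64, mantissa_bits=8):
    """Quantize a 1-D array into per-block (mantissas, shared exponent) pairs."""
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    # Shared exponent per block: large enough to cover the block's max magnitude.
    max_mag = np.abs(blocks).max(axis=1, keepdims=True)
    safe_mag = np.maximum(max_mag, np.finfo(np.float32).tiny)
    exponents = np.ceil(np.log2(safe_mag)).astype(np.int32)
    # Scale so mantissas fit in signed `mantissa_bits`-wide integers.
    scale = 2.0 ** (exponents - (mantissa_bits - 1))
    qmax = 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(blocks / scale), -qmax - 1, qmax).astype(np.int32)
    return mantissas, exponents, len(x)

def bfp_dequantize(mantissas, exponents, n, mantissa_bits=8):
    """Reconstruct approximate floating-point values from the encoding."""
    scale = 2.0 ** (exponents - (mantissa_bits - 1))
    return (mantissas * scale).reshape(-1)[:n]

x = np.random.randn(100).astype(np.float32)
m, e, n = bfp_quantize(x)
x_hat = bfp_dequantize(m, e, n)
print("max abs error:", np.abs(x - x_hat).max())
```

Because all values in a block share an exponent, the multiply-accumulate datapath can operate on narrow integer mantissas, which is what makes such representations attractive for reusing fixed-point inference hardware for training.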
