Abstract

Deep neural networks (DNNs) have become an essential tool for tackling challenging tasks in many fields of computer science. However, their high computational complexity limits their applicability. Specialized DNN accelerators have been developed to meet the computational demands of DNN inference, but the mismatch between accelerators and DNN models prevents them from reaching their full potential. In this thesis, we address the mismatch between accelerators and DNN models from both hardware and software perspectives.

First, we investigate one of the most widely used architectures in DNN accelerators, the systolic array, and identify the leading cause of underperformance in DNN inference: dimension mismatches between the array and the DNN layers. We analyze the characteristics of today's popular DNN models, perform an extensive design-space exploration, and propose a novel scale-out systolic array architecture that maximizes the effective throughput (i.e., floating-point operations per second) for a given set of target DNN workloads.

Then, we go beyond what can be achieved with hardware optimization alone and focus on optimizing DNN architectures to improve resource utilization on the target accelerators. To that end, we study differentiable neural architecture search frameworks, which automate the creation of DNN architectures using efficient gradient-descent optimizers. We introduce a computational model for the utilization of systolic arrays and propose a novel utilization-aware neural architecture search framework. The proposed framework creates DNN models with improved resource utilization on target accelerators, which allows DNN inference to be performed faster and more efficiently.

Existing neural architecture search frameworks search for the channel dimensions of a DNN model within a fixed search space, which requires complex search spaces to be designed manually. However, designing a search space is a nontrivial task that requires heuristics and domain expertise, which undermines the effectiveness and practicality of neural architecture search. To eliminate the need to predefine search spaces, we propose a flexible channel masking method that dynamically adjusts the search space based on the progress of the architecture search. We build this method into a differentiable neural architecture search framework that obviates manual search-space design. We demonstrate through extensive experiments that the proposed framework significantly reduces the search time and memory requirements compared to existing neural architecture search frameworks with fixed search spaces.

Overall, this thesis proposes hardware and software co-design techniques to improve the performance of DNN inference. We demonstrate that the proposed scale-out systolic array architecture, combined with DNNs optimized using the proposed neural architecture search frameworks, achieves significantly higher resource utilization. Consequently, the proposed methods enable faster and more efficient DNN inference, improving the effectiveness of DNN applications on resource-constrained platforms.
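
To make the dimension-mismatch problem concrete, the following is a minimal, first-order Python sketch (with hypothetical function and parameter names; it is not the exact computational model developed in the thesis) of how the utilization of a rows x cols systolic array drops when a layer's output-matrix dimensions do not divide the array dimensions evenly.

    # First-order sketch: utilization of a rows x cols systolic array when the
    # last tile in each dimension is only partially filled and the unused
    # processing elements sit idle. Names and tiling assumptions are illustrative.
    import math

    def systolic_utilization(M, N, rows, cols):
        """Fraction of processing elements doing useful work, assuming the
        M x N output matrix is tiled into rows x cols blocks with padding."""
        padded_M = math.ceil(M / rows) * rows
        padded_N = math.ceil(N / cols) * cols
        return (M * N) / (padded_M * padded_N)

    # Example: a layer with 96 output channels mapped onto a 128-wide array
    # leaves a quarter of the columns idle.
    print(systolic_utilization(M=1024, N=96, rows=128, cols=128))  # -> 0.75

Under these assumptions, a 25% mismatch in a single dimension directly costs 25% of the effective throughput, which is the kind of loss the scale-out array design and the utilization-aware search aim to avoid.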
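
The channel masking idea can likewise be illustrated with a small, hypothetical PyTorch sketch: a learnable gate per output channel is relaxed with a sigmoid so that the effective channel count is trained by gradient descent alongside the network weights. The dynamic adjustment of the search space during the search, which is the thesis' contribution, is not shown here; the class and parameter names are illustrative only.

    # Illustrative differentiable channel masking (not the thesis' exact method):
    # one architecture parameter per candidate output channel, relaxed with a
    # sigmoid so the mask is differentiable.
    import torch
    import torch.nn as nn

    class MaskedConv(nn.Module):
        def __init__(self, in_ch, max_out_ch):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, max_out_ch, kernel_size=3, padding=1)
            self.alpha = nn.Parameter(torch.zeros(max_out_ch))  # per-channel gate

        def forward(self, x):
            soft_mask = torch.sigmoid(self.alpha)  # values in (0, 1), differentiable
            return self.conv(x) * soft_mask.view(1, -1, 1, 1)

    x = torch.randn(1, 16, 32, 32)
    layer = MaskedConv(in_ch=16, max_out_ch=64)
    print(layer(x).shape)  # torch.Size([1, 64, 32, 32])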
