Stop Wasting my FLOPS: Improving the Efficiency of Deep Learning Models
Deep neural networks have revolutionized machine learning by achieving state-of-the-art results on tasks ranging from computer vision to protein folding. However, their application is hindered by their large computational and memory requirements. In this thesis, we propose methods for improving the efficiency of deep neural networks.
Firstly, we tackle the sample inefficiency of neural network training with an importance sampling algorithm suitable for deep neural networks. This algorithm allows us to focus computation on the data points that provide informative gradients for training our models and to ignore those whose gradients are negligible. We show that our algorithm can improve the performance of various neural networks compared to uniform sampling under a fixed computational budget.
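As a minimal sketch of the idea, the per-sample loss can stand in for the importance score; the thesis instead derives scores from an upper bound on the gradient norm, and the PyTorch function below is purely illustrative.

    import torch

    def sample_important(losses, batch_size):
        # Sampling distribution proportional to an importance score;
        # here the (non-negative) per-sample loss stands in for the
        # gradient norm used in the thesis.
        probs = losses / losses.sum()
        idx = torch.multinomial(probs, batch_size, replacement=True)
        # Weights 1 / (N * p_i) keep the gradient estimator unbiased.
        weights = 1.0 / (len(losses) * probs[idx])
        return idx, weights

Reweighting each sampled gradient by 1 / (N p_i) makes the average over the sampled points an unbiased estimate of the full-batch gradient.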
Secondly, we design a model that is suitable for processing large input images with a fraction of the computational and memory requirements of traditional approaches. We achieve this by sampling from a data-dependent attention distribution so that only a portion of the input is processed in high resolution. We demonstrate that our model can learn both the attention and the features in an end-to-end fashion using only a single image-level label per image for supervision.
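A rough sketch of the sampling step, assuming the high-resolution patches are already laid out as a tensor and that the attention and feature networks are passed in as callables; all names and shapes are illustrative, and the variance-reduction details of the thesis are omitted.

    import torch

    def attention_sample(low_res_view, high_res_patches, attn_net, feat_net, n_samples):
        # Attention distribution over patch locations, computed on a
        # low-resolution view of the image.
        attn = torch.softmax(attn_net(low_res_view).flatten(), dim=0)
        idx = torch.multinomial(attn, n_samples, replacement=True)
        # Only the sampled patches are processed in high resolution;
        # averaging their features approximates the attention-weighted
        # average over all patches.
        return feat_net(high_res_patches[idx]).mean(dim=0)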
Subsequently, we shift our attention to transformer architectures and introduce a kernelized formulation of self-attention that reduces its complexity from quadratic to linear in the length of the input sequence. Furthermore, we uncover the relationship between autoregressive transformers and recurrent neural networks and show that our formulation enables up to three orders of magnitude faster autoregressive inference.
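The core computation can be sketched as follows, assuming batched tensors of shape (batch, length, dims) and the positive feature map elu(x) + 1; this illustrates the associativity trick for the non-causal case rather than a complete implementation.

    import torch

    def linear_attention(Q, K, V, eps=1e-6):
        # Positive feature map applied to queries and keys.
        Q = torch.nn.functional.elu(Q) + 1
        K = torch.nn.functional.elu(K) + 1
        # Associativity: compute phi(Q) (phi(K)^T V) instead of
        # (phi(Q) phi(K)^T) V, replacing the O(N^2) attention matrix
        # with O(N) running sums.
        KV = torch.einsum("nld,nlm->ndm", K, V)
        Z = 1.0 / (torch.einsum("nld,nd->nl", Q, K.sum(dim=1)) + eps)
        return torch.einsum("nld,ndm,nl->nlm", Q, KV, Z)

In the causal case the same sums can be updated one position at a time, which is the recurrent view of autoregressive transformers alluded to above.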
Finally, we develop clustered attention, a method that approximates softmax transformers with reduced computation. This is achieved by grouping the elements of the input sequence into clusters. We showcase that our formulation provides a better trade-off between performance and computation in comparison to the original transformer architecture. In addition, we demonstrate that clustered attention can approximate pretrained transformer models without any fine-tuning and with minimal loss in performance.
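A simplified sketch of the idea, assuming a single attention head, 2-D tensors, and plain k-means on the queries; the thesis uses a more efficient clustering procedure and further refinements that are omitted here.

    import torch

    def clustered_attention(Q, K, V, n_clusters, n_iters=10):
        # Group the queries into clusters and attend once per centroid.
        N, D = Q.shape
        centroids = Q[torch.randperm(N)[:n_clusters]].clone()
        for _ in range(n_iters):
            assign = torch.cdist(Q, centroids).argmin(dim=1)
            for c in range(n_clusters):
                members = Q[assign == c]
                if len(members) > 0:
                    centroids[c] = members.mean(dim=0)
        # One softmax attention row per centroid instead of per query.
        attn = torch.softmax(centroids @ K.t() / D ** 0.5, dim=-1)
        out = attn @ V
        # Broadcast each centroid's output to the queries in its cluster.
        return out[assign]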