Theory of Deep Learning: Neural Tangent Kernel and Beyond
In recent years, Deep Neural Networks (DNNs) have managed to succeed at tasks that previously appeared impossible, such as human-level object recognition, text synthesis, translation, game playing, and many more. In spite of these major achievements, our understanding of these models, in particular of what happens during their training, remains very limited.
This PhD started with the introduction of the Neural Tangent Kernel (NTK) to describe the evolution of the function represented by the network during training. In the infinite-width limit, i.e. when the number of neurons in the layers of the network grows to infinity, the NTK converges to a deterministic and time-independent limit, leading to a simple yet complete description of the dynamics of infinitely-wide DNNs. This made it possible to give the first general proof of convergence of DNNs to a global minimum, and yielded the first description of the limiting spectrum of the Hessian of the loss surface of DNNs throughout training.
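For concreteness, a standard way of writing the NTK of a network function $f_\theta$ with parameters $\theta_1, \dots, \theta_P$, together with the dynamics it induces under gradient flow on a training cost $C$ over inputs $x_1, \dots, x_N$ (the notation here is chosen for illustration and is not fixed by this abstract), is

```latex
\Theta_{\theta}(x, x') \;=\; \sum_{p=1}^{P} \partial_{\theta_p} f_{\theta}(x)\,\partial_{\theta_p} f_{\theta}(x'),
\qquad
\partial_t f_{\theta(t)}(x) \;=\; -\sum_{i=1}^{N} \Theta_{\theta(t)}(x, x_i)\,\partial_{f(x_i)} C .
```

In the infinite-width limit, $\Theta_{\theta(t)}$ is replaced by its deterministic, time-independent limit, turning the dynamics above into kernel gradient descent with a fixed kernel.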
More importantly, the NTK plays a crucial role in describing the generalization abilities of DNNs, i.e. the performance of the trained network on unseen data. The NTK analysis uncovered a direct link between the function learned by infinitely-wide DNNs and Kernel Ridge Regression (KRR) predictors, whose generalization properties are studied in this thesis using tools of random matrix theory.
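As an illustration of this link, a minimal KRR predictor can be sketched as follows; the RBF kernel below is only a convenient stand-in for the limiting NTK, and every name and parameter here is chosen for the sketch rather than taken from the thesis.

```python
import numpy as np

def krr_predict(K_train, K_test_train, y_train, ridge=1e-3):
    # Kernel Ridge Regression predictor: f(x) = K(x, X) (K(X, X) + ridge * I)^(-1) y.
    # In the NTK analysis, the kernel would be the limiting NTK of the architecture.
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + ridge * np.eye(n), y_train)
    return K_test_train @ alpha

def rbf_kernel(X1, X2, bandwidth=1.0):
    # Placeholder kernel (RBF), used here instead of an actual NTK.
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

rng = np.random.default_rng(0)
X_train, y_train = rng.standard_normal((50, 3)), rng.standard_normal(50)
X_test = rng.standard_normal((10, 3))
predictions = krr_predict(rbf_kernel(X_train, X_train), rbf_kernel(X_test, X_train), y_train)
```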
Our analysis of KRR reveals the importance of the eigendecomposition of the NTK, which is affected by a number of architectural choices. In very deep networks, an ordered regime and a chaotic regime appear, determined by the choice of non-linearity and the balance between the weight and bias parameters; these two phases are characterized by different decay rates of the NTK eigenvalues, leading to a tradeoff between convergence speed and generalization. In practical contexts such as Generative Adversarial Networks or Topology Optimization, the network architecture can be chosen to guarantee certain properties of the NTK and its spectrum.
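To make the role of the spectrum concrete, here is one standard way to express it (a sketch assuming gradient flow on the mean-squared error in the infinite-width limit, up to the normalization of the cost): writing $\tilde\Theta$ for the NTK Gram matrix on the $N$ training inputs, the training residual decays independently along each eigendirection,

```latex
\tilde\Theta \;=\; \sum_{k=1}^{N} \lambda_k\, v_k v_k^{\top},
\qquad
\langle v_k,\, f_t - y \rangle \;=\; e^{-\lambda_k t}\,\langle v_k,\, f_0 - y \rangle ,
```

so that directions associated with large eigenvalues are fitted quickly, while small eigenvalues are fitted slowly; the decay rate of the spectrum therefore controls both how fast training converges and which directions dominate the learned function.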
These results give an almost complete description of DNNs in this infinite-width limit. It is then natural to wonder how this picture extends to the finite-width networks used in practice. In the so-called NTK regime, the discrepancy between finite- and infinite-width DNNs is mainly a result of the variance with respect to the random sampling of the parameters at initialization, as shown empirically and mathematically by relying on the similarity between DNNs and random feature models.
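A minimal experiment in this spirit (a sketch, not code from the thesis): compute the empirical NTK of a one-hidden-layer network at several random initializations and observe that its fluctuations around a fixed kernel shrink as the width grows.

```python
import numpy as np

def empirical_ntk(X, width, rng):
    # Empirical NTK of a one-hidden-layer network f(x) = a^T tanh(W x) / sqrt(width)
    # at a random Gaussian initialization.
    W = rng.standard_normal((width, X.shape[1]))    # hidden-layer weights
    a = rng.standard_normal(width)                  # output-layer weights
    pre = X @ W.T                                   # pre-activations, shape (n, width)
    h, dh = np.tanh(pre), 1.0 - np.tanh(pre) ** 2   # activations and their derivatives
    # sum of the output-layer and hidden-layer gradient contributions
    return (h @ h.T + ((dh * a) @ (dh * a).T) * (X @ X.T)) / width

X = np.random.default_rng(0).standard_normal((20, 5))
for width in (100, 10000):
    ntks = np.stack([empirical_ntk(X, width, np.random.default_rng(s)) for s in range(20)])
    # fluctuations across random initializations shrink as the width grows
    print(width, ntks.std(axis=0).mean())
```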
In contrast to the NTK regime, where the NTK remains constant during training, there exist so-called active regimes, appearing in a number of settings, in which the evolution of the NTK is significant. One such regime appears in Deep Linear Networks with a very small initialization, where the training dynamics approach a sequence of saddle points representing linear maps of increasing rank, leading to a low-rank bias which is absent in the NTK regime.
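The low-rank bias can be illustrated with a small sketch (again not the thesis' code; the depth, target, and step sizes are chosen arbitrarily): a depth-3 linear network with a tiny initialization is trained by gradient descent on a rank-2 target, and the singular values of the end-to-end matrix typically emerge one by one as the dynamics move from saddle to saddle.

```python
import numpy as np
from functools import reduce

# Depth-3 linear network W3 W2 W1 with a tiny initialization, fitting a rank-2 target.
rng = np.random.default_rng(0)
d, depth, init_scale, lr = 10, 3, 3e-3, 0.05
target = np.diag([3.0, 1.0] + [0.0] * (d - 2))    # rank-2 target linear map

def chain(mats):
    # Ordered matrix product of a list of matrices (identity for an empty list).
    return reduce(np.matmul, mats, np.eye(d))

Ws = [init_scale * rng.standard_normal((d, d)) for _ in range(depth)]
for step in range(30001):
    prod = chain(Ws[::-1])                         # end-to-end linear map W_depth ... W_1
    grad_prod = prod - target                      # gradient of 0.5*||prod - target||_F^2 w.r.t. prod
    Ws = [W - lr * chain(Ws[i + 1:][::-1]).T @ grad_prod @ chain(Ws[:i][::-1]).T
          for i, W in enumerate(Ws)]
    if step % 3000 == 0:
        # the leading singular values of the end-to-end map tend to emerge one at a time
        print(step, np.round(np.linalg.svd(prod, compute_uv=False)[:3], 3))
```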