
Abstract

Neural networks (NNs) have been very successful in tasks ranging from machine translation to image classification. Despite this success, the reasons for their performance are still not well understood. This thesis explores two main themes: the loss landscape of NNs and the symmetries present in data. Machine learning consists of training models on data by optimizing the model parameters, and this optimization is done by minimizing a loss function. NNs, a family of machine learning models, are built by composing functions called layers; informally, they can be visualized as a set of interconnected neurons. Over the past decade, NNs have become the most popular machine learning models, and their success has raised many open questions. For example, neural networks and glassy systems both have many degrees of freedom and a highly non-convex objective or energy function, respectively. However, glassy systems get stuck in local minima close to where they are initialized, whereas neural networks avoid getting stuck even when they have hundreds of times more parameters than the number of data points used to train them. (i) What drives this difference in behavior? (ii) How do NNs then avoid becoming too specialized to the training data (overfitting)?

In the first part of this thesis, we show that in classification tasks NNs undergo a jamming transition controlled by the number of parameters $N$. This answers (i): when $N$ is sufficiently large, above a critical value $N^*$, bad local minima are avoided. We then establish a "double-descent" behavior of the test error in classification tasks: it decreases twice as a function of $N$, once below $N^*$ and again above it, converging to its minimum as $N \to \infty$. We answer (ii) by explaining the origin of this double descent. Finally, we introduce a phase diagram that describes the landscape of the loss function and unifies the two limits to which a neural network can converge as $N$ is sent to infinity.

In the second part of this thesis, we address the curse of dimensionality (CD): sampling a $d$-dimensional space requires a number of points $P$ that grows exponentially with $d$, yet NNs perform well even when $P \ll \exp(d)$. Symmetries in the data play a role in this conundrum. For example, images are processed with convolutional NNs (CNNs), which are locally connected and equivariant with respect to translations, i.e., a translation of the input leads to a corresponding translation of the output. Although empirical experience suggests that locality and equivariance contribute to the success of CNNs, it is difficult to understand how: equivariance alone reduces the dimensionality of the data only slightly. Stability toward diffeomorphisms, however, might be the key to beating the CD. We studied how NNs respond to images distorted by diffeomorphisms. Our results suggest that locality and equivariance allow CNNs to develop, during learning, stability toward diffeomorphisms \textit{relative} to other generic transformations. Following this intuition, we created new architectures by extending the properties of CNNs to 3D rotations. Our work contributes to the current understanding of the behavior of neural networks empirically observed by machine learning practitioners, and the architectures developed for 3D rotation problems are being applied to a wide range of domains.
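To make the equivariance property mentioned above concrete, the following minimal sketch (an illustration of the general idea, not code from the thesis) checks numerically that a circular 1D convolution commutes with translations: shifting the input and then convolving gives the same result as convolving and then shifting.

```python
# Minimal sketch of translation equivariance of a convolution.
# Toy example with a circular 1D convolution; not thesis code.
import numpy as np

def conv1d_circular(x, w):
    """Circular 1D convolution (cross-correlation) of signal x with filter w."""
    n, k = len(x), len(w)
    return np.array([sum(w[j] * x[(i + j) % n] for j in range(k)) for i in range(n)])

rng = np.random.default_rng(0)
x = rng.normal(size=32)   # toy 1D "image"
w = rng.normal(size=5)    # convolutional filter
shift = 7                 # translation by 7 pixels

shifted_then_conv = conv1d_circular(np.roll(x, shift), w)
conv_then_shifted = np.roll(conv1d_circular(x, w), shift)

# Equivariance: the two orders of operations agree.
assert np.allclose(shifted_then_conv, conv_then_shifted)
```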
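Similarly, a small diffeomorphism of an image can be approximated by warping it with a smooth random displacement field. The sketch below is only illustrative: the function names and the low-frequency sine parameterization are assumptions made here, not the implementation used in the thesis.

```python
# Sketch: distort an image with a smooth random displacement field
# (an approximate small diffeomorphism) using bilinear interpolation.
import numpy as np
from scipy.ndimage import map_coordinates

def random_smooth_displacement(n, cutoff=3, amplitude=1.0, rng=None):
    """Displacement field built from a few low-frequency sine modes,
    so it is smooth and vanishes on the image boundary."""
    rng = rng or np.random.default_rng()
    ii, jj = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    u = np.zeros((n, n))
    v = np.zeros((n, n))
    for p in range(1, cutoff + 1):
        for q in range(1, cutoff + 1):
            basis = np.sin(np.pi * p * ii / (n - 1)) * np.sin(np.pi * q * jj / (n - 1))
            u += rng.normal() * basis / (p * p + q * q)
            v += rng.normal() * basis / (p * p + q * q)
    return amplitude * u, amplitude * v

def apply_diffeo(img, u, v):
    """Warp img by the displacement field (u, v)."""
    n = img.shape[0]
    ii, jj = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    coords = np.stack([ii + u, jj + v])
    return map_coordinates(img, coords, order=1, mode="nearest")

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32))   # stand-in for an input image
u, v = random_smooth_displacement(32, amplitude=1.0, rng=rng)
warped = apply_diffeo(img, u, v)
print("mean squared distortion:", np.mean((warped - img) ** 2))
```

Comparing how much a trained network's output changes between img and warped, versus between img and a generic perturbation of the same magnitude, gives one way to quantify the \textit{relative} stability toward diffeomorphisms discussed above.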
