How data structures affect generalization in Kernel Methods and Deep Learning
Artificial Intelligence has revolutionized numerous fields, driving advancements in healthcare, finance, and autonomous systems. At the core of this revolution lies Deep Learning---algorithms that learn representations from data through computational layers of artificial neurons.
For deep learning to succeed, it must extract meaningful information from data rather than merely memorizing it. Otherwise, learning a task would require training sets whose size grows exponentially with the data dimension, an impractical scenario known as the curse of dimensionality. Deep networks overcome it by extracting relevant features from the structure of the training data, which enables them to generalize to new test data.
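To make this curse concrete, a textbook scaling argument (not a result of this thesis) goes as follows: learning a generic Lipschitz function of d variables by interpolating nearby training points requires covering the input space, so the achievable test error decays only as a power of the training set size with an exponent that vanishes as d grows.

```latex
% Textbook estimate for learning a generic Lipschitz function in d dimensions
% by local interpolation: covering $[0,1]^d$ at resolution $\epsilon$ takes
% $\epsilon^{-d}$ points, hence
\[
  \text{test error} \;\sim\; n^{-1/d}
  \qquad\Longleftrightarrow\qquad
  n(\epsilon) \;\sim\; \epsilon^{-d},
\]
% i.e. the training set size needed to reach a fixed accuracy grows
% exponentially with the dimension $d$: the curse of dimensionality.
```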
What types of data structure do deep networks learn? How do they use these structures to overcome the curse of dimensionality? To investigate these questions, we employ toy models that abstract features of real-world data and are amenable to theoretical analysis. These simplified models reproduce phenomena observed on real data, offering insight into how deep networks behave and generalize.
For a simple classification task in which data points cluster according to their labels, existing theories predict the performance of deep networks in the regime where they do not learn features. However, these theories are justified for high-dimensional data, and their validity in lower dimensions is unclear. In low dimensions, we identify a crossover between the success and failure of these predictions, which we also observe on real-world data. Nevertheless, on this task deep networks generalize poorly: they remain constrained by the curse of dimensionality.
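To fix ideas, here is a minimal sketch of this setting (illustrative only, not the thesis's experiments): two Gaussian clusters in d dimensions, one per label, classified by an RBF-kernel SVM, which stands in for a deep network in the regime where features are not learned. The dimension, separation, and sample sizes below are arbitrary choices.

```python
# Sketch: kernel classification of data clustered by labels (illustrative only).
import numpy as np
from sklearn.svm import SVC

def make_clusters(n, d, separation=2.0, seed=0):
    """n points per class, unit Gaussian clusters centred at +/- separation/2 along e_1."""
    rng = np.random.default_rng(seed)
    center = np.zeros(d)
    center[0] = separation / 2
    x_pos = rng.normal(loc=+center, scale=1.0, size=(n, d))
    x_neg = rng.normal(loc=-center, scale=1.0, size=(n, d))
    x = np.vstack([x_pos, x_neg])
    y = np.hstack([np.ones(n), -np.ones(n)])
    return x, y

d = 10                                      # data dimension (illustrative value)
x_train, y_train = make_clusters(n=500, d=d, seed=0)
x_test, y_test = make_clusters(n=2000, d=d, seed=1)

clf = SVC(kernel="rbf", gamma=1.0 / d)      # fixed kernel: features are not learned
clf.fit(x_train, y_train)
print("test error:", 1.0 - clf.score(x_test, y_test))
```

Varying d and the training set size in such a sketch is one way to probe how performance depends on dimension in the fixed-feature regime.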
To explain their performance on real data, we consider additional structures that networks can leverage in the feature-learning regime. Our hypothesis is that these structures introduce invariances that enable deep networks to simplify the task. For instance, in image classification, recognizing a dog does not require matching an idealized template; a distorted image is often sufficient. Recent findings suggest that the best-performing networks are those least sensitive to small deformations of their input. We hypothesize that this is because images are composed of local and sparse features, such as edges and patterns, whose exact relative positions are irrelevant for classification. This irrelevance induces invariance to small deformations. To test this, we construct toy models of local and sparse features and quantify how invariance to small deformations develops during training.
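A minimal sketch of such a measurement, under assumptions of our own (a small untrained CNN, and a one-pixel translation as a stand-in for a smooth deformation): we compare how much the network's output changes under the deformation versus under random pixel noise of matched norm. Tracking this ratio across training would quantify how deformation invariance develops.

```python
# Sketch: sensitivity to small deformations vs. random noise (illustrative metric).
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(                        # small untrained CNN, for illustration
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
)

x = torch.randn(64, 1, 28, 28)                # batch of synthetic "images"
x_shift = torch.roll(x, shifts=1, dims=-1)    # one-pixel translation ~ small deformation
delta = x_shift - x

eta = torch.randn_like(x)                     # random noise rescaled to the same norm
eta = eta * delta.norm(dim=(1, 2, 3), keepdim=True) / eta.norm(dim=(1, 2, 3), keepdim=True)
x_noise = x + eta

with torch.no_grad():
    f, f_shift, f_noise = model(x), model(x_shift), model(x_noise)

sensitivity_deform = ((f_shift - f) ** 2).mean()
sensitivity_noise = ((f_noise - f) ** 2).mean()
print("relative sensitivity:", (sensitivity_deform / sensitivity_noise).item())
```

In practice one would compare this ratio at initialization and after training, on real or synthetic images, to see whether invariance to deformations is learned rather than built in.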
To achieve good performance, we hypothesize that deep networks combine these features hierarchically to infer the content of images. For example, recognizing a dog involves assembling edges into limbs and facial components, which in turn combine into the complete animal. Such hierarchical structures admit synonymic variations---for instance, whether the dog's eyes are open or closed---that do not alter the class label.
To explore whether deep networks learn invariances from hierarchical structures, and how these relate to invariance to small deformations, we introduce synthetic datasets exhibiting both invariances. We find that deep networks learn these invariances layer by layer, with the same training set size that is required to learn the task itself. Our theoretical analysis quantifies this sample size and shows how deep networks generalize well by building internal representations that become increasingly insensitive to the variations that leave the task unchanged.
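For intuition, here is a minimal two-level generative sketch of such a dataset (the vocabulary size, number of synonyms, and random rules below are illustrative assumptions, not the thesis's exact construction): each class expands into a tuple of high-level features, and each feature can be realized by several synonymous low-level tuples, so the label is invariant to the choice of synonym.

```python
# Sketch: a two-level hierarchical dataset with synonymic variations (illustrative).
import random

random.seed(0)

num_classes = 2
num_features = 4          # high-level features available at the intermediate level
num_synonyms = 3          # synonymous low-level realisations per feature
tuple_len = 2             # low-level symbols per feature
vocab = list(range(10))   # low-level alphabet

# Level 1: each class is a fixed tuple of high-level features.
class_rules = {c: [random.randrange(num_features) for _ in range(2)]
               for c in range(num_classes)}

# Level 2: each high-level feature has several synonymous low-level tuples.
# (A careful construction would avoid collisions between rules; skipped here.)
feature_rules = {f: [tuple(random.choices(vocab, k=tuple_len))
                     for _ in range(num_synonyms)]
                 for f in range(num_features)}

def sample(label):
    """Generate one input for the given label, picking synonyms at random."""
    symbols = []
    for feature in class_rules[label]:
        symbols.extend(random.choice(feature_rules[feature]))
    return symbols

dataset = [(sample(c), c) for c in range(num_classes) for _ in range(5)]
for x, y in dataset[:4]:
    print(y, x)
```

Exchanging one synonymous tuple for another changes the input but not the label, which is the kind of invariance a network must discover, layer by layer, to learn the task efficiently.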