Deep neural networks have become ubiquitous in today's technological landscape, finding their way into a vast array of applications. Deep supervised learning, which relies on large labeled datasets, has been particularly successful in areas such as image classification. However, the effectiveness of these networks heavily depends on the quality of the data they are trained on. In most practical applications, obtaining high-quality labeled data is expensive, time-consuming, and sometimes even impossible, making the available dataset limited both in size and in quality, as the labeled data may contain noise due to various causes such as human labeling errors. The main objective of this thesis is to develop practical methods for measuring the generalization of deep neural networks in such settings with limited and/or noisy labeled data. We propose novel methods and metrics for estimating generalization, overfitting, and memorization throughout training that are easy to deploy, eliminate the need for a high-quality validation/test set, and make optimal use of the available data.

First, we establish a connection between neural network output \emph{sensitivity} and variance in the bias-variance decomposition of the loss function. Through extensive empirical results, we show that sensitivity is strongly correlated with the test loss and can serve as a promising tool for selecting neural network architectures. We find that sensitivity is particularly effective in identifying the benefits of certain architectural choices, such as convolutional layers. Additionally, we promote sensitivity as a zero-cost metric that can estimate model generalization even before training. Our results show that sensitivity effectively captures the benefits of specific regularization and initialization techniques, such as batch normalization and Xavier parameter initialization.
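To make the idea concrete, the following is a minimal sketch of how output sensitivity could be estimated, assuming it is quantified as the average change in the network's softmax output under small Gaussian input perturbations; the noise scale, the number of repetitions, and the function names are illustrative assumptions rather than the thesis's exact definition.

\begin{verbatim}
# Minimal sketch of an output-sensitivity estimate (PyTorch), assuming
# sensitivity is the average l2 change of the softmax output when small
# Gaussian noise is added to the inputs. `sigma` and `n_repeats` are
# illustrative hyperparameters, not values prescribed by the thesis.
import torch
import torch.nn.functional as F

@torch.no_grad()
def output_sensitivity(model, inputs, sigma=0.01, n_repeats=5):
    """Average l2 change of the softmax output under input noise."""
    model.eval()
    clean = F.softmax(model(inputs), dim=-1)
    total = 0.0
    for _ in range(n_repeats):
        noisy = inputs + sigma * torch.randn_like(inputs)
        perturbed = F.softmax(model(noisy), dim=-1)
        total += (perturbed - clean).norm(dim=-1).mean().item()
    return total / n_repeats
\end{verbatim}

Because this quantity only requires forward passes, it can be evaluated on a randomly initialized model, which is what makes a zero-cost, before-training comparison of architectures possible.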
Second, we introduce \emph{generalization penalty}, which measures how much a gradient step on one mini-batch negatively affects the performance on another mini-batch. From this, we derive a new metric called \emph{gradient disparity} and propose it as an early stopping criterion for deep neural networks trained with mini-batch gradient descent. Our extensive empirical experiments demonstrate that gradient disparity is strongly correlated with the generalization error in state-of-the-art configurations. Moreover, it is efficient to use because of its low computational cost. Gradient disparity even outperforms traditional validation methods such as $k$-fold cross-validation when the available data is limited, because it can use all available samples for training. When the available data has noisy labels, it signals overfitting better than a validation set does.

Third, we propose a metric called \emph{susceptibility} to evaluate neural network robustness against label noise memorization. Susceptibility is easy to compute during training and requires only unlabeled data, making it practical for real-world applications. We demonstrate its effectiveness in tracking memorization through thorough experiments on various architectures and datasets with synthetic and real-world label noise, where it accurately distinguishes models that maintain low memorization on the training set. We also provide theoretical insights into the design of susceptibility as a metric for tracking memorization.
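A minimal sketch of gradient disparity between two mini-batches is given below, assuming it is measured as the $\ell_2$ distance between the two per-batch gradient vectors (averaged over several batch pairs in practice); the function and argument names are illustrative assumptions.

\begin{verbatim}
# Minimal sketch of gradient disparity (PyTorch): the l2 distance between
# the flattened loss gradients computed on two different mini-batches.
import torch

def batch_gradient(model, loss_fn, inputs, targets):
    """Flattened gradient of the loss on one mini-batch."""
    loss = loss_fn(model(inputs), targets)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def gradient_disparity(model, loss_fn, batch_a, batch_b):
    """l2 distance between the gradients of two (inputs, targets) batches."""
    g_a = batch_gradient(model, loss_fn, *batch_a)
    g_b = batch_gradient(model, loss_fn, *batch_b)
    return (g_a - g_b).norm().item()
\end{verbatim}

Since the quantity is computed from the same per-batch gradients that mini-batch training already produces, monitoring it adds little overhead and requires no held-out data, which is what allows all available samples to be used for training.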
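Because susceptibility requires only unlabeled data, one plausible way to probe it is sketched below, under the assumption that it quantifies how strongly a single update on randomly labelled unlabeled samples pulls a copy of the model towards those arbitrary labels; this reading of the abstract, together with all names and hyperparameters, is an assumption rather than the thesis's definition.

\begin{verbatim}
# Heavily hedged sketch of a susceptibility-style probe (PyTorch). It assigns
# random labels to an unlabeled batch, takes one gradient step on a copy of
# the model, and reports how much the loss on those random labels drops.
# A large drop suggests the network readily memorizes arbitrary labels.
import copy
import torch
import torch.nn.functional as F

def susceptibility_probe(model, unlabeled_inputs, num_classes, lr=0.01):
    probe = copy.deepcopy(model)  # leave the trained model untouched
    random_labels = torch.randint(num_classes, (unlabeled_inputs.size(0),))
    optimizer = torch.optim.SGD(probe.parameters(), lr=lr)

    loss = F.cross_entropy(probe(unlabeled_inputs), random_labels)
    before = loss.item()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        after = F.cross_entropy(probe(unlabeled_inputs), random_labels).item()
    return before - after
\end{verbatim}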