On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems
This paper analyzes the trajectories of stochastic gradient descent (SGD) to help understand the algorithm’s convergence properties in non-convex problems. We first show that the sequence of iterates generated by SGD remains bounded and converges with probability 1 under a very broad range of step-size schedules. Subsequently, going beyond existing positive probability guarantees, we show that SGD avoids strict saddle points/manifolds with probability 1 for the entire spectrum of step-size policies considered. Finally, we prove that the algorithm’s rate of convergence to local minimizers with a positive-definite Hessian is $O(1/n^p)$ if the method is employed with a $\Theta(1/n^p)$ step-size. This provides an important guideline for tuning the algorithm’s step-size as it suggests that a cool-down phase with a vanishing step-size could lead to faster convergence; we demonstrate this heuristic using ResNet architectures on CIFAR.
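Below is a minimal illustrative sketch (not the authors' experiment code) of the step-size policy discussed in the abstract: SGD driven by a noisy gradient oracle with a $\Theta(1/n^p)$ schedule, optionally preceded by a constant-step "warm" phase followed by a vanishing-step cool-down. The toy objective, parameter names (`gamma0`, `p`, `cooldown_at`), and noise model are assumptions made for illustration only.

```python
# Sketch: SGD with a Theta(1/n^p) step-size on a toy non-convex objective,
# optionally switching from a constant step to a vanishing "cool-down" step.
# All names and defaults here are illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)

def grad(x):
    # Gradient of f(x) = sum(x_i^4/4 - x_i^2/4): a non-convex function with a
    # strict saddle at the origin and isolated local minimizers at +/- 1/sqrt(2).
    return x**3 - 0.5 * x

def sgd(x0, n_iters=10_000, gamma0=0.1, p=0.75, cooldown_at=None, noise=0.1):
    """Run SGD with step-size gamma_n = gamma0 / n^p (p > 0).
    If `cooldown_at` is given, a constant step gamma0 is used before that
    iteration and the vanishing schedule starts afterwards."""
    x = np.asarray(x0, dtype=float)
    for n in range(1, n_iters + 1):
        if cooldown_at is None or n >= cooldown_at:
            # vanishing phase: Theta(1/n^p) step-size
            gamma = gamma0 / max(n - (cooldown_at or 0) + 1, 1) ** p
        else:
            # warm phase: constant step-size
            gamma = gamma0
        g = grad(x) + noise * rng.standard_normal(x.shape)  # noisy gradient oracle
        x = x - gamma * g
    return x

print(sgd(x0=[0.01, -0.02]))                     # vanishing step from the start
print(sgd(x0=[0.01, -0.02], cooldown_at=5_000))  # constant phase, then cool-down
```

Under the abstract's guideline, the second call mimics a constant-step training run followed by a cool-down phase with a vanishing step-size; the iterate escapes the strict saddle at the origin and settles near a local minimizer.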