Abstract

Stochastic Gradient Descent (SGD) is the workhorse for training large-scale machine learning applications. Although the convergence rate of its deterministic counterpart, Gradient Descent (GD), can provably be accelerated by momentum-based variants such as Heavy Ball (HB) and Nesterov Accelerated Gradient (NAG), local convergence analysis has not been able to show that these modifications yield faster convergence rates in the stochastic setting. This work empirically establishes that a positive momentum coefficient in SGD acts by effectively enlarging the algorithm's learning rate rather than boosting performance per se. In the deep learning setting, however, this enlargement tends to be robust to unfavorable initialization points. Based on these findings, this work derives a heuristic, the Momentum Linear Scaling Rule (MLSR), for transferring hyperparameters from a small-batch to a large-batch setting in deep learning while approximately maintaining the same generalization performance.
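As an illustration of the first finding, the sketch below is not taken from the paper; it assumes the common heuristic that heavy-ball SGD with learning rate lr and momentum coefficient beta behaves, near steady state, like plain SGD with an enlarged step size lr / (1 - beta), and compares the two updates on a noisy quadratic toy problem. The variable names and the toy objective are illustrative choices, not the paper's setup.

```python
# Minimal sketch (assumption, not the paper's method): heavy-ball SGD with
# momentum beta and step size lr vs. plain SGD with "effective" step size
# lr / (1 - beta) on a noisy quadratic f(w) = 0.5 * w^2, so grad(w) = w.
import numpy as np

def sgd_momentum_step(w, v, grad, lr, beta):
    """One heavy-ball update: v <- beta * v + grad; w <- w - lr * v."""
    v = beta * v + grad
    w = w - lr * v
    return w, v

rng = np.random.default_rng(0)
w_mom, v = 2.0, 0.0          # iterate and velocity for momentum SGD
w_plain = 2.0                # iterate for plain SGD with enlarged step size
lr, beta = 0.01, 0.9
lr_eff = lr / (1.0 - beta)   # assumed "enlarged" learning rate

for _ in range(500):
    noise = rng.normal(scale=0.1)        # shared gradient noise for comparison
    w_mom, v = sgd_momentum_step(w_mom, v, w_mom + noise, lr, beta)
    w_plain = w_plain - lr_eff * (w_plain + noise)

print(w_mom, w_plain)  # both trajectories shrink toward 0 at comparable rates
```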
