Don't Use Large Mini-Batches, Use Local SGD

Lin, Tao; Stich, Sebastian Urban; Patel, Kumar Kshitij; Jaggi, Martin

conference paper

2019

Proceedings of the 8th International Conference on Learning Representations

ICLR 2020 8th International Conference on Learning Representations

Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock: models trained with large batches often do not generalize well, i.e., they do not show good accuracy on new data. As a remedy, we propose post-local SGD and show that it significantly improves generalization performance compared to large-batch training on standard benchmarks, while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of local SGD variants.
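
The abstract does not spell out the update rule, so the following is only a minimal sketch of the general idea, under stated assumptions: in local SGD each worker takes H gradient steps on its own data before the worker models are averaged, and post-local SGD is taken here to mean a two-phase schedule that starts with H = 1 (equivalent to ordinary mini-batch SGD) and switches to H > 1 later in training. The toy 1-D least-squares problem and the names `local_sgd`, `post_local_schedule`, `make_shards`, `switch_round`, and `h_late` are hypothetical and chosen for illustration; they are not taken from the paper's code.

```python
import random


def local_sgd(shards, w0, lr, rounds, local_steps, batch_size=4, seed=0):
    """Toy local SGD for 1-D least squares, fitting y ~ w * x.

    shards: one list of (x, y) pairs per worker (hypothetical data layout).
    local_steps: function mapping a round index to H, the number of local
    updates each worker performs before the models are averaged.
    """
    rng = random.Random(seed)
    w = w0
    for r in range(rounds):
        h = local_steps(r)
        local_models = []
        for shard in shards:
            w_k = w  # every worker starts the round from the shared model
            for _ in range(h):
                batch = rng.sample(shard, min(batch_size, len(shard)))
                grad = sum(2 * (w_k * x - y) * x for x, y in batch) / len(batch)
                w_k -= lr * grad
            local_models.append(w_k)
        # Communication step: average the worker models.
        w = sum(local_models) / len(local_models)
    return w


def post_local_schedule(switch_round, h_late):
    """Assumed schedule: H = 1 before switch_round, H = h_late afterwards."""
    return lambda r: 1 if r < switch_round else h_late


def make_shards(num_workers=4, per_worker=32, true_w=3.0, seed=1):
    """Synthetic noiseless data y = true_w * x, split evenly across workers."""
    rng = random.Random(seed)
    shards = []
    for _ in range(num_workers):
        xs = [rng.uniform(-1, 1) for _ in range(per_worker)]
        shards.append([(x, true_w * x) for x in xs])
    return shards


shards = make_shards()
w_final = local_sgd(
    shards, w0=0.0, lr=0.1, rounds=200,
    local_steps=post_local_schedule(switch_round=100, h_late=4),
)
```

With H = 1, averaging the worker models after a single step from a shared starting point is the same as averaging their gradients, so the first phase of this sketch behaves like large-batch SGD. In the second phase the workers communicate once every `h_late` steps instead of every step, which is where the communication saving comes from.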

Use this identifier to reference this record

https://infoscience.epfl.ch/handle/20.500.14299/179476

Name: 1808.07217.pdf
Type: Publisher's version
Access type: openaccess
License Condition: Copyright
Size: 4.02 MB
Format: Adobe PDF
Checksum (MD5): fbac3d0d7c1c71b12d003208cd1ee4fa