Interest in distributed stochastic optimization has risen as complex Machine Learning models are trained with more data on distributed systems. Increasing the computational power speeds up training, but communication between workers becomes a bottleneck that hurts the scale-up of these distributed algorithms. Previous work addressed this issue through quantization, by broadcasting low-precision updates (Seide et al., 2014; Alistarh et al., 2017), or through sparsification, by sharing only the most important coordinates of the update (Aji and Heafield, 2017; Lin et al., 2018). Even though the sparsification method works well in practice, it lacked a theoretical proof until now. We propose a sparsification scheme for SGD where only a small constant number of coordinates are applied at each iteration. The amount of data to be communicated is drastically reduced while the O(1/T) convergence rate of vanilla SGD is preserved. Our concise proof extends to the parallel setting and yields a linear speed-up in the number of workers. We experiment with sparsified SGD with memory in C++ and Python, and show excellent convergence properties for the top-k and random-k operators. Our scheme outperforms QSGD in progress per number of bits sent. It also opens the path to lock-free asynchronous parallelization (e.g., Hogwild!, Niu et al. (2011)) on dense problems, since sparsity of the gradient updates is enforced.
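The scheme described above can be illustrated with a minimal sketch: each step adds the scaled gradient to a residual memory, applies only the k largest-magnitude coordinates, and keeps the remainder in memory for later steps. The function names (`top_k`, `sparsified_sgd_step`) and the quadratic toy objective are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude coordinates of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def sparsified_sgd_step(w, grad, memory, lr, k):
    """One step of sparsified SGD with memory (error feedback).

    Only the k selected coordinates of `update` would be communicated;
    the unapplied residual stays in `memory` and is retried later.
    """
    acc = memory + lr * grad   # residual plus current scaled gradient
    update = top_k(acc, k)     # sparse update: k coordinates, rest zero
    memory = acc - update      # remember what was not applied
    return w - update, memory

# Toy usage: minimize f(w) = ||w||^2 / 2, whose gradient is w itself.
w = np.array([1.0, -2.0, 3.0, 0.5])
memory = np.zeros_like(w)
for _ in range(300):
    w, memory = sparsified_sgd_step(w, w, memory, lr=0.1, k=2)
```

Because the memory re-injects every dropped coordinate, no part of the gradient signal is permanently lost, which is the key ingredient behind the preserved O(1/T) rate.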