ABSTRACT
Nesterov's momentum trick is famous for accelerating gradient descent and has proven useful in building fast iterative algorithms. In the stochastic setting, however, counterexamples prevent Nesterov's momentum from providing a similar acceleration, even when the underlying problem is convex.
We introduce Katyusha, a direct, primal-only stochastic gradient method that fixes this issue. It has a provably accelerated convergence rate in convex (off-line) stochastic optimization. The main ingredient is Katyusha momentum, a novel "negative momentum" on top of Nesterov's momentum that can be incorporated into a variance-reduction-based algorithm to speed it up. Since variance reduction has been successfully applied to a growing list of practical problems, our paper suggests that in each such case one could potentially give Katyusha a hug.
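The abstract describes Katyusha momentum only at a high level. As a concrete illustration, the sketch below shows how a Katyusha-style "negative momentum" toward a snapshot point could be layered on an SVRG-style variance-reduced gradient for a simple least-squares finite sum. The function name, the objective, the epoch length m = 2n, the parameter choices, and the plain (unweighted) snapshot average are illustrative assumptions, not the paper's exact algorithm or tuning.

```python
import numpy as np

def katyusha_sketch(A, b, L, sigma, epochs=20, seed=0):
    """Illustrative Katyusha-style accelerated SVRG for the finite sum
    f(x) = (1/2n) * sum_i (a_i^T x - b_i)^2.  Parameter choices loosely
    follow the strongly convex setting and are placeholders, not the
    paper's official tuning."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m = 2 * n                                  # inner iterations per epoch
    tau2 = 0.5                                 # Katyusha momentum: weight toward the snapshot
    tau1 = min(np.sqrt(n * sigma / (3.0 * L)), 0.5)
    alpha = 1.0 / (3.0 * tau1 * L)             # mirror-descent step size

    def grad_i(x, i):                          # gradient of the i-th component
        return A[i] * (A[i] @ x - b[i])

    def full_grad(x):
        return A.T @ (A @ x - b) / n

    x_tilde = np.zeros(d)                      # snapshot point
    y = np.zeros(d)                            # "gradient descent" sequence
    z = np.zeros(d)                            # "mirror descent" sequence
    for _ in range(epochs):
        mu = full_grad(x_tilde)                # snapshot gradient for variance reduction
        y_sum = np.zeros(d)
        for _ in range(m):
            # Coupling step: Nesterov-style extrapolation plus the extra
            # "negative momentum" pulling the iterate back toward x_tilde.
            x = tau1 * z + tau2 * x_tilde + (1.0 - tau1 - tau2) * y
            i = rng.integers(n)
            g = grad_i(x, i) - grad_i(x_tilde, i) + mu   # variance-reduced gradient estimator
            z = z - alpha * g
            y = x - g / (3.0 * L)
            y_sum += y
        x_tilde = y_sum / m                    # plain average as the next snapshot (simplified)
    return x_tilde
```

In a run, L would be an upper bound on the largest component smoothness (here max_i ||a_i||^2) and sigma a lower bound on the strong-convexity parameter of the objective. Roughly speaking, the tau2 = 0.5 pull toward x_tilde acts as a magnet that keeps each iterate close to the point where the full gradient was last computed, which is what lets acceleration coexist with the noise of stochastic gradients.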
(Footnote 12.) As with all stochastic first-order methods, one can apply Markov's inequality to conclude that, with probability at least 2/3, Katyusha satisfies F(x_S) − F(x^*) ≤ ε within the same stated asymptotic running time.
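For concreteness, the standard argument behind this footnote is a one-line application of Markov's inequality to the nonnegative gap F(x_S) − F(x^*); the ε/3 target below is an illustrative choice, not a constant from the paper:

```latex
\mathbb{E}\!\left[F(x_S) - F(x^*)\right] \le \frac{\varepsilon}{3}
\quad\Longrightarrow\quad
\Pr\!\left[F(x_S) - F(x^*) > \varepsilon\right]
\;\le\; \frac{\mathbb{E}\!\left[F(x_S) - F(x^*)\right]}{\varepsilon}
\;\le\; \frac{1}{3}.
```

Since running the method to expected error ε/3 rather than ε changes the complexity bound only by a constant factor, the high-probability guarantee holds within the same asymptotic running time.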