Research Article · Public Access
DOI: 10.1145/3055399.3055448

Katyusha: the first direct acceleration of stochastic gradient methods

Published: 19 June 2017

ABSTRACT

Nesterov's momentum trick is famously known for accelerating gradient descent, and has been proven useful in building fast iterative algorithms. However, in the stochastic setting, counterexamples exist and prevent Nesterov's momentum from providing similar acceleration, even if the underlying problem is convex.

We introduce Katyusha, a direct, primal-only stochastic gradient method that fixes this issue. It has a provably accelerated convergence rate in convex (off-line) stochastic optimization. The main ingredient is Katyusha momentum, a novel "negative momentum" on top of Nesterov's momentum that can be incorporated into a variance-reduction based algorithm and speed it up. Since variance reduction has been successfully applied to a growing list of practical problems, our paper suggests that in each such case, one could potentially give Katyusha a hug.
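To make the "negative momentum" idea concrete, below is a minimal Python sketch of a Katyusha-style loop for a smooth, sigma-strongly-convex finite-sum objective with no proximal term. The function handles grad_i and full_grad, the helper name katyusha_sketch, the specific constants (tau2 = 1/2, step sizes tied to the smoothness L), and the uniform epoch averaging are illustrative assumptions for this sketch, not a statement of the paper's exact algorithm.

```python
import numpy as np

def katyusha_sketch(grad_i, full_grad, x0, n, L, sigma, epochs=10):
    """Hedged sketch of a Katyusha-style accelerated SVRG loop for a smooth,
    sigma-strongly-convex finite-sum objective (no proximal term).

    grad_i(x, i)  -- gradient of the i-th component at x (assumed callable)
    full_grad(x)  -- full gradient (1/n) * sum_i grad_i(x, i) (assumed callable)
    Constants and the epoch averaging are simplified for illustration.
    """
    m = 2 * n                                        # epoch length
    tau2 = 0.5                                       # Katyusha ("negative") momentum weight on the snapshot
    tau1 = min(np.sqrt(n * sigma / (3.0 * L)), 0.5)  # Nesterov-style momentum weight
    alpha = 1.0 / (3.0 * tau1 * L)                   # step size of the z-sequence

    x_tilde = x0.copy()                              # snapshot point (SVRG anchor)
    y = x0.copy()                                    # gradient-descent sequence
    z = x0.copy()                                    # mirror-descent sequence

    for _ in range(epochs):
        mu = full_grad(x_tilde)                      # full gradient at the snapshot
        y_sum = np.zeros_like(x0)
        for _ in range(m):
            # Linear coupling of three points; the tau2 term pulls the iterate
            # back toward the snapshot -- the "negative momentum" of the abstract.
            x = tau1 * z + tau2 * x_tilde + (1.0 - tau1 - tau2) * y
            i = np.random.randint(n)
            g = mu + grad_i(x, i) - grad_i(x_tilde, i)   # variance-reduced gradient estimator
            z = z - alpha * g                        # mirror (momentum) step
            y = x - g / (3.0 * L)                    # short gradient step
            y_sum += y
        x_tilde = y_sum / m                          # uniform epoch average (a simplification)
    return x_tilde
```

Intuitively, the tau2-weighted pull toward the snapshot x_tilde is what the abstract calls Katyusha momentum: it keeps the accelerated iterate from drifting too far from the point at which the full gradient was last computed, which is what lets acceleration coexist with variance reduction.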


Supplemental Material

d4_sb_t7.mp4 (mp4, 142.9 MB)



Published in

STOC 2017: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing
June 2017, 1268 pages
ISBN: 978-1-4503-4528-6
DOI: 10.1145/3055399

Copyright © 2017 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Publication History

                  • Published: 19 June 2017


                  Acceptance Rates

Overall Acceptance Rate: 1,469 of 4,586 submissions, 32%
