ABSTRACT
Nesterov's momentum trick is famous for accelerating gradient descent and has proven useful in building fast iterative algorithms. In the stochastic setting, however, counterexamples prevent Nesterov's momentum from providing a similar acceleration, even when the underlying problem is convex.
We introduce Katyusha, a direct, primal-only stochastic gradient method that fixes this issue. It has a provably accelerated convergence rate in convex (off-line) stochastic optimization. The main ingredient is Katyusha momentum, a novel "negative momentum" on top of Nesterov's momentum that can be incorporated into a variance-reduction-based algorithm to speed it up. Since variance reduction has been successfully applied to a growing list of practical problems, our paper suggests that in each such case one could potentially give Katyusha a hug.
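The abstract describes Katyusha momentum only at a high level. As a concrete illustration, the sketch below shows how a Katyusha-style "negative momentum" toward a snapshot point could be layered on an SVRG-style variance-reduced gradient for a simple least-squares finite sum. The function name, the objective, the epoch length m = 2n, the parameter choices, and the plain (unweighted) snapshot average are illustrative assumptions, not the paper's exact algorithm or tuning.

```python
import numpy as np

def katyusha_sketch(A, b, L, sigma, epochs=20, seed=0):
    """Illustrative Katyusha-style accelerated SVRG for the finite sum
    f(x) = (1/2n) * sum_i (a_i^T x - b_i)^2.  Parameter choices loosely
    follow the strongly convex setting and are placeholders, not the
    paper's official tuning."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m = 2 * n                                  # inner iterations per epoch
    tau2 = 0.5                                 # Katyusha momentum: weight toward the snapshot
    tau1 = min(np.sqrt(n * sigma / (3.0 * L)), 0.5)
    alpha = 1.0 / (3.0 * tau1 * L)             # mirror-descent step size

    def grad_i(x, i):                          # gradient of the i-th component
        return A[i] * (A[i] @ x - b[i])

    def full_grad(x):
        return A.T @ (A @ x - b) / n

    x_tilde = np.zeros(d)                      # snapshot point
    y = np.zeros(d)                            # "gradient descent" sequence
    z = np.zeros(d)                            # "mirror descent" sequence
    for _ in range(epochs):
        mu = full_grad(x_tilde)                # snapshot gradient for variance reduction
        y_sum = np.zeros(d)
        for _ in range(m):
            # Coupling step: Nesterov-style extrapolation plus the extra
            # "negative momentum" pulling the iterate back toward x_tilde.
            x = tau1 * z + tau2 * x_tilde + (1.0 - tau1 - tau2) * y
            i = rng.integers(n)
            g = grad_i(x, i) - grad_i(x_tilde, i) + mu   # variance-reduced gradient estimator
            z = z - alpha * g
            y = x - g / (3.0 * L)
            y_sum += y
        x_tilde = y_sum / m                    # plain average as the next snapshot (simplified)
    return x_tilde
```

In a run, L would be an upper bound on the largest component smoothness (here max_i ||a_i||^2) and sigma a lower bound on the strong-convexity parameter of the objective. Roughly speaking, the tau2 = 0.5 pull toward x_tilde acts as a magnet that keeps each iterate close to the point where the full gradient was last computed, which is what lets acceleration coexist with the noise of stochastic gradients.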
(Footnote 12.) As with all stochastic first-order methods, one can apply Markov's inequality to conclude that, with probability at least 2/3, Katyusha satisfies F(x_S) − F(x^*) ≤ ε within the same stated asymptotic running time.
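For concreteness, the standard argument behind this footnote is a one-line application of Markov's inequality to the nonnegative gap F(x_S) − F(x^*); the ε/3 target below is an illustrative choice, not a constant from the paper:

```latex
\mathbb{E}\!\left[F(x_S) - F(x^*)\right] \le \frac{\varepsilon}{3}
\quad\Longrightarrow\quad
\Pr\!\left[F(x_S) - F(x^*) > \varepsilon\right]
\;\le\; \frac{\mathbb{E}\!\left[F(x_S) - F(x^*)\right]}{\varepsilon}
\;\le\; \frac{1}{3}.
```

Since running the method to expected error ε/3 rather than ε changes the complexity bound only by a constant factor, the high-probability guarantee holds within the same asymptotic running time.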