Large-scale matrix factorization with distributed stochastic gradient descent

ABSTRACT
We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm. We first develop a novel "stratified" SGD variant (SSGD) that applies to general loss-minimization problems in which the loss function can be expressed as a weighted sum of "stratum losses." We establish sufficient conditions for convergence of SSGD using results from stochastic approximation theory and regenerative process theory. We then specialize SSGD to obtain a new matrix-factorization algorithm, called DSGD, that can be fully distributed and run on web-scale datasets using, e.g., MapReduce. DSGD can handle a wide variety of matrix factorizations. We describe the practical techniques used to optimize performance in our DSGD implementation. Experiments suggest that DSGD converges significantly faster and has better scalability properties than alternative algorithms.