
Linear support vector machines via dual cached loops

Published: 12 August 2012

ABSTRACT

Modern computer hardware offers an elaborate hierarchy of storage subsystems with different speeds, capacities, and costs. Furthermore, processors are now inherently parallel, offering the simultaneous execution of several diverse threads. This paper proposes StreamSVM, the first algorithm for training linear Support Vector Machines (SVMs) that takes advantage of these properties by integrating caching with optimization. StreamSVM works by performing updates in the dual, thus obviating the need to rebalance frequently visited examples. Furthermore, we trade off file I/O against on-the-fly data expansion by generating features on demand. This significantly increases throughput. Experiments show that StreamSVM outperforms other linear SVM solvers, including the award-winning work of [38], by orders of magnitude and produces more accurate solutions within a shorter amount of time.
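The building block behind such dual updates over in-memory examples is the closed-form dual coordinate descent step for the linear SVM of Hsieh et al. [18]. The sketch below is a minimal, hypothetical Python/NumPy illustration of that step applied repeatedly to a cached block of examples; the function name and toy data are assumptions made for illustration and do not reproduce the paper's StreamSVM implementation, which additionally interleaves caching, on-demand feature generation, and multithreaded I/O.

```python
import numpy as np

def dual_cd_epoch(X, y, alpha, w, C=1.0, rng=None):
    """One sweep of dual coordinate descent over the (cached) examples.

    Solves the one-variable dual subproblem for each example in closed form
    (cf. [18]) while keeping w = sum_i alpha_i * y_i * x_i synchronized
    with the dual variables alpha.
    """
    rng = np.random.default_rng() if rng is None else rng
    for i in rng.permutation(len(y)):
        x_i, y_i = X[i], y[i]
        grad = y_i * w.dot(x_i) - 1.0      # gradient of the dual objective in alpha_i
        q_ii = x_i.dot(x_i)                # diagonal entry of Q = (YX)(YX)^T
        if q_ii > 0.0:
            alpha_new = min(max(alpha[i] - grad / q_ii, 0.0), C)
            w += (alpha_new - alpha[i]) * y_i * x_i
            alpha[i] = alpha_new
    return alpha, w

# Toy usage on synthetic data (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))
alpha, w = np.zeros(len(y)), np.zeros(X.shape[1])
for _ in range(10):                        # repeated sweeps over the cached block
    alpha, w = dual_cd_epoch(X, y, alpha, w, C=1.0, rng=rng)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```

Storing the dual variable alpha_i alongside each cached example is what allows the solver to revisit the same example many times without re-reading it from disk, which is precisely the property a caching loop can exploit.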

Supplemental Material

311b_m_talk_5.mp4 (mp4, 121.1 MB)

References

  1. A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. Technical report, 2011. http://arxiv.org/abs/1110.4198.
  2. A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In WSDM, 2012.
  3. G. Bakir, T. Hofmann, B. Schölkopf, A. Smola, B. Taskar, and S. V. N. Vishwanathan. Predicting Structured Data. 2007.
  4. A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. JMLR, 6:1579--1619, September 2005.
  5. L. Bottou. Stochastic gradient SVMs. http://leon.bottou.org/projects/sgd, 2008.
  6. L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In NIPS 20, 2007.
  7. O. Bousquet and A. Elisseeff. Algorithmic stability and generalization performance. In NIPS 12, pages 196--202, 2001.
  8. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
  9. E. Candes and T. Tao. Decoding by linear programming. IEEE Trans. Information Theory, 51(12):4203--4215, 2005.
  10. C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  11. K.-W. Chang and D. Roth. Selective block minimization for faster convergence of limited memory large-scale linear models. In KDD, pages 699--707, 2011.
  12. L. Cheng, S. V. N. Vishwanathan, D. Schuurmans, S. Wang, and T. Caelli. Implicit online learning with kernels. In NIPS 19, 2006.
  13. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273--297, 1995.
  14. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. JMLR, 7:551--585, March 2006.
  15. O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Technical report, 2010. http://arxiv.org/abs/1012.1367.
  16. D. L. Donoho, A. Maleki, and A. Montanari. Message-passing algorithms for compressed sensing. PNAS, 106, 2009.
  17. R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871--1874, August 2008.
  18. C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, pages 408--415, 2008.
  19. T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 169--184, 1999.
  20. T. Joachims. Training linear SVMs in linear time. In KDD, 2006.
  21. J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1--64, January 1997.
  22. J. Kivinen, M. K. Warmuth, and B. Hassibi. The p-norm generalization of the LMS algorithm for adaptive filtering. IEEE Trans. Signal Processing, 54(5):1782--1793, May 2006.
  23. J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. http://arxiv.org/abs/0911.0491, 2009.
  24. Z.-Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7--35, 1992.
  25. G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In NIPS 22, pages 1231--1239, 2009.
  26. N. Murata, S. Yoshizawa, and S. Amari. Network information criterion - determining the number of hidden units for artificial neural network models. IEEE Trans. Neural Networks, 5:865--872, 1994.
  27. E. Osuna and F. Girosi. Reducing the run-time complexity in support vector regression. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 271--284, 1999.
  28. J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185--208, 1999.
  29. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
  30. S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, 2007.
  31. Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, A. Strehl, and S. V. N. Vishwanathan. Hash kernels. In AISTATS, 2009.
  32. A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In VLDB, 2010.
  33. S. Sonnenburg and V. Franc. COFFIN: A computational framework for linear SVMs. In ICML, 2010.
  34. S. Sonnenburg, V. Franc, E. Yom-Tov, and M. Sebag. Pascal large scale learning challenge, 2008. http://largescale.ml.tu-berlin.de/workshop/.
  35. C. H. Teo, S. V. N. Vishwanathan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. JMLR, 11:311--365, January 2010.
  36. M. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity. Technical report, Department of Statistics, UC Berkeley, 2006.
  37. K. Weinberger, A. Dasgupta, J. Attenberg, J. Langford, and A. J. Smola. Feature hashing for large scale multitask learning. In ICML, 2009.
  38. H.-F. Yu, C.-J. Hsieh, K.-W. Chang, and C.-J. Lin. Large linear classification when data cannot fit in memory. In KDD, pages 833--842, 2010.
  39. L. Zanni, T. Serafini, and G. Zanghirati. Parallel software for training large scale support vector machines on multiprocessor systems. JMLR, 7:1467--1492, July 2006.
  40. M. Zinkevich, A. Smola, M. Weimer, and L. Li. Parallelized stochastic gradient descent. In NIPS 23, pages 2595--2603, 2010.

Published in

KDD '12: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2012, 1616 pages
ISBN: 9781450314626
DOI: 10.1145/2339530
Copyright © 2012 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
