XGBoost: A Scalable Tree Boosting System

Authors:
Tianqi Chen

University of Washington, Seattle, WA, USA

University of Washington, Seattle, WA, USA
View Profile

,
Carlos Guestrin

University of Washington, Seattle, WA, USA

University of Washington, Seattle, WA, USA
View Profile

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data MiningAugust 2016Pages 785–794https://doi.org/10.1145/2939672.2939785

Published:13 August 2016Publication History

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 785–794

ABSTRACT

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

Supplemental Material

kdd2016_chen_boosting_system_01-acm.mp4

mp4

396.1 MB

Download

References

R. Bekkerman. The present and the future of the kdd cup competition: an outsider's perspective.Google Scholar
R. Bekkerman, M. Bilenko, and J. Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011. Google ScholarDigital Library
J. Bennett and S. Lanning. The netflix prize. In Proceedings of the KDD Cup Workshop 2007, pages 3--6, New York, Aug. 2007.Google Scholar
L. Breiman. Random forests. Maching Learning, 45(1):5--32, Oct. 2001. Google ScholarDigital Library
C. Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23--581, 2010.Google Scholar
O. Chapelle and Y. Chang. Yahoo! Learning to Rank Challenge Overview. Journal of Machine Learning Research - W & CP, 14:1--24, 2011.Google Scholar
T. Chen, H. Li, Q. Yang, and Y. Yu. General functional matrix factorization using gradient boosting. In Proceeding of 30th International Conference on Machine Learning (ICML'13), volume 1, pages 436--444, 2013.Google Scholar
T. Chen, S. Singh, B. Taskar, and C. Guestrin. Efficient second-order gradient boosting for conditional random fields. In Proceeding of 18th Artificial Intelligence and Statistics Conference (AISTATS'15), volume 1, 2015.Google Scholar
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871--1874, 2008. Google ScholarDigital Library
J. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189--1232, 2001.Google ScholarCross Ref
J. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367--378, 2002. Google ScholarDigital Library
J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337--407, 2000.Google ScholarCross Ref
J. H. Friedman and B. E. Popescu. Importance sampled learning ensembles, 2003.Google Scholar
M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 58--66, 2001. Google ScholarDigital Library
X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, and J. Q. n. Candela. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, ADKDD'14, 2014. Google ScholarDigital Library
P. Li. Robust Logitboost and adaptive base class (ABC) Logitboost. In Proceedings of the Twenty-Sixth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI'10), pages 302--311, 2010.Google ScholarDigital Library
P. Li, Q. Wu, and C. J. Burges. Mcrank: Learning to rank using multiple classification and gradient boosting. In Advances in Neural Information Processing Systems 20, pages 897--904. 2008. Google ScholarDigital Library
X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in apache spark. Journal of Machine Learning Research, 17(34):1--7, 2016. Google ScholarDigital Library
B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. Planet: Massively parallel learning of tree ensembles with mapreduce. Proceeding of VLDB Endowment, 2(2):1426--1437, Aug. 2009. Google ScholarDigital Library
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825--2830, 2011. Google ScholarDigital Library
G. Ridgeway. Generalized Boosted Models: A guide to the gbm package.Google Scholar
S. Tyree, K. Weinberger, K. Agrawal, and J. Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th international conference on World wide web, pages 387--396. ACM, 2011. Google ScholarDigital Library
J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic gradient boosted distributed decision trees. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09. Google ScholarDigital Library
Q. Zhang and W. Wang. A fast algorithm for approximate quantiles in high speed data streams. In Proceedings of the 19th International Conference on Scientific and Statistical Database Management, 2007. Google ScholarDigital Library
T. Zhang and R. Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.Google Scholar

Index Terms

XGBoost: A Scalable Tree Boosting System
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Factorizing YAGO: scalable machine learning for linked data
WWW '12: Proceedings of the 21st international conference on World Wide Web

Vast amounts of structured information have been published in the Semantic Web's Linked Open Data (LOD) cloud and their size is still growing rapidly. Yet, access to this information via reasoning and querying is sometimes difficult, due to LOD's size, ...
Read More
Large-Scale Machine Learning at Verizon: Theory and Applications
KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

This talk will cover recent innovations in large-scale machine learning and their applications on massive, real-world data sets at Verizon. These applications power new revenue generating products and services for the company and are hosted on a massive ...
Read More
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

With massive high-dimensional data now commonplace in research and industry, there is a strong and growing demand for more scalable computational techniques for data analysis and knowledge discovery. Key to turning these data into knowledge is the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016
2176 pages
ISBN:9781450342322
DOI:10.1145/2939672
General Chairs:
Balaji Krishnapuram
IBM
,
Mohak Shah
Bosch
,
Program Chairs:
Alex Smola
Amazon
,
Charu Aggarwal
IBM
,
Dou Shen
Baidu
,
Rajeev Rastogi
Amazon
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 August 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
large-scale machine learning
Qualifiers
- research-article
Conference

Acceptance Rates
KDD '16 Paper Acceptance Rate66of1,115submissions,6%Overall Acceptance Rate1,133of8,635submissions,13%
More
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 17,951
  Total Citations
  View Citations
- 144,835
  Total Downloads
- Downloads (Last 12 months)39,216
- Downloads (Last 6 weeks)6,020
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

XGBoost: A Scalable Tree Boosting System

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

Factorizing YAGO: scalable machine learning for linked data

Large-Scale Machine Learning at Verizon: Theory and Applications

Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices