DOI: 10.1145/3292500.3330858

DeepGBM: A Deep Learning Framework Distilled by GBDT for Online Prediction Tasks

Published: 25 July 2019

ABSTRACT

Online prediction has become one of the most essential tasks in many real-world applications. Typical online prediction tasks share two main characteristics: tabular input space and online data generation. Specifically, tabular input space means the data contain both sparse categorical features and dense numerical ones, while online data generation implies a continuous stream of task-generated data with a potentially dynamic distribution. Consequently, learning effectively from tabular input and adapting quickly to online data generation are two vital challenges in building an online prediction model. Although Gradient Boosting Decision Tree (GBDT) and Neural Network (NN) have been widely used in practice, each has its own weaknesses: GBDT can hardly be adapted to dynamic online data generation and tends to be ineffective on sparse categorical features, while NN struggles to achieve satisfactory performance on dense numerical features. In this paper, we propose a new learning framework, DeepGBM, which integrates the advantages of both NN and GBDT by using two corresponding NN components: (1) CatNN, which focuses on handling sparse categorical features, and (2) GBDT2NN, which focuses on dense numerical features using knowledge distilled from GBDT. Powered by these two components, DeepGBM can leverage both categorical and numerical features while retaining the ability to perform efficient online updates. Comprehensive experiments on a variety of publicly available datasets demonstrate that DeepGBM outperforms well-recognized baselines on various online prediction tasks.
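
To make the two-component design concrete, below is a minimal PyTorch sketch of how a CatNN-style embedding branch and a GBDT2NN-style distilled branch could be combined by a learned weighted sum. This is a simplification under stated assumptions, not the paper's actual architecture: the paper's CatNN models FM-style feature interactions and its GBDT2NN distills the per-tree-group leaf structure of a pre-trained GBDT, whereas both branches here are plain MLPs, and all layer sizes and names beyond CatNN, GBDT2NN, and DeepGBM are hypothetical.

    # Hypothetical illustration only: layer sizes and branch internals are
    # invented for this sketch and are not taken from the paper.
    import torch
    import torch.nn as nn

    class CatNN(nn.Module):
        # Embedding branch for sparse categorical features. The paper's CatNN
        # also models feature interactions (FM-style); an MLP stands in here.
        def __init__(self, cardinalities, emb_dim=8):
            super().__init__()
            self.embs = nn.ModuleList(nn.Embedding(c, emb_dim) for c in cardinalities)
            self.mlp = nn.Sequential(
                nn.Linear(len(cardinalities) * emb_dim, 64),
                nn.ReLU(),
                nn.Linear(64, 1),
            )

        def forward(self, x_cat):
            # x_cat: LongTensor of shape (batch, num_categorical_fields)
            e = torch.cat([emb(x_cat[:, i]) for i, emb in enumerate(self.embs)], dim=1)
            return self.mlp(e)

    class GBDT2NN(nn.Module):
        # Stand-in for the distilled branch: in the paper this network is
        # trained to mimic a pre-trained GBDT's leaf structure over dense
        # numerical features; here it is just an MLP.
        def __init__(self, num_numerical):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(num_numerical, 128),
                nn.ReLU(),
                nn.Linear(128, 1),
            )

        def forward(self, x_num):
            return self.mlp(x_num)

    class DeepGBM(nn.Module):
        # Combine the two branches with learnable scalar weights, keeping the
        # whole model differentiable end to end.
        def __init__(self, cardinalities, num_numerical):
            super().__init__()
            self.cat_nn = CatNN(cardinalities)
            self.gbdt2nn = GBDT2NN(num_numerical)
            self.w = nn.Parameter(torch.ones(2))

        def forward(self, x_cat, x_num):
            return self.w[0] * self.gbdt2nn(x_num) + self.w[1] * self.cat_nn(x_cat)

    # Usage on a toy batch (two categorical fields, five numerical features):
    model = DeepGBM(cardinalities=[100, 50], num_numerical=5)
    x_cat = torch.randint(0, 50, (32, 2))
    x_num = torch.randn(32, 5)
    logits = model(x_cat, x_num)  # shape (32, 1)

Because both components are ordinary differentiable networks, the combined model can keep being updated with mini-batch gradient steps as new data arrives, which is the efficient-online-update property the abstract highlights.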


Published in

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2019
3305 pages
ISBN: 9781450362016
DOI: 10.1145/3292500

          Copyright © 2019 ACM


          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Qualifiers

          • research-article

          Acceptance Rates

KDD '19 paper acceptance rate: 110 of 1,200 submissions (9%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
