ABSTRACT
Many estimation tasks come in groups and hierarchies of related problems. In this paper we propose a hierarchical model and a scalable algorithm for inference in multitask learning. The algorithm infers task correlation and subtask structure in a joint sparse setting. It is implemented with a distributed subgradient oracle and the successive application of prox-operators pertaining to groups and subgroups of variables. We apply this algorithm to conversion optimization in display advertising. Experimental results on over 1 TB of data, with up to 1 billion observations and 1 million attributes, show that the algorithm provides significantly better prediction accuracy while scaling efficiently through distributed parameter synchronization.
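The successive application of prox-operators to subgroups and then groups can be illustrated with a small sketch. It is not the paper's implementation: it assumes l2 group norms, a two-level hierarchy, and a single threshold `lam` shared by all groups, and the function names `group_soft_threshold` and `hierarchical_prox` are hypothetical. For tree-structured groups of this kind, applying block soft-thresholding from the innermost subgroups outward is a known way to evaluate the combined prox-operator.

```python
import math


def group_soft_threshold(w, idx, lam):
    """Block soft-thresholding: shrink the l2 norm of w[idx] by lam, in place."""
    norm = math.sqrt(sum(w[i] ** 2 for i in idx))
    # If the block norm is below lam, the whole block is set to zero.
    scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
    for i in idx:
        w[i] *= scale


def hierarchical_prox(w, groups, lam):
    """Apply the prox to each subgroup first, then to its enclosing group.

    groups: list of (parent_indices, [child_indices, ...]) pairs.
    This layout is an assumption made for illustration only.
    """
    out = list(w)  # work on a copy so the input is left untouched
    for parent, children in groups:
        for child in children:
            group_soft_threshold(out, child, lam)
        group_soft_threshold(out, parent, lam)
    return out


# One group of four weights, split into two subgroups of two weights each.
weights = [3.0, 4.0, 0.1, 0.1]
structure = [([0, 1, 2, 3], [[0, 1], [2, 3]])]
shrunk = hierarchical_prox(weights, structure, lam=1.0)
```

In this example the second subgroup has a norm below the threshold and is removed entirely, while the first subgroup is shrunk twice, once at the subgroup level and once at the group level. This is how a penalty of this form can produce sparsity at both levels of the hierarchy.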
Index Terms
- Scalable hierarchical multitask learning algorithms for conversion optimization in display advertising