Research article · WSDM Conference Proceedings
DOI: 10.1145/2556195.2556264

Scalable hierarchical multitask learning algorithms for conversion optimization in display advertising

Published: 24 February 2014

ABSTRACT

Many estimation tasks come in groups and hierarchies of related problems. In this paper we propose a hierarchical model and a scalable algorithm to perform inference for multitask learning. The algorithm infers task correlations and subtask structure in a joint sparse setting. Implementation relies on a distributed subgradient oracle and the successive application of prox-operators pertaining to groups and subgroups of variables. We apply this algorithm to conversion optimization in display advertising. Experimental results on over 1 TB of data, with up to 1 billion observations and 1 million attributes, show that the algorithm provides significantly better prediction accuracy while scaling efficiently via distributed parameter synchronization.
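The successive application of prox-operators over groups and subgroups admits a compact illustration. The sketch below is not the paper's implementation: it assumes a tree-structured penalty built from ℓ2 group norms with a single shared threshold tau (a real system would use per-group weights and a distributed subgradient oracle), and all names are illustrative.

    import numpy as np

    def prox_group(w, tau):
        # Block soft-thresholding: the prox operator of tau * ||w||_2.
        norm = np.linalg.norm(w)
        if norm <= tau:
            return np.zeros_like(w)
        return (1.0 - tau / norm) * w

    def prox_hierarchical(w, groups, tau):
        # Successively apply group prox operators, with each subgroup
        # processed before the group that contains it (leaf-to-root order).
        w = w.copy()
        for idx in groups:
            w[idx] = prox_group(w[idx], tau)
        return w

    # Toy hierarchy: two subgroups of attributes nested inside one root group.
    w = np.array([0.5, -1.2, 0.05, 0.8])
    groups = [np.array([0, 1]), np.array([2, 3]), np.arange(4)]
    print(prox_hierarchical(w, groups, tau=0.1))

For nested (tree-structured) groups, composing the block soft-thresholding operators from subgroups up to their enclosing groups is known to compute the exact prox of the summed penalty in a single pass, which is what makes the successive application described in the abstract efficient.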


Published in

WSDM '14: Proceedings of the 7th ACM International Conference on Web Search and Data Mining
February 2014 · 712 pages
ISBN: 9781450323512
DOI: 10.1145/2556195
Copyright © 2014 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

WSDM '14 paper acceptance rate: 64 of 355 submissions (18%)
Overall acceptance rate: 498 of 2,863 submissions (17%)
