DOI: 10.1145/3219819.3220007
Research Article | Open Access

Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

Published: 19 July 2018

ABSTRACT

Neural-based multi-task learning has been used successfully in many real-world large-scale applications such as recommendation systems. For example, in movie recommendation, beyond suggesting movies that users are likely to purchase and watch, the system might also optimize for whether users like the movies afterwards. With multi-task learning, we aim to build a single model that learns these multiple goals and tasks simultaneously. However, the prediction quality of commonly used multi-task models is often sensitive to the relationships between tasks. It is therefore important to study the modeling tradeoffs between task-specific objectives and inter-task relationships. In this work, we propose a novel multi-task learning approach, Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data. We adapt the Mixture-of-Experts (MoE) structure to multi-task learning by sharing the expert submodels across all tasks, while also having a gating network trained to optimize each task. To validate our approach on data with different levels of task relatedness, we first apply it to a synthetic dataset where we control the task relatedness. We show that the proposed approach performs better than baseline methods when the tasks are less related. We also show that the MMoE structure yields an additional trainability benefit, depending on the levels of randomness in the training data and the model initialization. Furthermore, we demonstrate the performance improvements of MMoE on real tasks, including a binary classification benchmark and a large-scale content recommendation system at Google.
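To make the architecture described in the abstract concrete, the following is a minimal sketch of an MMoE forward pass, assuming only NumPy and hypothetical layer sizes (a 16-dimensional input, 4 ReLU experts of width 8, and linear towers for 2 tasks). It is not the authors' implementation; it only illustrates how each task's softmax gate produces a task-specific weighting over the shared experts.

# Minimal illustrative sketch of a Multi-gate Mixture-of-Experts (MMoE) forward
# pass. Layer sizes, ReLU experts, and linear task towers are hypothetical
# choices for illustration, not the paper's exact configuration.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

d_in, d_expert, n_experts, n_tasks = 16, 8, 4, 2

# Shared experts: each expert is a small ReLU layer applied to the same input.
W_experts = rng.standard_normal((n_experts, d_in, d_expert)) * 0.1
# One gating network per task: a linear map from the input to expert logits.
W_gates = rng.standard_normal((n_tasks, d_in, n_experts)) * 0.1
# One tower per task, mapping the mixed expert output to a scalar prediction.
W_towers = rng.standard_normal((n_tasks, d_expert)) * 0.1

def mmoe_forward(x):
    """x: (batch, d_in) -> list of per-task predictions, each of shape (batch,)."""
    # Expert outputs, shape (batch, n_experts, d_expert); shared by all tasks.
    expert_out = np.stack(
        [np.maximum(x @ W_experts[i], 0.0) for i in range(n_experts)], axis=1)
    preds = []
    for k in range(n_tasks):
        gate = softmax(x @ W_gates[k])                      # (batch, n_experts)
        mixed = np.einsum("be,bed->bd", gate, expert_out)   # task-specific mixture
        preds.append(mixed @ W_towers[k])                   # (batch,)
    return preds

# Usage: both task heads read the same experts but weight them differently,
# per example, through their own gates.
x = rng.standard_normal((5, d_in))
y_task1, y_task2 = mmoe_forward(x)

Because the gates are learned per task, weakly related tasks can come to rely on largely disjoint subsets of experts, while closely related tasks can share them, which is the mechanism the paper credits for robustness to varying task relatedness.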


Supplemental Material

ma_modeling_relationships.mp4 (MP4, 436.8 MB)


Published in

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN: 9781450355520
DOI: 10.1145/3219819

          Copyright © 2018 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 19 July 2018


          Qualifiers

          • research-article

          Acceptance Rates

KDD '18 Paper Acceptance Rate: 107 of 983 submissions, 11%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
