ABSTRACT
Neural-based multi-task learning has been successfully used in many real-world large-scale applications such as recommendation systems. For example, in movie recommendations, beyond providing users with movies they tend to purchase and watch, the system might also optimize for users liking the movies afterwards. With multi-task learning, we aim to build a single model that learns these multiple goals and tasks simultaneously. However, the prediction quality of commonly used multi-task models is often sensitive to the relationships between tasks. It is therefore important to study the modeling tradeoffs between task-specific objectives and inter-task relationships. In this work, we propose a novel multi-task learning approach, Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data. We adapt the Mixture-of-Experts (MoE) structure to multi-task learning by sharing the expert submodels across all tasks, while also having a gating network trained to optimize each task. To validate our approach on data with different levels of task relatedness, we first apply it to a synthetic dataset where we control the task relatedness. We show that the proposed approach performs better than baseline methods when the tasks are less related. We also show that the MMoE structure yields an additional trainability benefit across different levels of randomness in the training data and model initialization. Furthermore, we demonstrate the performance improvements of MMoE on real tasks, including a binary classification benchmark and a large-scale content recommendation system at Google.
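To make the described structure concrete (shared experts, one gating network per task, and a task-specific tower per task), the following is a minimal NumPy sketch of the forward pass; the layer sizes, single-layer experts, and variable names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper).
d_in, d_expert, n_experts, n_tasks = 16, 8, 4, 2

# Shared experts: each expert is a single dense layer with ReLU.
W_experts = rng.normal(size=(n_experts, d_in, d_expert))

# One gating network per task: a linear map from the input to expert logits.
W_gates = rng.normal(size=(n_tasks, d_in, n_experts))

# Task-specific tower heads producing a scalar prediction per task.
W_towers = rng.normal(size=(n_tasks, d_expert))

x = rng.normal(size=(32, d_in))  # a batch of input feature vectors

# All tasks share the same expert outputs: shape (batch, n_experts, d_expert).
expert_out = np.maximum(np.einsum('bi,eij->bej', x, W_experts), 0.0)

predictions = []
for k in range(n_tasks):
    gate = softmax(x @ W_gates[k])                      # (batch, n_experts) mixture weights for task k
    mixed = np.einsum('be,bej->bj', gate, expert_out)   # per-task weighted sum of expert outputs
    predictions.append(mixed @ W_towers[k])             # task k's scalar output, shape (batch,)

print([p.shape for p in predictions])
```

Because each task learns its own gate, tasks can weight the shared experts differently, which is how the model accommodates loosely related objectives without duplicating all parameters per task.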