ABSTRACT
Neural-based multi-task learning has been successfully used in many real-world large-scale applications such as recommendation systems. For example, in movie recommendations, beyond providing users with movies they tend to purchase and watch, the system might also optimize for users liking the movies afterwards. With multi-task learning, we aim to build a single model that learns these multiple goals and tasks simultaneously. However, the prediction quality of commonly used multi-task models is often sensitive to the relationships between tasks. It is therefore important to study the modeling tradeoffs between task-specific objectives and inter-task relationships. In this work, we propose a novel multi-task learning approach, Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data. We adapt the Mixture-of-Experts (MoE) structure to multi-task learning by sharing the expert submodels across all tasks, while also having a gating network trained to optimize each task. To validate our approach on data with different levels of task relatedness, we first apply it to a synthetic dataset where we control the task relatedness. We show that the proposed approach performs better than baseline methods when the tasks are less related. We also show that the MMoE structure yields an additional trainability benefit across different levels of randomness in the training data and model initialization. Furthermore, we demonstrate the performance improvements of MMoE on real tasks, including a binary classification benchmark and a large-scale content recommendation system at Google.
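To make the described structure concrete (shared experts, one gating network per task, and a task-specific tower per task), the following is a minimal NumPy sketch of the forward pass; the layer sizes, single-layer experts, and variable names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper).
d_in, d_expert, n_experts, n_tasks = 16, 8, 4, 2

# Shared experts: each expert is a single dense layer with ReLU.
W_experts = rng.normal(size=(n_experts, d_in, d_expert))

# One gating network per task: a linear map from the input to expert logits.
W_gates = rng.normal(size=(n_tasks, d_in, n_experts))

# Task-specific tower heads producing a scalar prediction per task.
W_towers = rng.normal(size=(n_tasks, d_expert))

x = rng.normal(size=(32, d_in))  # a batch of input feature vectors

# All tasks share the same expert outputs: shape (batch, n_experts, d_expert).
expert_out = np.maximum(np.einsum('bi,eij->bej', x, W_experts), 0.0)

predictions = []
for k in range(n_tasks):
    gate = softmax(x @ W_gates[k])                      # (batch, n_experts) mixture weights for task k
    mixed = np.einsum('be,bej->bj', gate, expert_out)   # per-task weighted sum of expert outputs
    predictions.append(mixed @ W_towers[k])             # task k's scalar output, shape (batch,)

print([p.shape for p in predictions])
```

Because each task learns its own gate, tasks can weight the shared experts differently, which is how the model accommodates loosely related objectives without duplicating all parameters per task.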