ABSTRACT
Building on textual dialog systems, multimodal dialog systems have recently attracted increasing attention, especially in the retail domain. Despite their commercial value, multimodal dialog systems still face the following challenges: 1) automatically generating the right responses in the appropriate medium forms; 2) jointly considering the visual cues and the side information when selecting product images; and 3) guiding response generation with multi-faceted and heterogeneous knowledge. To address these issues, we present a Multimodal diAloG system with adaptIve deCoders, MAGIC for short. In particular, MAGIC first determines the response type and the corresponding medium form by understanding the intention of the given multimodal context. It then employs adaptive decoders to generate the desired responses: a simple recurrent neural network (RNN) generates general responses; a knowledge-aware RNN decoder encodes the multiform domain knowledge to enrich the response; and a multimodal response decoder incorporates an image recommendation model that jointly considers the textual attributes and the visual images via a neural model optimized with a max-margin loss. We evaluate MAGIC on a benchmark dataset. Experimental results demonstrate that MAGIC outperforms the existing methods and achieves state-of-the-art performance.
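The max-margin objective mentioned for the image recommendation model can be illustrated with a minimal sketch. The abstract does not specify the scoring function or the margin, so the cosine similarity, the default margin of 1.0, and the names `cosine` and `max_margin_loss` below are assumptions made purely for illustration, not the authors' implementation.

```python
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def max_margin_loss(
    context: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    margin: float = 1.0,  # assumed value; not given in the abstract
) -> float:
    """Hinge ranking loss over one positive and several negative images.

    Each negative contributes max(0, margin - s(context, positive)
    + s(context, negative)), so the loss is zero once the positive
    outscores every negative by at least `margin`.
    """
    pos_score = cosine(context, positive)
    return sum(
        max(0.0, margin - pos_score + cosine(context, neg))
        for neg in negatives
    )
```

In this sketch, `context` stands for a hypothetical embedding of the dialog context, while `positive` and `negatives` stand for embeddings of candidate product images that would combine visual features with textual attributes.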
Index Terms
- Multimodal Dialog System: Generating Responses via Adaptive Decoders