DOI: 10.1145/3343031.3350923

research-article

Multimodal Dialog System: Generating Responses via Adaptive Decoders

Published: 15 October 2019

ABSTRACT

Building on the success of textual dialog systems, multimodal dialog systems have recently attracted increasing attention, especially in the retail domain. Despite their commercial value, multimodal dialog systems still face the following challenges: 1) automatically generating the right responses in appropriate medium forms; 2) jointly considering visual cues and side information when selecting product images; and 3) guiding response generation with multi-faceted and heterogeneous knowledge. To address these issues, we present a Multimodal diAloG system with adaptIve deCoders, MAGIC for short. MAGIC first judges the response type and the corresponding medium form by understanding the intention behind the given multimodal context. It then employs adaptive decoders to generate the desired responses: a simple recurrent neural network (RNN) generates general responses; a knowledge-aware RNN decoder encodes multiform domain knowledge to enrich the responses; and a multimodal response decoder incorporates an image recommendation model that jointly considers textual attributes and visual images via a neural model optimized with a max-margin loss. We evaluate MAGIC on a benchmark dataset. Experimental results demonstrate that MAGIC outperforms existing methods and achieves state-of-the-art performance.
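The image-recommendation idea in the abstract can be illustrated with a minimal sketch: each candidate product image is scored against the dialog context by fusing its textual-attribute similarity and its visual similarity, and the scorer is trained so that the ground-truth image outscores negatives by a margin. The fusion weights, the cosine scorer, and the margin value below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_image(context_vec, attr_vec, visual_vec, w_text=0.5, w_vis=0.5):
    """Score a candidate product image against the dialog context by fusing
    textual-attribute and visual similarities (weights are assumptions)."""
    return w_text * cosine(context_vec, attr_vec) + w_vis * cosine(context_vec, visual_vec)

def max_margin_loss(pos_score, neg_scores, margin=0.1):
    """Hinge-style max-margin loss: each negative image should score at
    least `margin` below the positive (ground-truth) image."""
    return float(sum(max(0.0, margin - pos_score + s) for s in neg_scores))
```

In training, `pos_score` would come from `score_image` on the ground-truth image and `neg_scores` from sampled non-matching images; the loss is zero once every negative is separated from the positive by the margin.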


Published in: MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019, 2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031

        Copyright © 2019 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance rates: MM '19 paper acceptance rate: 252 of 936 submissions (27%). Overall acceptance rate: 995 of 4,171 submissions (24%).
