research-article

Multimodal Dialog System: Generating Responses via Adaptive Decoders

Authors:
Liqiang Nie, Shandong University, Qingdao, China
Wenjie Wang, Shandong University, Qingdao, China
Richang Hong, Hefei University of Technology, Hefei, China
Meng Wang, Hefei University of Technology, Hefei, China
Qi Tian, Noah's Ark Lab, Huawei, Shenzhen, China

MM '19: Proceedings of the 27th ACM International Conference on Multimedia, October 2019, Pages 1098–1106. https://doi.org/10.1145/3343031.3350923

Published: 15 October 2019

ABSTRACT

Building on textual dialog systems, multimodal dialog systems have recently attracted increasing attention, especially in the retail domain. Despite their commercial value, they still face the following challenges: 1) automatically generating the right responses in appropriate medium forms; 2) jointly considering the visual cues and the side information while selecting product images; and 3) guiding response generation with multi-faceted and heterogeneous knowledge. To address these issues, we present a Multimodal diAloG system with adaptIve deCoders, MAGIC for short. In particular, MAGIC first judges the response type and the corresponding medium form by understanding the intention of the given multimodal context. It then employs adaptive decoders to generate the desired responses: a simple recurrent neural network (RNN) generates general responses; a knowledge-aware RNN decoder encodes the multiform domain knowledge to enrich the response; and a multimodal response decoder incorporates an image recommendation model that jointly considers the textual attributes and the visual images via a neural model optimized by the max-margin loss. We evaluate MAGIC against existing methods on a benchmark dataset. Experimental results demonstrate that MAGIC outperforms the existing methods and achieves state-of-the-art performance.
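To make the max-margin objective mentioned in the abstract concrete, the following is a minimal sketch of a generic max-margin (hinge) ranking loss for image recommendation. It is not the authors' implementation: the abstract does not specify the scoring function or the margin, so the use of cosine similarity, the default `margin=1.0`, and the names `cosine` and `max_margin_loss` are all assumptions made for illustration. The vectors stand in for a dialog-context representation and for candidate product representations (for example, fused visual and textual-attribute features).

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def max_margin_loss(
    context: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    margin: float = 1.0,  # hypothetical value; not taken from the paper
) -> float:
    """Hinge ranking loss: the positive product should outscore each
    negative product by at least `margin`; violations are summed."""
    pos_score = cosine(context, positive)
    return sum(
        max(0.0, margin - pos_score + cosine(context, neg))
        for neg in negatives
    )


# Usage: a context aligned with the positive item and orthogonal to the
# negative item satisfies the margin, so the loss is zero.
context = [1.0, 0.0]
loss_ok = max_margin_loss(context, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
# Swapping the roles violates the margin and yields a positive loss.
loss_bad = max_margin_loss(context, positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
```

Minimizing a loss of this shape pushes the score of the ground-truth product above the scores of negative products, which is the general idea behind ranking candidates with a max-margin objective.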

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Discourse, dialogue and pragmatics
      2. Natural language generation

Published in
MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031
General Chairs: Laurent Amsaleg (CNRS-IRISA, France), Benoit Huet (EURECOM, France), Martha Larson (Radboud University and TU Delft, Netherlands)
Program Chairs: Guillaume Gravier (CNRS-IRISA, France), Hayley Hung (Delft University of Technology, Netherlands), Chong-Wah Ngo (City University of Hong Kong, Hong Kong), Wei Tsang Ooi (National University of Singapore, Singapore)
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
adaptive decoders
multiform knowledge-aware decoder
multimodal dialog systems
Qualifiers
- research-article
Acceptance Rates
MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%. Overall Acceptance Rate: 995 of 4,171 submissions, 24%.