Research Article
DOI: 10.1145/3331184.3331186

Quantifying and Alleviating the Language Prior Problem in Visual Question Answering

Published: 18 July 2019

ABSTRACT

Benefiting from advances in computer vision, natural language processing, and information retrieval, visual question answering (VQA), which aims to answer questions about an image or a video, has received a lot of attention in recent years. Although some progress has been achieved, several studies have pointed out that current VQA models are heavily affected by the language prior problem: they tend to answer questions based on co-occurrence patterns between question keywords (e.g., "how many") and answers (e.g., "2") rather than by understanding the images and questions. Existing methods attempt to solve this problem either by balancing the biased datasets or by forcing models to better understand the images. However, the first solution yields only marginal improvements, and the second can even degrade performance. In addition, there is no measure for quantifying the extent of the language prior effect, which severely hinders the development of related techniques.

In this paper, we address the above problems from two perspectives. First, we design a metric to quantitatively measure the language prior effect of VQA models; our empirical studies demonstrate its effectiveness. Second, we propose a regularization method, a score regularization module, that alleviates the language prior problem while also boosting the performance of the backbone model. The score regularization module adopts a pair-wise learning strategy, which encourages a VQA model to answer a question by reasoning over the image rather than by exploiting question-answer patterns observed in the biased training set. The module can be flexibly integrated into various VQA models. We conducted extensive experiments on two popular VQA datasets (VQA 1.0 and VQA 2.0) and integrated the score regularization module into three state-of-the-art VQA models. Experimental results show that the module not only effectively reduces the language prior effect of these models but also consistently improves their question answering accuracy.
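
To make the pair-wise idea concrete, below is a minimal sketch (not the paper's exact formulation) of a pairwise margin loss that pushes the answer score computed from the image and question above the score produced by a question-only branch. The function name pairwise_score_regularization, the inputs score_img_q and score_qonly, and the margin value are illustrative assumptions; the sketch is written in PyTorch.

import torch
import torch.nn.functional as F

def pairwise_score_regularization(score_img_q, score_qonly, margin=0.5):
    # Hinge-style pairwise loss (illustrative, not the paper's exact objective):
    # the ground-truth answer score computed from (image, question) should
    # exceed the score assigned by a question-only branch by at least `margin`,
    # discouraging reliance on question-answer co-occurrence alone.
    return F.relu(margin - (score_img_q - score_qonly)).mean()

# Toy usage with per-example scores for the ground-truth answer.
score_img_q = torch.tensor([2.3, 0.9, 1.7])   # scores using image + question
score_qonly = torch.tensor([1.1, 1.5, 0.2])   # scores from question alone
loss = pairwise_score_regularization(score_img_q, score_qonly)
print(loss.item())

In practice, a term of this kind would be added, with a weighting coefficient, to the backbone VQA model's standard classification loss during training.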


Supplemental Material

cite1-13h30-d1.mp4 (MP4, 432.3 MB)


Published in

SIGIR '19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2019, 1512 pages
ISBN: 9781450361729
DOI: 10.1145/3331184

          Copyright © 2019 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGIR '19 paper acceptance rate: 84 of 426 submissions (20%). Overall acceptance rate: 792 of 3,983 submissions (20%).
