DOI: 10.1145/3132847.3132922

Movie Fill in the Blank with Adaptive Temporal Attention and Description Update

Published: 06 November 2017

ABSTRACT

Recently, a new type of video understanding task called Movie Fill-in-the-Blank (MovieFIB) has attracted considerable research attention. Given a movie clip and a description containing one blank word as input, MovieFIB aims to automatically predict the blank word. Because of its strength in processing sequential data, Long Short-Term Memory (LSTM) has been used as a key component in existing MovieFIB methods to generate representations of videos and descriptions. However, most of these methods fail to emphasize the salient parts of videos. To address this problem, in this paper we propose a novel LSTM network called LSTM with Linguistic gate (LSTMwL), which exploits adaptive temporal attention for MovieFIB. Specifically, we first use an LSTM to produce video features, which are then used to update the text representation. Finally, we feed the updated text into two LSTMwL layers running in opposite directions to infer the blank word. Experimental results demonstrate that our approach outperforms state-of-the-art models for MovieFIB.
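Since only the abstract is available here, the following PyTorch sketch is a rough illustration of how such a pipeline might be wired together, not the authors' implementation: an LSTM encodes per-frame CNN features, the resulting video summary updates the word embeddings of the description (the "description update" step), and a bidirectional LSTM whose blank-position state gates a temporal attention over the video (standing in for the paper's LSTMwL linguistic gate) scores candidate words for the blank. All module names, dimensions, and the exact gating formula are assumptions.

```python
# Hypothetical sketch of a MovieFIB pipeline (not the authors' code).
# Frame features -> video LSTM -> description update -> bidirectional
# text LSTM with a gated temporal attention -> blank-word classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MovieFIBSketch(nn.Module):
    def __init__(self, vocab_size, word_dim=300, frame_dim=2048, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        # Step 1: temporal encoder over per-frame CNN features.
        self.video_lstm = nn.LSTM(frame_dim, hidden, batch_first=True)
        # Step 2: "description update" - fuse the video summary into each word embedding.
        self.update = nn.Linear(word_dim + hidden, word_dim)
        # Step 3: two opposite-direction LSTMs over the updated description.
        self.text_lstm = nn.LSTM(word_dim, hidden, batch_first=True, bidirectional=True)
        # Assumed form of the linguistic gate: the text state at the blank
        # decides how much attended video information to use.
        self.gate = nn.Linear(2 * hidden, 1)
        self.att = nn.Linear(2 * hidden + hidden, 1)
        self.classifier = nn.Linear(2 * hidden + hidden, vocab_size)

    def forward(self, frames, words, blank_pos):
        # frames: (B, T, frame_dim); words: (B, L) token ids; blank_pos: (B,) blank index.
        video_states, _ = self.video_lstm(frames)            # (B, T, hidden)
        video_summary = video_states.mean(dim=1)              # (B, hidden)

        emb = self.embed(words)                                # (B, L, word_dim)
        summary = video_summary.unsqueeze(1).expand(-1, emb.size(1), -1)
        emb = torch.tanh(self.update(torch.cat([emb, summary], dim=-1)))

        text_states, _ = self.text_lstm(emb)                   # (B, L, 2*hidden)
        idx = blank_pos.view(-1, 1, 1).expand(-1, 1, text_states.size(-1))
        blank_state = text_states.gather(1, idx).squeeze(1)    # (B, 2*hidden)

        # Adaptive temporal attention over frames, gated by the blank's text state.
        query = blank_state.unsqueeze(1).expand(-1, video_states.size(1), -1)
        scores = self.att(torch.cat([query, video_states], dim=-1)).squeeze(-1)   # (B, T)
        attended = (F.softmax(scores, dim=1).unsqueeze(-1) * video_states).sum(1) # (B, hidden)
        g = torch.sigmoid(self.gate(blank_state))              # (B, 1) linguistic gate
        fused = torch.cat([blank_state, g * attended], dim=-1)
        return self.classifier(fused)                          # logits over candidate blank words
```

In the paper the attention and gate appear to live inside the LSTMwL cell itself; here they are applied on top of a standard bidirectional LSTM purely to keep the sketch short and runnable.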



Published in

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
November 2017, 2604 pages
ISBN: 9781450349185
DOI: 10.1145/3132847

          Copyright © 2017 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 6 November 2017


          Qualifiers

          • research-article

          Acceptance Rates

CIKM '17 paper acceptance rate: 171 of 855 submissions (20%). Overall acceptance rate: 1,861 of 8,427 submissions (22%).

