ABSTRACT
Recently, a new video understanding task called Movie Fill-in-the-Blank (MovieFIB) has attracted considerable research attention. Given a movie clip paired with a description containing one blank word, MovieFIB aims to predict the missing word automatically. Owing to its strength in processing sequential data, Long Short-Term Memory (LSTM) has served as a key component in existing MovieFIB methods for generating representations of videos and descriptions. However, most of these methods fail to emphasize the salient parts of videos. To address this problem, we propose a novel LSTM network, LSTM with a Linguistic gate (LSTMwL), which exploits adaptive temporal attention for MovieFIB. Specifically, we first use an LSTM to produce video features, which are then used to update the text representation. Finally, we feed the updated text into two LSTMwL layers of opposite directions to infer the blank word. Experimental results demonstrate that our approach outperforms state-of-the-art models for MovieFIB.
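The pipeline above (video features attended over time, then blended with the text representation through a linguistic gate) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function name `adaptive_temporal_attention`, the dot-product scoring, and the scalar gating formulation are all assumptions made for clarity.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_temporal_attention(frames, query, gate):
    """Hypothetical sketch of adaptive temporal attention with a
    linguistic gate.

    frames: (T, d) per-frame video features (e.g., LSTM outputs)
    query:  (d,)   text representation at the blank position
    gate:   scalar in [0, 1]; 1 relies on text only, 0 on video only
    """
    scores = frames @ query            # (T,) frame relevance scores
    weights = softmax(scores)          # temporal attention weights
    context = weights @ frames         # (d,) attended video context
    # Gate decides how much visual context updates the text query
    return gate * query + (1.0 - gate) * context

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))      # 8 frames, 16-dim features
query = rng.normal(size=16)
fused = adaptive_temporal_attention(frames, query, gate=0.5)
print(fused.shape)                     # (16,)
```

In a full model, the fused representation would be fed to the two opposite-direction LSTMwL layers, whose outputs score candidate words for the blank.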