ABSTRACT
Recently, a new video understanding task called Movie Fill-in-the-Blank (MovieFIB) has attracted considerable research attention. Given a movie clip paired with a description containing one blank word, MovieFIB aims to predict the missing word automatically. Owing to its strength in processing sequential data, Long Short-Term Memory (LSTM) has served as a key component in existing MovieFIB methods for generating representations of videos and descriptions. However, most of these methods fail to emphasize the salient parts of videos. To address this problem, we propose a novel LSTM network, LSTM with a Linguistic gate (LSTMwL), which exploits adaptive temporal attention for MovieFIB. Specifically, we first use an LSTM to produce video features, which are then used to update the text representation. Finally, we feed the updated text into two LSTMwL layers of opposite directions to infer the blank word. Experimental results demonstrate that our approach outperforms state-of-the-art models for MovieFIB.
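The pipeline above (video features attended over time, then blended with the text representation through a linguistic gate) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function name `adaptive_temporal_attention`, the dot-product scoring, and the scalar gating formulation are all assumptions made for clarity.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_temporal_attention(frames, query, gate):
    """Hypothetical sketch of adaptive temporal attention with a
    linguistic gate.

    frames: (T, d) per-frame video features (e.g., LSTM outputs)
    query:  (d,)   text representation at the blank position
    gate:   scalar in [0, 1]; 1 relies on text only, 0 on video only
    """
    scores = frames @ query            # (T,) frame relevance scores
    weights = softmax(scores)          # temporal attention weights
    context = weights @ frames         # (d,) attended video context
    # Gate decides how much visual context updates the text query
    return gate * query + (1.0 - gate) * context

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))      # 8 frames, 16-dim features
query = rng.normal(size=16)
fused = adaptive_temporal_attention(frames, query, gate=0.5)
print(fused.shape)                     # (16,)
```

In a full model, the fused representation would be fed to the two opposite-direction LSTMwL layers, whose outputs score candidate words for the blank.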