Attentive Moment Retrieval in Videos

ABSTRACT
In the past few years, language-based video retrieval has attracted considerable attention. However, its natural extension, localizing a specific moment within a video given a description query, has seldom been explored. Although the two tasks look similar, the latter is more challenging for two main reasons. 1) The former only needs to judge whether the query occurs in a video and return the entire video, whereas the latter must determine which moment within the video matches the query and accurately return the start and end points of that moment. Because different moments in a video have varying durations and diverse spatio-temporal characteristics, uncovering the target moment is highly challenging. 2) Regarding the key component of relevance estimation, the former usually embeds the video and the query into a common space to compute a relevance score. The latter, however, concerns moment localization, where not only the features of the specific moment matter but its contextual information also contributes substantially. For example, the query may contain temporal constraint words, such as "first", which require temporal context to be properly comprehended. To address these issues, we develop an Attentive Cross-Modal Retrieval Network. In particular, we design a memory attention mechanism that emphasizes the visual features mentioned in the query while incorporating their context, yielding an augmented moment representation. Meanwhile, a cross-modal fusion sub-network learns both intra-modality and inter-modality dynamics, which enhances the learned moment-query representation. We evaluate our method on two datasets, DiDeMo and TACoS. Extensive experiments demonstrate the effectiveness of our model compared to state-of-the-art methods.
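The two components named in the abstract can be illustrated at a very high level. The following is a minimal NumPy sketch under our own assumptions: the function names, feature dimensions, and the concatenation-plus-elementwise-product fusion are illustrative stand-ins, not the paper's actual architecture or parameters.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_moment(query_vec, clip_feats):
    """Query-guided attention over a moment's clip features plus its
    temporal context: score each clip against the query, then return the
    attention-weighted sum as an augmented moment representation."""
    scores = clip_feats @ query_vec   # (T,) dot-product relevance per clip
    weights = softmax(scores)         # attention distribution over clips
    return weights @ clip_feats       # (D,) weighted average of clip features

def fuse(moment_vec, query_vec):
    """Toy cross-modal fusion: keep each modality's own features
    (intra-modality) alongside an elementwise product (inter-modality)."""
    return np.concatenate([moment_vec, query_vec, moment_vec * query_vec])

rng = np.random.default_rng(0)
q = rng.standard_normal(8)            # encoded query (e.g., an RNN state)
clips = rng.standard_normal((5, 8))   # 5 clip features: moment + context
m = attend_moment(q, clips)
joint = fuse(m, q)
print(m.shape, joint.shape)           # (8,) (24,)
```

The joint representation would then feed a scoring layer that estimates moment-query relevance; that layer, and the actual attention parameterization, are omitted here.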