Attentive Moment Retrieval in Videos

ABSTRACT
In the past few years, language-based video retrieval has attracted considerable attention. However, its natural extension, localizing a specific moment within a video given a description query, has seldom been explored. Although the two tasks look similar, the latter is more challenging for two main reasons. 1) The former only needs to judge whether the query occurs in a video and return the entire video, whereas the latter must determine which moment within the video matches the query and accurately return the start and end points of that moment. Because different moments in a video have varying durations and diverse spatio-temporal characteristics, uncovering the target moment is highly challenging. 2) Regarding the key component of relevance estimation, the former usually embeds the video and the query into a common space to compute a relevance score. The latter, however, concerns moment localization, where not only the features of the specific moment matter but its contextual information also contributes substantially. For example, the query may contain temporal constraint words, such as "first", which require temporal context to be properly comprehended. To address these issues, we develop an Attentive Cross-Modal Retrieval Network. In particular, we design a memory attention mechanism that emphasizes the visual features mentioned in the query while incorporating their context, yielding an augmented moment representation. Meanwhile, a cross-modal fusion sub-network learns both intra-modality and inter-modality dynamics, which enhances the learned moment-query representation. We evaluate our method on two datasets, DiDeMo and TACoS. Extensive experiments demonstrate the effectiveness of our model compared to state-of-the-art methods.
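The two components named in the abstract can be illustrated at a very high level. The following is a minimal NumPy sketch under our own assumptions: the function names, feature dimensions, and the concatenation-plus-elementwise-product fusion are illustrative stand-ins, not the paper's actual architecture or parameters.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_moment(query_vec, clip_feats):
    """Query-guided attention over a moment's clip features plus its
    temporal context: score each clip against the query, then return the
    attention-weighted sum as an augmented moment representation."""
    scores = clip_feats @ query_vec   # (T,) dot-product relevance per clip
    weights = softmax(scores)         # attention distribution over clips
    return weights @ clip_feats       # (D,) weighted average of clip features

def fuse(moment_vec, query_vec):
    """Toy cross-modal fusion: keep each modality's own features
    (intra-modality) alongside an elementwise product (inter-modality)."""
    return np.concatenate([moment_vec, query_vec, moment_vec * query_vec])

rng = np.random.default_rng(0)
q = rng.standard_normal(8)            # encoded query (e.g., an RNN state)
clips = rng.standard_normal((5, 8))   # 5 clip features: moment + context
m = attend_moment(q, clips)
joint = fuse(m, q)
print(m.shape, joint.shape)           # (8,) (24,)
```

The joint representation would then feed a scoring layer that estimates moment-query relevance; that layer, and the actual attention parameterization, are omitted here.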