Research article
DOI: 10.1145/3209978.3210003

Attentive Moment Retrieval in Videos

Published: 27 June 2018

ABSTRACT

In the past few years, language-based video retrieval has attracted considerable attention. However, its natural extension, localizing a specific moment within a video given a description query, has seldom been explored. Although the two tasks look similar, the latter is more challenging for two main reasons: 1) The former only needs to judge whether the query occurs in a video and return the entire video, whereas the latter must judge which moment within the video matches the query and accurately return the start and end points of that moment. Because different moments in a video have varying durations and diverse spatial-temporal characteristics, uncovering the underlying moments is highly challenging. 2) Regarding the key component of relevance estimation, the former usually embeds the video and the query into a common space to compute a relevance score. The latter task, however, concerns moment localization, where not only the features of a specific moment matter but also its context information contributes substantially. For example, a query may contain temporal constraint words such as "first", which require temporal context to be properly comprehended. To address these issues, we develop an Attentive Cross-Modal Retrieval Network. In particular, we design a memory attention mechanism that emphasizes the visual features mentioned in the query and simultaneously incorporates their context, yielding an augmented moment representation. Meanwhile, a cross-modal fusion sub-network learns both the intra-modality and inter-modality dynamics, which enhances the learning of the moment-query representation. We evaluate our method on two datasets, DiDeMo and TACoS, and extensive experiments demonstrate the effectiveness of our model compared to state-of-the-art methods.
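To make the two components concrete, below is a minimal PyTorch sketch of a memory-style attention over context moments and a cross-modal fusion scorer. The module names, dimensionalities, and the particular attention and fusion forms (additive attention; element-wise product fusion) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAttention(nn.Module):
    """Illustrative memory attention: weight context moments by their
    relevance to the query, then augment the candidate moment's feature
    with the attended context (an assumed form, not the paper's exact one)."""
    def __init__(self, visual_dim, query_dim, hidden_dim):
        super().__init__()
        self.v_proj = nn.Linear(visual_dim, hidden_dim)
        self.q_proj = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, moment, context, query):
        # moment:  (B, Dv)    feature of the candidate moment
        # context: (B, T, Dv) features of the surrounding moments
        # query:   (B, Dq)    sentence embedding of the description query
        keys = torch.tanh(self.v_proj(context) + self.q_proj(query).unsqueeze(1))
        alpha = F.softmax(self.score(keys).squeeze(-1), dim=1)  # (B, T) attention weights
        attended = (alpha.unsqueeze(-1) * context).sum(dim=1)   # (B, Dv) attended context
        return torch.cat([moment, attended], dim=-1)            # (B, 2*Dv) augmented moment

class CrossModalFusion(nn.Module):
    """Illustrative cross-modal fusion: per-modality MLPs capture
    intra-modality dynamics; an element-wise product captures
    inter-modality dynamics before relevance scoring."""
    def __init__(self, moment_dim, query_dim, hidden_dim):
        super().__init__()
        self.m_mlp = nn.Sequential(nn.Linear(moment_dim, hidden_dim), nn.ReLU())
        self.q_mlp = nn.Sequential(nn.Linear(query_dim, hidden_dim), nn.ReLU())
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, moment, query):
        m, q = self.m_mlp(moment), self.q_mlp(query)  # intra-modality dynamics
        return self.out(m * q).squeeze(-1)            # inter-modality interaction -> score

# Hypothetical usage: score a batch of candidate moments against the query.
# att = MemoryAttention(visual_dim=500, query_dim=300, hidden_dim=256)
# fusion = CrossModalFusion(moment_dim=1000, query_dim=300, hidden_dim=256)
# scores = fusion(att(moment, context, query), query)  # (B,)
```

At retrieval time, every candidate moment in a video would be scored this way, and the top-scoring moment determines the returned start and end points.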


Published in

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
June 2018, 1509 pages
ISBN: 9781450356572
DOI: 10.1145/3209978
Copyright © 2018 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGIR '18 paper acceptance rate: 86 of 409 submissions (21%)
Overall acceptance rate: 792 of 3,983 submissions (20%)
