ABSTRACT
In this paper, we address the problem of temporal moment localization: localizing the video moment described by a natural language query in an untrimmed video. This is a general yet challenging vision-language task, since it requires not only localizing moments but also comprehending the multimodal textual-temporal information (e.g., "first" and "leaving") that distinguishes the desired moment from others, especially those with similar visual content. Whereas existing studies treat a given language query as a single unit, we propose to decompose it into two components: the cue relevant to localizing the desired moment and the part irrelevant to localization. This decomposition allows our end-to-end framework to adapt flexibly to arbitrary queries. In the proposed model, a language-temporal attention network learns word-level attention based on the temporal context information in the video, so the model can automatically select "what words to listen to" when localizing the desired moment. We evaluate the proposed model on two public benchmark datasets, DiDeMo and Charades-STA, and the experimental results verify its superiority over several state-of-the-art methods.
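To make the core idea concrete, the snippet below is a minimal sketch of language-temporal attention: each query word is scored against the temporal context feature of a candidate moment, and the word embeddings are pooled by the resulting attention weights. This is an illustrative simplification, not the authors' implementation; the bilinear (dot-product) score stands in for whatever learned scoring network the paper actually uses, and all shapes and names here are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def language_temporal_attention(word_embs, temporal_ctx):
    """Attend over query words conditioned on a moment's temporal context.

    word_embs:    (L, d) array, one embedding per query word
    temporal_ctx: (d,) array, visual-temporal feature of a candidate moment
    Returns the per-word attention weights and the attended query vector.
    """
    scores = word_embs @ temporal_ctx   # (L,) relevance of each word to this moment
    weights = softmax(scores)           # "what words to listen to"
    pooled = weights @ word_embs        # (d,) attention-pooled query representation
    return weights, pooled

rng = np.random.default_rng(0)
words = rng.normal(size=(6, 8))   # e.g., a 6-word query with 8-dim embeddings
ctx = rng.normal(size=8)          # temporal context of one candidate moment
w, q = language_temporal_attention(words, ctx)
```

With a different `temporal_ctx` the same query yields different word weights, which is the mechanism that lets temporal words such as "first" dominate for some candidate moments and be down-weighted for others.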