ABSTRACT
In this paper, we address the problem of temporal moment localization: localizing the video moment described by a natural language query in an untrimmed video. This is a general yet challenging vision-language task, since it requires not only localizing moments but also comprehending the multimodal textual-temporal information (e.g., "first" and "leaving") that distinguishes the desired moment from others, especially those with similar visual content. Whereas existing studies treat a given language query as a single unit, we propose to decompose it into two components: the cue relevant to localizing the desired moment and the part irrelevant to localization. This decomposition allows our end-to-end framework to adapt flexibly to arbitrary queries. In the proposed model, a language-temporal attention network learns word-level attention based on the temporal context information in the video, so the model can automatically select "what words to listen to" when localizing the desired moment. We evaluate the proposed model on two public benchmark datasets, DiDeMo and Charades-STA, and the experimental results verify its superiority over several state-of-the-art methods.
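To make the core idea concrete, the snippet below is a minimal sketch of language-temporal attention: each query word is scored against the temporal context feature of a candidate moment, and the word embeddings are pooled by the resulting attention weights. This is an illustrative simplification, not the authors' implementation; the bilinear (dot-product) score stands in for whatever learned scoring network the paper actually uses, and all shapes and names here are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def language_temporal_attention(word_embs, temporal_ctx):
    """Attend over query words conditioned on a moment's temporal context.

    word_embs:    (L, d) array, one embedding per query word
    temporal_ctx: (d,) array, visual-temporal feature of a candidate moment
    Returns the per-word attention weights and the attended query vector.
    """
    scores = word_embs @ temporal_ctx   # (L,) relevance of each word to this moment
    weights = softmax(scores)           # "what words to listen to"
    pooled = weights @ word_embs        # (d,) attention-pooled query representation
    return weights, pooled

rng = np.random.default_rng(0)
words = rng.normal(size=(6, 8))   # e.g., a 6-word query with 8-dim embeddings
ctx = rng.normal(size=8)          # temporal context of one candidate moment
w, q = language_temporal_attention(words, ctx)
```

With a different `temporal_ctx` the same query yields different word weights, which is the mechanism that lets temporal words such as "first" dominate for some candidate moments and be down-weighted for others.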