DOI: 10.1145/3240508.3240549

Cross-modal Moment Localization in Videos

Published: 15 October 2018

ABSTRACT

In this paper, we address the problem of temporal moment localization: localizing, in an untrimmed video, the moment described by a natural language query. This is a general yet challenging vision-language task, since it requires not only localizing moments but also comprehending multimodal textual-temporal information (e.g., "first" and "leaving") that distinguishes the desired moment from others, especially those with similar visual content. While existing studies treat a given language query as a single unit, we propose to decompose it into two components: a cue relevant to localizing the desired moment and an irrelevant one that is uninformative for localization. This allows our model to flexibly adapt to arbitrary queries in an end-to-end framework. In the proposed model, a language-temporal attention network learns word-level attention based on the temporal context information in the video, so the model can automatically select "what words to listen to" when localizing the desired moment. We evaluate the proposed model on two public benchmark datasets, DiDeMo and Charades-STA, and the experimental results verify its superiority over several state-of-the-art methods.
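The abstract does not spell out the attention formulation, so the following is only a minimal sketch of one standard way to implement word-level attention conditioned on a moment's temporal context, in the spirit of the language-temporal attention network described above. All names, shapes, and the additive (Bahdanau-style) scoring function are assumptions for illustration, not the authors' exact method.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def language_temporal_attention(word_feats, temporal_ctx, W_w, W_v, v):
    """Score each query word against the temporal context of a candidate
    moment and return the attention-weighted query representation.

    word_feats:   (num_words, d_w) word embeddings of the query
    temporal_ctx: (d_v,) feature of the candidate moment's temporal context
    W_w, W_v, v:  learned projection parameters (hypothetical shapes)
    """
    # Additive attention: one scalar score per word, conditioned on the
    # moment's temporal context via broadcasting.
    scores = np.tanh(word_feats @ W_w + temporal_ctx @ W_v) @ v  # (num_words,)
    alpha = softmax(scores)                                      # word weights
    # The weighted sum emphasizes localization-relevant words
    # (e.g., "first", "leaving") and down-weights irrelevant ones.
    return alpha @ word_feats, alpha

# Toy usage with random features (all sizes hypothetical):
rng = np.random.default_rng(0)
n_words, d_w, d_v, d_h = 6, 300, 500, 128
query_vec, alpha = language_temporal_attention(
    rng.standard_normal((n_words, d_w)),   # e.g., GloVe word vectors
    rng.standard_normal(d_v),              # e.g., pooled visual context feature
    rng.standard_normal((d_w, d_h)),
    rng.standard_normal((d_v, d_h)),
    rng.standard_normal(d_h),
)
```

In a full model, the attended query vector would then be matched against the visual features of each candidate moment to score it and estimate its temporal boundaries; this sketch only illustrates how word attention can be conditioned on temporal context.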


Published in

MM '18: Proceedings of the 26th ACM International Conference on Multimedia
October 2018, 2167 pages
ISBN: 9781450356657
DOI: 10.1145/3240508
Copyright © 2018 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
