Semantic Reasoning in Zero Example Video Event Retrieval

Authors:
Maaike H. T. De Boer, TNO and Radboud University, The Netherlands
Yi-Jie Lu, City University of Hong Kong, Hong Kong
Hao Zhang, City University of Hong Kong, Hong Kong
Klamer Schutte, TNO Netherlands
Chong-Wah Ngo, City University of Hong Kong, Hong Kong
Wessel Kraaij, TNO and Leiden University, The Netherlands

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 13, Issue 4, Article No. 60, pp. 1–17. https://doi.org/10.1145/3131288

Published: 04 October 2017

Abstract

Searching digital video data for high-level events, such as a parade or a car accident, is challenging when the query is textual and comes without visual example images or videos. Current research in deep neural networks is highly beneficial for the retrieval of high-level events using visual examples, but without examples it is still hard to determine (1) which concepts are useful to pre-train (the Vocabulary challenge) and (2) which pre-trained concept detectors are relevant for a certain unseen high-level event (the Concept Selection challenge). In this article, we present our Semantic Event Retrieval System, which (1) shows the importance of high-level concepts in a vocabulary for the retrieval of complex and generic high-level events and (2) uses a novel concept selection method (i-w2v) based on semantic embeddings. Our experiments on the international TRECVID Multimedia Event Detection benchmark show that a diverse vocabulary including high-level concepts improves performance on the retrieval of high-level events in videos and that our novel method outperforms a knowledge-based concept selection method.
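
The abstract does not spell out how i-w2v works, so the following is only a minimal sketch of the general idea of embedding-based concept selection, not the authors' method. It assumes a textual query is represented by the average of its word vectors and that the concepts in the vocabulary are ranked by cosine similarity to that average. The function names, the toy three-dimensional vectors, and the vocabulary are all hypothetical and chosen purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_concepts(query_terms, embeddings, vocabulary, top_k=2):
    """Rank vocabulary concepts by similarity to the averaged query vector.

    Assumption: this is a generic embedding-similarity ranking, not the
    i-w2v procedure from the article.
    """
    known = [embeddings[t] for t in query_terms if t in embeddings]
    if not known:
        return []  # no query term has an embedding
    query_vec = mean_vector(known)
    scored = [
        (concept, cosine(query_vec, embeddings[concept]))
        for concept in vocabulary
        if concept in embeddings
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real systems would load trained word vectors.
TOY_EMBEDDINGS = {
    "parade": [0.9, 0.1, 0.0],
    "marching": [0.8, 0.2, 0.1],
    "crowd": [0.7, 0.3, 0.0],
    "car": [0.0, 0.9, 0.2],
    "kitchen": [0.0, 0.1, 0.9],
}
# Hypothetical vocabulary of pre-trained concept detectors.
VOCABULARY = ["marching", "crowd", "car", "kitchen"]

selected = select_concepts(["parade"], TOY_EMBEDDINGS, VOCABULARY, top_k=2)
for concept, score in selected:
    print(f"{concept}: {score:.3f}")
```

In this sketch the Vocabulary challenge corresponds to deciding what goes into `VOCABULARY`, and the Concept Selection challenge corresponds to the ranking step inside `select_concepts`.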

Index Terms

Semantic Reasoning in Zero Example Video Event Retrieval
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
      1. Query representation
    2. Specialized information retrieval
      1. Multimedia and multimodal retrieval
        Video search

Published in
ACM Transactions on Multimedia Computing, Communications, and Applications Volume 13, Issue 4
November 2017
362 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3129737
Editor: Alberto Del Bimbo, University of Firenze, Italy
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 October 2017
- Accepted: 1 July 2017
- Revised: 1 May 2017
- Received: 1 July 2016
Author Tags
Content-based visual information retrieval
multimedia event detection
semantics
zero shot
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 12
- Total Downloads: 181
- Downloads (last 12 months): 8
- Downloads (last 6 weeks): 1