DOI: 10.1145/3123266.3123313
research-article

Enhancing Micro-video Understanding by Harnessing External Sounds

Published: 19 October 2017

ABSTRACT

Unlike traditional long videos, micro-videos are much shorter and are usually recorded at a specific place with mobile devices. To better understand the semantics of a micro-video and facilitate downstream applications, it is crucial to estimate the venue where the micro-video was recorded, for example, at a concert or on a beach. However, according to our statistics over two million micro-videos, only 1.22% of them were labeled with location information. For the large remaining number of micro-videos without location information, we have to rely on their content to estimate their venue categories. This is a highly challenging task, as micro-videos are naturally multi-modal (with textual, visual, and acoustic content) and, more importantly, the quality of each modality varies greatly across micro-videos.

In this work, we focus on enhancing the acoustic modality for the venue category estimation task. This is motivated by our finding that although the acoustic signal can complement the visual and textual signals well in reflecting a micro-video's venue, its quality is usually lower. As such, simply integrating acoustic features with visual and textual features leads only to suboptimal results, or even degrades the overall performance (cf. the barrel theory). To address this, we propose to compensate for the shortest board, the acoustic modality, by harnessing external sound knowledge. We develop a deep transfer model that jointly enhances the concept-level representation of micro-videos and the venue category prediction. To alleviate the sparsity problem of unpopular categories, we further regularize the representation learning of micro-videos of the same venue category. Through extensive experiments on a real-world dataset, we show that, by leveraging external acoustic knowledge, our model significantly outperforms the state-of-the-art method in terms of both Micro-F1 and Macro-F1 scores.
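
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Micro-F1 and Macro-F1 are conventionally computed for single-label classification. It is not taken from the paper: the function names and the toy venue labels are hypothetical, chosen only to illustrate why sparse (unpopular) categories affect Macro-F1 much more than Micro-F1.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    # Micro-F1 pools the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages per-class F1, so every class weighs the same.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy data: a popular venue and a sparse one.
y_true = ["beach"] * 8 + ["concert"] * 2
y_pred = ["beach"] * 10  # the sparse category is never predicted
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"Micro-F1 = {micro:.3f}, Macro-F1 = {macro:.3f}")
```

In this toy case the classifier ignores the sparse category entirely, yet Micro-F1 stays high while Macro-F1 drops sharply, which is why reporting both is informative when venue categories are imbalanced.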


          Published in

          MM '17: Proceedings of the 25th ACM international conference on Multimedia
          October 2017
          2028 pages
          ISBN: 9781450349062
          DOI: 10.1145/3123266

          Copyright © 2017 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          MM '17 Paper Acceptance Rate: 189 of 684 submissions, 28%
          Overall Acceptance Rate: 995 of 4,171 submissions, 24%
