ABSTRACT
Unlike traditional long videos, micro-videos are much shorter and are usually recorded at a specific place with mobile devices. To better understand the semantics of a micro-video and to facilitate downstream applications, it is crucial to estimate the venue where the micro-video was recorded, for example, at a concert or on a beach. However, according to our statistics over two million micro-videos, only 1.22% of them were labeled with location information. For the large remainder without location information, we have to rely on their content to estimate their venue categories. This is a highly challenging task, as micro-videos are naturally multi-modal (with textual, visual, and acoustic content) and, more importantly, the quality of each modality varies greatly across micro-videos.
In this work, we focus on enhancing the acoustic modality for the venue category estimation task. This is motivated by our finding that, although the acoustic signal complements the visual and textual signals well in reflecting a micro-video's venue, its quality is usually lower. As such, simply integrating acoustic features with visual and textual features leads only to suboptimal results, or even degrades the overall performance (cf. the barrel theory, in which the shortest board limits how much a barrel can hold). To address this, we propose to compensate for the shortest board, namely the acoustic modality, by harnessing external sound knowledge. We develop a deep transfer model that jointly enhances the concept-level representation of micro-videos and the venue category prediction. To alleviate the sparsity problem of unpopular categories, we further regularize the representation learning of micro-videos of the same venue category. Through extensive experiments on a real-world dataset, we show that, by leveraging external acoustic knowledge, our model significantly outperforms the state-of-the-art method in terms of both Micro-F1 and Macro-F1 scores.
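The two reported metrics behave differently when some venue categories are rare, which is the sparsity setting the abstract describes. The following is a minimal sketch of the standard definitions for single-label classification, not the authors' evaluation code; the function names and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    # Micro-F1 pools the counts over all categories before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages per-category F1, so rare categories weigh equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy data (hypothetical): a popular category and an unpopular one.
y_true = ["beach"] * 8 + ["concert"] * 2
y_pred = ["beach"] * 10  # the rare category is never predicted
micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # prints: 0.8 0.444
```

In this toy case, ignoring the unpopular category barely affects Micro-F1 but roughly halves Macro-F1, which is why reporting both scores is informative when category frequencies are skewed.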
Index Terms
- Enhancing Micro-video Understanding by Harnessing External Sounds