research-article

Enhancing Micro-video Understanding by Harnessing External Sounds

Authors:
Liqiang Nie, ShanDong University, Jinan, China
Xiang Wang, National University of Singapore, Singapore, Singapore
Jianglong Zhang, Communication University of China, Beijing, China
Xiangnan He, National University of Singapore, Singapore, Singapore
Hanwang Zhang, Columbia University, New York, NY, USA
Richang Hong, Hefei University of Technology, Hefei, China
Qi Tian, University of Texas at San Antonio, San Antonio, TX, USA

MM '17: Proceedings of the 25th ACM international conference on Multimedia, October 2017, Pages 1192–1200
https://doi.org/10.1145/3123266.3123313

Published: 19 October 2017

ABSTRACT

Unlike traditional long videos, micro-videos are much shorter and are usually recorded at a specific place with mobile devices. To better understand the semantics of a micro-video and facilitate downstream applications, it is crucial to estimate the venue where the micro-video is recorded, for example, at a concert or on a beach. However, according to our statistics over two million micro-videos, only 1.22% of them were labeled with location information. For the large remainder of micro-videos without location information, we have to rely on their content to estimate their venue categories. This is a highly challenging task, as micro-videos are naturally multi-modal (with textual, visual, and acoustic content), and more importantly, the quality of each modality varies greatly across micro-videos.

In this work, we focus on enhancing the acoustic modality for the venue category estimation task. This is motivated by our finding that although the acoustic signal can complement the visual and textual signals well in reflecting a micro-video's venue, its quality is usually lower. As such, simply integrating acoustic features with visual and textual features leads to suboptimal results, or even degrades the overall performance (cf. the barrel theory, under which a barrel holds only as much as its shortest board allows). To address this, we propose to compensate for the shortest board, the acoustic modality, by harnessing external sound knowledge. We develop a deep transfer model that jointly enhances the concept-level representation of micro-videos and the venue category prediction. To alleviate the sparsity problem of unpopular categories, we further regularize the representation learning of micro-videos of the same venue category. Through extensive experiments on a real-world dataset, we show that, by leveraging the external acoustic knowledge, our model significantly outperforms the state-of-the-art method in terms of both Micro-F1 and Macro-F1 scores.
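The abstract reports both Micro-F1 and Macro-F1, and the distinction matters when some venue categories are unpopular. The sketch below is not taken from the paper; it is a minimal illustration of the standard definitions for single-label classification, with hypothetical function names (`f1`, `micro_macro_f1`) and made-up toy labels. Micro-F1 pools the counts over all categories, so frequent venues dominate it, while Macro-F1 averages the per-category F1 scores, so a poorly predicted rare venue pulls it down.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions.

    Assumption: the category set is the union of the true and predicted labels.
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the true category but was missed
    # Micro: pool the counts over all categories, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per category, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy data (made up): a popular venue and an unpopular one.
y_true = ["beach"] * 8 + ["concert"] * 2
y_pred = ["beach"] * 8 + ["beach", "concert"]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"Micro-F1 = {micro:.3f}, Macro-F1 = {macro:.3f}")
```

On this toy data a single mistake on the rare category costs Macro-F1 noticeably more than Micro-F1, which is why reporting both gives a fuller picture for imbalanced venue categories.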


Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals
      1. Recommender systems
  2. World Wide Web
    1. Web searching and information discovery
      1. Social recommendation

Published in
MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN: 9781450349062
DOI: 10.1145/3123266
General Chairs: Qiong Liu (FXPAL, USA), Rainer Lienhart (Universität Augsburg, Germany), Haohong Wang (TCL America, USA)
Program Chairs: Sheng-Wei "Kuan-Ta" Chen (Academia Sinica, Taiwan), Susanne Boll (University of Oldenburg, Germany), Phoebe Chen (La Trobe University, Australia), Gerald Friedland (Lawrence Livermore National Lab, USA), Jia Li (Google, USA), Shuicheng Yan (Qihoo 360, China)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
deep neural network
external sound knowledge
micro-video categorization
representation learning
Acceptance Rates
MM '17 Paper Acceptance Rate: 189 of 684 submissions, 28%
Overall Acceptance Rate: 995 of 4,171 submissions, 24%