ABSTRACT
Like traditional long videos, micro-videos unite textual, acoustic, and visual modalities. These modalities sequentially describe a real-life event from distinct angles. Yet, unlike traditional long videos with rich content, micro-videos are very short, lasting 6-15 seconds, and hence usually convey only one or a few high-level concepts. In light of this, we must characterize and jointly model the sparseness and the multiple sequential structures to understand micro-videos better. To this end, we present an end-to-end deep learning model that combines three parallel LSTMs, which capture the sequential structures, with a convolutional neural network, which learns sparse concept-level representations of micro-videos. We apply our model to micro-video categorization. In addition, we constructed a real-world dataset for sequence modeling and released it to facilitate further research. Experimental results demonstrate that our model outperforms several state-of-the-art baselines.
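To make the architecture concrete, the following is a minimal sketch of the forward pass only, not the authors' implementation. The abstract does not specify how the three LSTM outputs are fused or how the convolutional network is configured, so this sketch assumes that the final hidden states are concatenated and passed through a 1-D convolution with ReLU and global max-pooling, followed by a softmax classifier. All names, dimensions, and the random weights (`LSTM`, `classify`, `dims`, `kernels`, `W_out`) are hypothetical, and biases and training are omitted for brevity.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class LSTM:
    """Single-layer LSTM without biases; one weight matrix per gate over [x_t, h_{t-1}]."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        self.W = {g: rand_matrix(hidden_dim, input_dim + hidden_dim, rng) for g in "ifog"}

    def encode(self, sequence):
        # Return the final hidden state after reading the whole sequence.
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in sequence:
            z = list(x) + h
            i = [sigmoid(a) for a in matvec(self.W["i"], z)]    # input gate
            f = [sigmoid(a) for a in matvec(self.W["f"], z)]    # forget gate
            o = [sigmoid(a) for a in matvec(self.W["o"], z)]    # output gate
            g = [math.tanh(a) for a in matvec(self.W["g"], z)]  # candidate cell state
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h

def conv1d_relu_maxpool(signal, kernels):
    # Valid 1-D convolution per kernel, then ReLU and global max-pooling.
    features = []
    for k in kernels:
        responses = [sum(kw * signal[p + j] for j, kw in enumerate(k))
                     for p in range(len(signal) - len(k) + 1)]
        features.append(max(max(r, 0.0) for r in responses))
    return features

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(video, lstms, kernels, W_out):
    # Each modality has its own LSTM; the three run independently of each other.
    hidden = []
    for name in ("textual", "acoustic", "visual"):
        hidden.extend(lstms[name].encode(video[name]))
    # Assumed fusion: convolve over the concatenated hidden states.
    concepts = conv1d_relu_maxpool(hidden, kernels)
    return softmax(matvec(W_out, concepts))

rng = random.Random(0)
dims = {"textual": 5, "acoustic": 4, "visual": 6}  # hypothetical per-step feature sizes
lstms = {m: LSTM(d, 8, rng) for m, d in dims.items()}
kernels = rand_matrix(10, 3, rng)                  # 10 hypothetical filters of width 3
W_out = rand_matrix(4, 10, rng)                    # 4 hypothetical categories
video = {m: [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(7)]
         for m, d in dims.items()}                 # 7 time steps of random features
probs = classify(video, lstms, kernels, W_out)
print(probs)
```

With untrained random weights the printed probabilities carry no meaning; the sketch only shows how three per-modality sequence encoders can feed a shared convolutional stage before classification.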
Index Terms
- Towards Micro-video Understanding by Joint Sequential-Sparse Modeling