DOI: 10.1145/3123266.3123341
research-article

Towards Micro-video Understanding by Joint Sequential-Sparse Modeling

Published: 19 October 2017

ABSTRACT

Like traditional long videos, micro-videos combine textual, acoustic, and visual modalities, which sequentially describe a real-life event from distinct angles. Unlike traditional long videos with rich content, however, micro-videos are very short, lasting 6 to 15 seconds, and hence usually convey only one or a few high-level concepts. In light of this, we have to characterize and jointly model the sparseness and the multiple sequential structures for better micro-video understanding. To this end, we present an end-to-end deep learning model that packs three parallel LSTMs to capture the sequential structures and a convolutional neural network to learn sparse concept-level representations of micro-videos. We applied our model to micro-video categorization. In addition, we constructed a real-world dataset for sequence modeling and released it to facilitate other researchers. Experimental results demonstrate that our model yields better performance than several state-of-the-art baselines.
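
The abstract names the building blocks (three parallel LSTMs plus a convolutional network) but not how they are wired together. The forward-only sketch below shows one plausible arrangement and should not be read as the authors' architecture. The fusion by concatenation, the 1-D convolution with ReLU and max-pooling, the layer sizes, the random weights, and every identifier (`LSTMEncoder`, `conv1d_relu`, `classify`) are assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rows, cols, rng):
    # Small random weights; a trained model would learn these instead.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LSTMEncoder:
    """Forward-only LSTM that returns its final hidden state."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix per gate, applied to the concatenation [x_t ; h_{t-1}].
        self.weights = {
            gate: rand_matrix(hidden_dim, input_dim + hidden_dim, rng)
            for gate in self.GATES
        }

    def encode(self, sequence):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in sequence:
            z = list(x) + h
            pre = {gate: matvec(self.weights[gate], z) for gate in self.GATES}
            i = [sigmoid(v) for v in pre["input"]]
            f = [sigmoid(v) for v in pre["forget"]]
            o = [sigmoid(v) for v in pre["output"]]
            cand = [math.tanh(v) for v in pre["candidate"]]
            # Standard LSTM cell update (biases omitted for brevity).
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, cand)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


def conv1d_relu(signal, kernels):
    """Valid 1-D convolution followed by ReLU; one output row per kernel."""
    rows = []
    for kernel in kernels:
        width = len(kernel)
        rows.append([
            max(0.0, sum(kernel[j] * signal[p + j] for j in range(width)))
            for p in range(len(signal) - width + 1)
        ])
    return rows


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(visual, acoustic, textual, encoders, kernels, w_out):
    # Each modality gets its own LSTM: the "three parallel LSTMs".
    states = [enc.encode(seq) for enc, seq in zip(encoders, (visual, acoustic, textual))]
    # Assumed fusion: concatenate the three final hidden states.
    fused = [value for state in states for value in state]
    # Assumed concept layer: convolution, ReLU, then global max-pool per kernel.
    concepts = [max(row) for row in conv1d_relu(fused, kernels)]
    return softmax(matvec(w_out, concepts))


rng = random.Random(0)
hidden = 4
encoders = [LSTMEncoder(dim, hidden, rng) for dim in (8, 6, 5)]
kernels = rand_matrix(3, 3, rng)  # three kernels of width 3
w_out = rand_matrix(2, len(kernels), rng)  # two hypothetical categories


def rand_seq(length, dim):
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(length)]


probs = classify(rand_seq(7, 8), rand_seq(7, 6), rand_seq(3, 5), encoders, kernels, w_out)
print(probs)
```

The modality sequences may differ in length and feature dimension because each LSTM reduces its input to a fixed-size final state before fusion. A practical implementation would use a deep learning framework and train all components jointly, which this sketch does not attempt.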


    • Published in

      MM '17: Proceedings of the 25th ACM International Conference on Multimedia
      October 2017
      2028 pages
      ISBN: 9781450349062
      DOI: 10.1145/3123266

      Copyright © 2017 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      MM '17 Paper Acceptance Rate: 189 of 684 submissions, 28%
      Overall Acceptance Rate: 995 of 4,171 submissions, 24%
