Research Article | Open Access
DOI: 10.1145/2733373.2806226

Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images

Published: 13 October 2015

ABSTRACT

We address the problem of fine-grained action localization in temporally untrimmed web videos, assuming that only weak video-level annotations are available for training. The goal is to use these weak labels to identify the temporal segments corresponding to the actions, and to learn models that generalize to unconstrained web videos. We find that web images queried by action names serve as well-localized highlights for many actions, but are noisily labeled. To address this, we propose a simple yet effective method that takes weak video labels and noisy image labels as input and generates localized action frames as output. This is achieved by cross-domain transfer between video frames and web images, using pre-trained deep convolutional neural networks. We then use the localized action frames to train action recognition models with long short-term memory (LSTM) networks. We collect FGA-240, a fine-grained sports action data set of more than 130,000 YouTube videos covering 240 fine-grained actions across 85 sports activities. We show convincing results on FGA-240, as well as on the THUMOS 2014 localization data set with untrimmed training videos.
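As a concrete illustration of the pipeline the abstract describes, the sketch below shows how such a system could be wired up in PyTorch: a pre-trained CNN embeds web images and video frames in a shared feature space, an image-trained classifier scores each frame against the weak video-level label to localize action frames, and an LSTM over the per-frame features yields the video-level prediction. This is a hypothetical sketch, not the authors' implementation; the backbone (ResNet-18), the 224x224 input size, the LSTM hidden size, and the 0.5 score threshold are all assumptions.

```python
# Hypothetical sketch of the abstract's pipeline (not the authors' code).
# Assumed choices: ResNet-18 backbone, 224x224 frames, hidden size 256,
# frame-score threshold 0.5.
import torch
import torch.nn as nn
import torchvision

NUM_ACTIONS = 240  # FGA-240 covers 240 fine-grained action classes

# Pre-trained CNN used as a shared feature extractor for web images and
# video frames; sharing it is what enables the cross-domain transfer.
backbone = torchvision.models.resnet18(weights="DEFAULT")
backbone.fc = nn.Identity()  # expose the 512-d pooled features
backbone.eval()

# Per-frame action classifier; in the paper's setting this would be trained
# on (noisily labeled) web images queried by action names.
frame_classifier = nn.Linear(512, NUM_ACTIONS)


def localize_action_frames(frames, video_label, threshold=0.5):
    """Score frames with the image-trained classifier and keep those whose
    score for the weak video-level label is high.

    frames: (T, 3, 224, 224) tensor of decoded video frames.
    Returns a boolean mask over the T frames (the localized action frames).
    """
    with torch.no_grad():
        feats = backbone(frames)                         # (T, 512)
        probs = frame_classifier(feats).softmax(dim=-1)  # (T, NUM_ACTIONS)
    return probs[:, video_label] > threshold


class ActionLSTM(nn.Module):
    """LSTM over per-frame CNN features, one action label per video."""

    def __init__(self, feat_dim=512, hidden=256, num_classes=NUM_ACTIONS):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, feats):        # feats: (B, T, feat_dim)
        _, (h, _) = self.lstm(feats)
        return self.head(h[-1])      # classify from the final hidden state


# Usage on a dummy clip: localize action frames, then classify the sequence.
frames = torch.randn(16, 3, 224, 224)            # stand-in for decoded frames
mask = localize_action_frames(frames, video_label=7)
selected = frames[mask] if mask.any() else frames
with torch.no_grad():
    seq_feats = backbone(selected).unsqueeze(0)  # (1, T', 512)
logits = ActionLSTM()(seq_feats)
print(logits.shape)                              # torch.Size([1, 240])
```

The key design point this sketch reflects is the single frozen backbone for both domains: because web images and video frames are embedded by the same network, frame scores from the image-trained classifier remain meaningful on video, which is what makes the image-to-video label transfer plausible.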


Published in

MM '15: Proceedings of the 23rd ACM International Conference on Multimedia
October 2015, 1402 pages
ISBN: 9781450334594
DOI: 10.1145/2733373

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 13 October 2015


Acceptance Rates

MM '15 Paper Acceptance Rate: 56 of 252 submissions, 22%
Overall Acceptance Rate: 995 of 4,171 submissions, 24%
