Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images

Authors:
Chen Sun
University of Southern California, Los Angeles, CA, USA

Sanketh Shetty
Google, Inc., Mountain View, CA, USA

Rahul Sukthankar
Google, Inc., Mountain View, CA, USA

Ram Nevatia
University of Southern California, Los Angeles, CA, USA

MM '15: Proceedings of the 23rd ACM international conference on Multimedia, October 2015, Pages 371–380, https://doi.org/10.1145/2733373.2806226

Published: 13 October 2015

ABSTRACT

We address the problem of fine-grained action localization from temporally untrimmed web videos. We assume that only weak video-level annotations are available for training. The goal is to use these weak labels to identify temporal segments corresponding to the actions, and learn models that generalize to unconstrained web videos. We find that web images queried by action names serve as well-localized highlights for many actions, but are noisily labeled. To solve this problem, we propose a simple yet effective method that takes weak video labels and noisy image labels as input, and generates localized action frames as output. This is achieved by cross-domain transfer between video frames and web images, using pre-trained deep convolutional neural networks. We then use the localized action frames to train action recognition models with long short-term memory networks. We collect a fine-grained sports action data set FGA-240 of more than 130,000 YouTube videos. It has 240 fine-grained actions under 85 sports activities. Convincing results are shown on the FGA-240 data set, as well as the THUMOS 2014 localization data set with untrimmed training videos.
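The cross-domain transfer step can be pictured with a toy sketch. The abstract does not spell out the algorithm, so everything below is an assumption made for illustration: nearest-centroid scoring stands in for the pre-trained CNNs, and the names `fit_centroids`, `own_label_prob`, `cross_domain_filter`, the `threshold` value and the number of `rounds` are all hypothetical. The idea shown is mutual filtering: a model fit on weakly labeled video frames discards web images that do not look like their query label, and a model fit on the surviving images then keeps only the frames it confidently assigns to the video's label.

```python
import numpy as np


def fit_centroids(feats, labels, num_classes):
    """Mean feature per class; a toy stand-in for fine-tuning a CNN classifier."""
    centroids = []
    for c in range(num_classes):
        members = feats[labels == c]
        if len(members) == 0:
            raise ValueError(f"class {c} has no remaining samples")
        centroids.append(members.mean(axis=0))
    return np.stack(centroids)


def own_label_prob(centroids, feats, labels):
    """Softmax over negative squared distances, read out at each item's own label."""
    d2 = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[np.arange(len(feats)), labels]


def cross_domain_filter(frame_feats, frame_labels, image_feats, image_labels,
                        num_classes, threshold=0.6, rounds=2):
    """Alternate between the two domains; return boolean keep-masks for both."""
    frame_keep = np.ones(len(frame_feats), dtype=bool)
    image_keep = np.ones(len(image_feats), dtype=bool)
    for _ in range(rounds):
        # Frames (weak video labels) vote on which web images to trust.
        frame_model = fit_centroids(frame_feats[frame_keep],
                                    frame_labels[frame_keep], num_classes)
        image_keep = own_label_prob(frame_model, image_feats, image_labels) >= threshold
        # Trusted images vote on which frames are localized action frames.
        image_model = fit_centroids(image_feats[image_keep],
                                    image_labels[image_keep], num_classes)
        frame_keep = own_label_prob(image_model, frame_feats, frame_labels) >= threshold
    return frame_keep, image_keep
```

In this sketch the frames marked `True` play the role of the localized action frames that would then be used to train the recognition model; background frames that look equally unlike every action score near chance and fall below the threshold.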

Index Terms

Computing methodologies
  Artificial intelligence
    Computer vision
      Computer vision tasks
        Video summarization

Published in
MM '15: Proceedings of the 23rd ACM international conference on Multimedia
October 2015
1402 pages
ISBN:9781450334594
DOI:10.1145/2733373
General Chairs:
Xiaofang Zhou, The University of Queensland, Australia
Alan F. Smeaton, Dublin City University, Ireland
Qi Tian, The University of Texas at San Antonio, USA
Program Chairs:
Dick C.A. Bulterman, FXPAL, USA
Heng Tao Shen, The University of Queensland, Australia
Ketan Mayer-Patel, The University of North Carolina, USA
Shuicheng Yan, National University of Singapore, Singapore
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
domain transfer
fine-grained action localization
lstm
Qualifiers
- research-article
Acceptance Rates
MM '15 paper acceptance rate: 56 of 252 submissions, 22%
Overall acceptance rate: 995 of 4,171 submissions, 24%