DOI: 10.1145/3242969.3264989
Short paper
Open Access

Multi-Feature Based Emotion Recognition for Video Clips

Published: 02 October 2018

ABSTRACT

In this paper, we present our latest progress in emotion recognition techniques, which combine acoustic features and facial features in both non-temporal and temporal modes. This paper details the techniques we used in the Audio-Video Emotion Recognition subtask of the 2018 Emotion Recognition in the Wild (EmotiW) Challenge. After multimodal results fusion, our final accuracy on the Acted Facial Expressions in the Wild (AFEW) test set reaches 61.87%, which is 1.53% higher than the best result from last year. This improvement demonstrates the effectiveness of our methods.
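The abstract does not spell out how the modalities are combined. The sketch below illustrates one common reading of "multimodal results fusion", namely decision-level (late) fusion by weighted averaging of class probabilities from the visual and acoustic models over the seven AFEW emotion categories. The model names, weights, and probability values are hypothetical placeholders for illustration, not values reported in the paper.

```python
import numpy as np

# Hypothetical illustration of decision-level (late) fusion: each model
# (e.g. a non-temporal face CNN, a temporal CNN-LSTM, an audio model)
# outputs class probabilities over the seven AFEW emotion categories,
# and a weighted average produces the final prediction. Weights and
# model names are placeholders, not taken from the paper.

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]

def fuse_predictions(prob_list, weights):
    """Weighted average of per-model class-probability vectors."""
    probs = np.asarray(prob_list)             # shape: (n_models, n_classes)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                           # normalize fusion weights
    fused = (w[:, None] * probs).sum(axis=0)  # shape: (n_classes,)
    return fused

# Made-up per-model outputs for a single video clip.
face_cnn  = [0.10, 0.05, 0.05, 0.50, 0.15, 0.10, 0.05]   # non-temporal visual model
cnn_lstm  = [0.05, 0.05, 0.10, 0.55, 0.10, 0.10, 0.05]   # temporal visual model
audio_net = [0.15, 0.05, 0.10, 0.30, 0.25, 0.10, 0.05]   # acoustic model

fused = fuse_predictions([face_cnn, cnn_lstm, audio_net], weights=[0.4, 0.4, 0.2])
print(EMOTIONS[int(np.argmax(fused))])   # predicted emotion for the clip
```

In practice the fusion weights would be tuned on the validation split; more elaborate schemes (learned fusion layers, per-class weights) follow the same pattern of combining per-modality scores after each model has been trained separately.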


    • Published in

      ICMI '18: Proceedings of the 20th ACM International Conference on Multimodal Interaction
      October 2018, 687 pages
      ISBN: 9781450356923
      DOI: 10.1145/3242969

      Copyright © 2018 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      ICMI '18 Paper Acceptance Rate: 63 of 149 submissions, 42%
      Overall Acceptance Rate: 453 of 1,080 submissions, 42%
