
Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends

Published: 24 April 2018

Abstract

Tracing 20 years of progress in making machines hear our emotions based on speech signal properties.

References

  1. Abdelwahab, M. and Busso, C. Supervised domain adaptation for emotion recognition from speech. In Proceedings of ICASSP (Brisbane, Australia, 2015). IEEE, 5058--5062.
  2. Anagnostopoulos, C.-N., Iliou, T., and Giannoukos, I. Features and classifiers for emotion recognition from speech: A survey from 2000 to 2011. Artificial Intelligence Review 43, 2 (2015), 155--177.
  3. Bhaykar, M., Yadav, J., and Rao, K.S. Speaker dependent, speaker independent and cross language emotion recognition from speech using GMM and HMM. In Proceedings of the National Conference on Communications (Delhi, India, 2013). IEEE, 1--5.
  4. Blanton, S. The voice and the emotions. Q. Journal of Speech 1, 2 (1915), 154--172.
  5. Chang, J. and Scherer, S. Learning representations of emotional speech with deep convolutional generative adversarial networks. arXiv preprint arXiv:1705.02394, 2017.
  6. Chen, L., Mao, X., Xue, Y., and Cheng, L.L. Speech emotion recognition: Features and classification models. Digital Signal Processing 22, 6 (2012), 1154--1160.
  7. Cibau, N.E., Albornoz, E.M., and Rufiner, H.L. Speech emotion recognition using a deep autoencoder. San Carlos de Bariloche, Argentina, 2013, 934--939.
  8. Darwin, C. The Expression of the Emotions in Man and Animals. Watts, 1948.
  9. Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G.J., Durand, F., and Freeman, W.T. The visual microphone: Passive recovery of sound from video. ACM Trans. Graphics 33, 4 (2014), 1--10.
  10. Dellaert, F., Polzin, T., and Waibel, A. Recognizing emotion in speech. In Proceedings of ICSLP 3 (Philadelphia, PA, 1996). IEEE, 1970--1973.
  11. Deng, J. Feature Transfer Learning for Speech Emotion Recognition. PhD dissertation, Technische Universität München, Germany, 2016.
  12. Deng, J., Xu, X., Zhang, Z., Frühholz, S., and Schuller, B. Semisupervised autoencoders for speech emotion recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26, 1 (2018), 31--43.
  13. Devillers, L., Vidrascu, L., and Lamel, L. Challenges in real-life emotion annotation and machine learning based detection. Neural Networks 18, 4 (2005), 407--422.
  14. Dhall, A., Goecke, R., Joshi, J., Sikka, K., and Gedeon, T. Emotion recognition in the wild challenge 2014: Baseline, data and protocol. In Proceedings of ICMI (Istanbul, Turkey, 2014). ACM, 461--466.
  15. El Ayadi, M., Kamel, M.S., and Karray, F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition 44, 3 (2011), 572--587.
  16. Fairbanks, G. and Pronovost, W. Vocal pitch during simulated emotion. Science 88, 2286 (1938), 382--383.
  17. Gunes, H. and Schuller, B. Categorical and dimensional affect analysis in continuous input: Current trends and future directions. Image and Vision Computing 31, 2 (2013), 120--136.
  18. Joachims, T. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, 2002.
  19. Kim, Y., Lee, H., and Provost, E.M. Deep learning for robust feature generation in audiovisual emotion recognition. In Proceedings of ICASSP (Vancouver, Canada, 2013). IEEE, 3687--3691.
  20. Koolagudi, S.G. and Rao, K.S. Emotion recognition from speech: A review. Intern. J. of Speech Technology 15, 2 (2012), 99--117.
  21. Kramer, E. Elimination of verbal cues in judgments of emotion from voice. The J. Abnormal and Social Psychology 68, 4 (1964), 390.
  22. Kraus, M.W. Voice-only communication enhances empathic accuracy. American Psychologist 72, 7 (2017), 644.
  23. Lee, C.M., Narayanan, S.S., and Pieraccini, R. Combining acoustic and language information for emotion recognition. In Proceedings of INTERSPEECH (Denver, CO, 2002). ISCA, 873--876.
  24. Leng, Y., Xu, X., and Qi, G. Combining active learning and semi-supervised learning to construct SVM classifier. Knowledge-Based Systems 44 (2013), 121--131.
  25. Liu, J., Chen, C., Bu, J., You, M., and Tao, J. Speech emotion recognition using an enhanced co-training algorithm. In Proceedings of ICME (Beijing, P.R. China, 2007). IEEE, 999--1002.
  26. Lotfian, R. and Busso, C. Emotion recognition using synthetic speech as neutral reference. In Proceedings of ICASSP (Brisbane, Australia, 2015). IEEE, 4759--4763.
  27. Mao, Q., Dong, M., Huang, Z., and Zhan, Y. Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans. Multimedia 16, 8 (2014), 2203--2213.
  28. Marsella, S. and Gratch, J. Computationally modeling human emotion. Commun. ACM 57, 12 (Dec. 2014), 56--67.
  29. Picard, R.W. Affective Computing. MIT Press, Cambridge, MA, 1997.
  30. Ram, C.S. and Ponnusamy, R. Assessment on speech emotion recognition for autism spectrum disorder children using support vector machine. World Applied Sciences J. 34, 1 (2016), 94--102.
  31. Schmitt, M., Ringeval, F., and Schuller, B. At the border of acoustics and linguistics: Bag-of-audio-words for the recognition of emotions in speech. In Proceedings of INTERSPEECH (San Francisco, CA, 2016). ISCA, 495--499.
  32. Schuller, B. and Batliner, A. Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing. Wiley, 2013.
  33. Schuller, B., Mousa, A.E.-D., and Vryniotis, V. Sentiment analysis and opinion mining: On optimal parameters and performances. WIREs Data Mining and Knowledge Discovery 5 (2015), 255--263.
  34. Soskin, W.F. and Kauffman, P.E. Judgment of emotion in word-free voice samples. J. of Commun. 11, 2 (1961), 73--80.
  35. Stuhlsatz, A., Meyer, C., Eyben, F., Zielke, T., Meier, G., and Schuller, B. Deep neural networks for acoustic emotion recognition: Raising the benchmarks. In Proceedings of ICASSP (Prague, Czech Republic, 2011). IEEE, 5688--5691.
  36. Tosa, N. and Nakatsu, R. Life-like communication agent---emotion sensing character 'MIC' and feeling session character 'MUSE.' In Proceedings of the 3rd International Conference on Multimedia Computing and Systems (Hiroshima, Japan, 1996). IEEE, 12--19.
  37. Trigeorgis, G., Ringeval, F., Brückner, R., Marchi, E., Nicolaou, M., Schuller, B., and Zafeiriou, S. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of ICASSP (Shanghai, P.R. China, 2016). IEEE, 5200--5204.
  38. Ververidis, D. and Kotropoulos, C. Emotional speech recognition: Resources, features, and methods. Speech Commun. 48, 9 (2006), 1162--1181.
  39. Watson, D., Clark, L.A., and Tellegen, A. Development and validation of brief measures of positive and negative affect: The PANAS scales. J. of Personality and Social Psychology 54, 6 (1988), 1063.
  40. Weninger, F., Eyben, F., Schuller, B.W., Mortillaro, M., and Scherer, K.R. On the acoustics of emotion in audio: What speech, music and sound have in common. Frontiers in Psychology 4, Article 292 (2013), 1--12.
  41. Williamson, J. Speech analyzer for analyzing pitch or frequency perturbations in individual speech pattern to determine the emotional state of the person. U.S. Patent 4,093,821, 1978.
  42. Wöllmer, M., Eyben, F., Reiter, S., Schuller, B., Cox, C., Douglas-Cowie, E., and Cowie, R. Abandoning emotion classes---Towards continuous emotion recognition with modeling of long-range dependencies. In Proceedings of INTERSPEECH (Brisbane, Australia, 2008). ISCA, 597--600.
  43. Zeng, Z., Pantic, M., Roisman, G.I., and Huang, T.S. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Trans. Pattern Analysis and Machine Intelligence 31, 1 (2009), 39--58.


    Reviews

    Jonathan P. E. Hodgson

    The two decades referred to in the subtitle essentially span the time since the publication of Picard's foundational Affective computing [1], which began the study of emotion recognition by computers. This paper can therefore be viewed as a comprehensive review of emotion recognition in speech. The author begins by laying out an overall view of the process. In gross terms, the process has four components. First, one chooses the model for emotions: either discrete classes or a value-continuous dimensional view composed of axes for arousal and positivity (valence). Then one acquires labeled data. Following this, features are selected and fed into a learning system. Initially, labeling the data required extensive human intervention, with all the ambiguity that implies; systems now exist in which the machine learns to label the data itself, in an iterative process where limited human advice is used to learn the labels. Features can be chunks of audio rather than just words. It is also important to take into account the speaker's states and traits beyond the emotion of interest. The author summarizes the results of recent speech emotion recognition (SER) challenge events in a useful table. Finally, the author considers challenges that the SER community could undertake. Going beyond the recognition of irony or sarcasm, the author suggests what he calls a "moonshot challenge": targeting the actual emotion of the speaker. The review illuminates a fascinating area and leaves the reader eager for more. There is a comprehensive bibliography.
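    The four-component pipeline the reviewer describes (emotion model, labeled data, features, learning system) maps naturally onto common open-source tooling. The following is a minimal sketch, not the article's prescribed recipe: it assumes a categorical emotion model, utterance-level MFCC statistics as features (via librosa), and an SVM as the learning system; the label set, the extract_features helper, and the synthetic stand-in data are illustrative assumptions.

        # Hypothetical sketch of the four-stage SER pipeline described in the review.
        import numpy as np
        import librosa  # assumed feature-extraction library; not mandated by the article
        from sklearn.model_selection import train_test_split
        from sklearn.svm import SVC

        EMOTIONS = ["neutral", "happy", "sad", "angry"]  # (1) assumed discrete emotion model

        def extract_features(wav_path):
            # (3) Collapse a variable-length utterance into a fixed-length vector:
            # per-coefficient mean and standard deviation of 13 MFCCs.
            signal, sr = librosa.load(wav_path, sr=16000)
            mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)  # shape (13, frames)
            return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

        # (2) Synthetic stand-in data so the sketch runs end to end; in practice the
        # vectors would come from extract_features() over a labeled emotional-speech corpus.
        rng = np.random.default_rng(seed=0)
        X = rng.normal(size=(200, 26))                # 200 utterances, 26-dim feature vectors
        y = rng.integers(0, len(EMOTIONS), size=200)  # integer emotion labels

        # (4) Train the learning system and evaluate on held-out utterances.
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
        print("held-out accuracy:", clf.score(X_test, y_test))

        # The iterative, human-in-the-loop labeling the review mentions can be
        # approximated by uncertainty sampling: route the utterances the current
        # model is least confident about to a human annotator.
        proba = clf.predict_proba(X_test)
        most_uncertain = np.argsort(proba.max(axis=1))[:5]
        print("utterances to send to an annotator:", most_uncertain)

    The mean/std pooling over MFCCs stands in for the far larger brute-forced acoustic feature sets used in the surveyed benchmark systems; end-to-end approaches such as Trigeorgis et al. [37] instead learn representations directly from the raw signal.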



    • Published in

      Communications of the ACM, Volume 61, Issue 5 (May 2018), 104 pages
      ISSN: 0001-0782, EISSN: 1557-7317
      DOI: 10.1145/3210350

      Copyright © 2018 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

