Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends

Author: Björn W. Schuller, University of Augsburg, Germany

Communications of the ACM, Volume 61, Issue 5, May 2018, pp. 90–99. https://doi.org/10.1145/3129340

Published: 24 April 2018

Abstract

Tracing 20 years of progress in making machines hear our emotions based on speech signal properties.

References

Abdelwahab, M. and Busso, C. Supervised domain adaptation for emotion recognition from speech. In Proceedings of ICASSP. (Brisbane, Australia, 2015). IEEE, 5058--5062.
Anagnostopoulos, C.-N., Iliou, T. and Giannoukos, I. Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011. Artificial Intelligence Review 43, 2 (2015), 155--177.
Bhaykar, M., Yadav, J. and Rao, K.S. Speaker dependent, speaker independent and cross language emotion recognition from speech using GMM and HMM. In Proceedings of the National Conference on Communications. (Delhi, India, 2013). IEEE, 1--5.
Blanton, S. The voice and the emotions. Q. Journal of Speech 1, 2 (1915), 154--172.
Chang, J. and Scherer, S. Learning Representations of Emotional Speech with Deep Convolutional Generative Adversarial Networks. arxiv.org, (arXiv:1705.02394), 2017.
Chen, L., Mao, X., Xue, Y. and Cheng, L.L. Speech emotion recognition: Features and classification models. Digital Signal Processing 22, 6 (2012), 1154--1160.
Cibau, N.E., Albornoz, E.M., and Rufiner, H.L. Speech emotion recognition using a deep autoencoder. San Carlos de Bariloche, Argentina, 2013, 934--939.
Darwin, C. The Expression of Emotion in Man and Animals. Watts, 1948.
Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G. J., Durand, F. and Freeman, W.T. The visual microphone: Passive recovery of sound from video. ACM Trans. Graphics 33, 4 (2014), 1--10.
Dellaert, F., Polzin, T. and Waibel, A. Recognizing emotion in speech. In Proceedings of ICSLP 3, (Philadelphia, PA, 1996). IEEE, 1970--1973.
Deng, J. Feature Transfer Learning for Speech Emotion Recognition. PhD thesis, Dissertation, Technische Universität München, Germany, 2016.
Deng, J., Xu, X., Zhang, Z., Frühholz, S., and Schuller, B. Semisupervised Autoencoders for Speech Emotion Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26, 1 (2018), 31--43.
Devillers, L., Vidrascu, L. and Lamel, L. Challenges in real-life emotion annotation and machine learning based detection. Neural Networks 18, 4 (2005), 407--422.
Dhall, A., Goecke, R., Joshi, J., Sikka, K. and Gedeon, T. Emotion recognition in the wild challenge 2014: Baseline, data and protocol. In Proceedings of ICMI (Istanbul, Turkey, 2014). ACM, 461--466.
El Ayadi, M., Kamel, M.S., and Karray, F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition 44, 3 (2011), 572--587.
Fairbanks, G. and Pronovost, W. Vocal pitch during simulated emotion. Science 88, 2286 (1938), 382--383.
Gunes, H. and Schuller, B. Categorical and dimensional affect analysis in continuous input: Current trends and future directions. Image and Vision Computing 31, 2 (2013), 120--136.
Joachims, T. Learning to classify text using support vector machines: Methods, theory and algorithms. Kluwer Academic Publishers, 2002.
Kim, Y., Lee, H. and Provost, E.M. Deep learning for robust feature generation in audiovisual emotion recognition. In Proceedings of ICASSP, (Vancouver, Canada, 2013). IEEE, 3687--3691.
Koolagudi, S.G. and Rao, K.S. Emotion recognition from speech: A review. Intern. J. of Speech Technology 15, 2 (2012), 99--117.
Kramer, E. Elimination of verbal cues in judgments of emotion from voice. The J. Abnormal and Social Psychology 68, 4 (1964), 390.
Kraus, M.W. Voice-only communication enhances empathic accuracy. American Psychologist 72, 7 (2017), 644.
Lee, C.M., Narayanan, S.S., and Pieraccini, R. Combining acoustic and language information for emotion recognition. In Proceedings of INTERSPEECH, (Denver, CO, 2002). ISCA, 873--876.
Leng, Y., Xu, X., and Qi, G. Combining active learning and semi-supervised learning to construct SVM classifier. Knowledge-Based Systems 44 (2013), 121--131.
Liu, J., Chen, C., Bu, J., You, M. and Tao, J. Speech emotion recognition using an enhanced co-training algorithm. In Proceedings ICME. (Beijing, P.R. China, 2007). IEEE, 999--1002.
Lotfian, R. and Busso, C. Emotion recognition using synthetic speech as neutral reference. In Proceedings of ICASSP. (Brisbane, Australia, 2015). IEEE, 4759--4763.
Mao, Q., Dong, M., Huang, Z. and Zhan, Y. Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans. Multimedia 16, 8 (2014), 2203--2213.
Marsella, S. and Gratch, J. Computationally modeling human emotion. Commun. ACM 57, 12 (Dec. 2014), 56--67.
Picard, R.W. and Picard, R. Affective Computing, vol. 252. MIT Press Cambridge, MA, 1997.
Ram, C.S. and Ponnusamy, R. Assessment on speech emotion recognition for autism spectrum disorder children using support vector machine. World Applied Sciences J. 34, 1 (2016), 94--102.
Schmitt, M., Ringeval, F. and Schuller, B. At the border of acoustics and linguistics: Bag-of-audio-words for the recognition of emotions in speech. In Proceedings of INTERSPEECH. (San Francisco, CA, 2016). ISCA, 495--499.
Schuller, B. and Batliner, A. Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing. Wiley, 2013.
Schuller, B., Mousa, A. E.-D., and Vasileios, V. Sentiment analysis and opinion mining: On optimal parameters and performances. WIREs Data Mining and Knowledge Discovery (2015), 5:255--5:263.
Soskin, W.F. and Kauffman, P.E. Judgment of emotion in word-free voice samples. J. of Commun. 11, 2 (1961), 73--80.
Stuhlsatz, A., Meyer, C., Eyben, F., Zielke, T., Meier, G. and Schuller, B. Deep neural networks for acoustic emotion recognition: Raising the benchmarks. In Proceedings of ICASSP. (Prague, Czech Republic, 2011). IEEE, 5688--5691.
Tosa, N. and Nakatsu, R. Life-like communication agent-emotion sensing character 'MIC' and feeling session character 'MUSE.' In Proceedings of the 3rd International Conference on Multimedia Computing and Systems. (Hiroshima, Japan, 1996). IEEE, 12--19.
Trigeorgis, G., Ringeval, F., Brückner, R., Marchi, E., Nicolaou, M., Schuller, B. and Zafeiriou, S. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of ICASSP. (Shanghai, P.R. China, 2016). IEEE, 5200--5204.
Ververidis, D. and Kotropoulos, C. Emotional speech recognition: Resources, features, and methods. Speech Commun. 48, 9 (2006), 1162--1181.
Watson, D., Clark, L.A., and Tellegen, A. Development and validation of brief measures of positive and negative affect: the PANAS scales. J. of Personality and Social Psychology 54, 6 (1988), 1063.
Weninger, F., Eyben, F., Schuller, B.W., Mortillaro, M., and Scherer, K.R. On the acoustics of emotion in audio: What speech, music and sound have in common. Frontiers in Psychology 4, Article ID 292 (2013), 1--12.
Williamson, J. Speech analyzer for analyzing pitch or frequency perturbations in individual speech pattern to determine the emotional state of the person. U.S. Patent 4,093,821, 1978.
Wöllmer, M., Eyben, F., Reiter, S., Schuller, B., Cox, C., Douglas-Cowie, E. and Cowie, R. Abandoning emotion classes - Towards continuous emotion recognition with modeling of long-range dependencies. In Proceedings of INTERSPEECH. (Brisbane, Australia, 2008). ISCA, 597--600.
Zeng, Z., Pantic, M., Roisman, G.I., and Huang, T.S. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Trans. Pattern Analysis and Machine Intelligence 31, 1 (2009), 39--58.

Index Terms

Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Speech recognition

Reviews

Reviewer: Jonathan P. E. Hodgson

The two decades referred to in the subtitle essentially span the time since the publication of Picard's foundational Affective Computing [1], which began the study of emotion recognition by computers. This paper can therefore be viewed as a comprehensive review of emotion recognition in speech. The author begins by laying out an overall view of the process, which in broad terms has four components. First, one chooses the model for emotions: either discrete classes or a value-continuous dimensional view composed of axes for arousal and positivity. Then one acquires labeled data. Following this, features are selected and fed into a learning system. Initially, labeling the data required extensive human intervention, with the ambiguities that this implies, but systems now exist in which the machine learns to label the data with some human intervention. This is an iterative process in which human advice is used to learn labels. Features can be chunks of audio rather than just words. It is also important to take into account the speaker's states and traits beyond the emotion of interest. The author summarizes the results of recent speech emotion recognition (SER) challenge events in a useful table. Finally, the author considers challenges that the SER community could undertake. Going beyond the recognition of irony or sarcasm, the author suggests what he calls a "moonshot challenge" to target the actual emotion of the speaker. The review illuminates a fascinating area and leaves the reader eager for more. There is a comprehensive bibliography.
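
The iterative labeling loop mentioned in the review can be illustrated with a small sketch. This is not the article's method: it is a toy example under stated assumptions, in which the features are synthetic two-dimensional points, the classifier is a simple nearest-centroid model, and the names `cooperative_labeling`, `threshold`, `budget`, and the labels `low_arousal` and `high_arousal` are all hypothetical. The machine labels the samples it is confident about and asks a human only about the least certain ones.

```python
import random

def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def train(labeled):
    """Nearest-centroid model: one centroid per emotion label."""
    by_label = {}
    for features, label in labeled:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(pts) for label, pts in by_label.items()}

def predict(model, features):
    """Return (label, confidence); needs at least two labels in the model.

    Confidence is a margin in [0, 1]: 1 when the sample sits on the best
    centroid, 0 when the two closest centroids are equally far away.
    """
    ranked = sorted((distance(features, c), label) for label, c in model.items())
    (d1, best), (d2, _) = ranked[0], ranked[1]
    confidence = 0.0 if d2 == 0 else 1.0 - d1 / d2
    return best, confidence

def cooperative_labeling(seed, pool, ask_human, threshold=0.5, budget=2, rounds=3):
    """Grow the labeled set: machine labels confident samples, human the rest.

    Each round retrains on everything labeled so far, accepts machine labels
    whose confidence reaches `threshold`, and sends at most `budget` of the
    least confident samples to `ask_human`. Returns the labeled set, the
    samples still unlabeled, and the number of human queries.
    """
    labeled, pool, human_queries = list(seed), list(pool), 0
    for _ in range(rounds):
        if not pool:
            break
        model = train(labeled)
        scored = [(predict(model, x), x) for x in pool]
        confident = [(x, label) for (label, conf), x in scored if conf >= threshold]
        uncertain = sorted(
            ((conf, x) for (_, conf), x in scored if conf < threshold),
            key=lambda item: item[0],
        )
        asked = [x for _, x in uncertain[:budget]]
        pool = [x for _, x in uncertain[budget:]]
        labeled += confident
        labeled += [(x, ask_human(x)) for x in asked]
        human_queries += len(asked)
    return labeled, pool, human_queries

# Synthetic stand-in for acoustic features (assumption: two separable classes).
random.seed(0)
centers = {"low_arousal": (0.2, 0.3), "high_arousal": (0.8, 0.7)}
truth = {}
for label, (cx, cy) in centers.items():
    for _ in range(20):
        truth[(random.gauss(cx, 0.1), random.gauss(cy, 0.1))] = label

samples = list(truth)
seed = [(samples[0], truth[samples[0]]), (samples[20], truth[samples[20]])]
pool = samples[1:20] + samples[21:]

# The "human" is simulated by looking up the true label.
labeled, leftover, queries = cooperative_labeling(seed, pool, ask_human=truth.get)
print(len(labeled), "labeled,", len(leftover), "left,", queries, "human queries")
```

In a real system the features would come from audio, the model would be far stronger, and the human answers would themselves be ambiguous, which is the difficulty the review points out.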

Published in
Communications of the ACM Volume 61, Issue 5
May 2018
104 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3210350
Editor: Andrew A. Chien
Association for Computing Machinery, New York, NY
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]