
A deep learning approach for generalized speech animation

Published: 20 July 2017

Abstract

We introduce a simple and effective deep learning approach to automatically generate natural-looking speech animation that synchronizes to input speech. Our approach uses a sliding window predictor that learns arbitrary nonlinear mappings from phoneme label input sequences to mouth movements in a way that accurately captures natural motion and visual coarticulation effects. Our deep learning approach enjoys several attractive properties: it runs in real time, requires minimal parameter tuning, generalizes well to novel input speech sequences, is easily edited to create stylized and emotional speech, and is compatible with existing animation retargeting approaches. One important focus of our work is to develop an effective approach for speech animation that can be easily integrated into existing production pipelines. We provide a detailed description of our end-to-end approach, including machine learning design decisions. Generalized speech animation results are demonstrated over a wide range of animation clips on a variety of characters and voices, including singing and foreign language input. Our approach can also generate on-demand speech animation in real time from user speech input.
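As a rough illustration of the sliding window idea summarized above, the Python sketch below maps a window of one-hot phoneme labels to a window of mouth-shape parameters with a small fully connected network, then blends overlapping output windows by averaging. This is not the authors' implementation: the phoneme-set size, window lengths, hidden-layer sizes, and parameter dimensionality are illustrative assumptions.

```python
# Minimal sketch of a sliding-window phoneme-to-animation predictor (PyTorch).
# All sizes below are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_PHONEMES = 41   # assumed phoneme label set size
INPUT_WINDOW = 11   # assumed number of phoneme frames per input window
OUTPUT_WINDOW = 5   # assumed number of animation frames predicted per window
PARAM_DIM = 30      # assumed dimensionality of the mouth-shape parameters

class SlidingWindowPredictor(nn.Module):
    """Fully connected network applied independently to each input window."""
    def __init__(self, hidden=2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(INPUT_WINDOW * NUM_PHONEMES, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, OUTPUT_WINDOW * PARAM_DIM),
        )

    def forward(self, window):
        # window: (batch, INPUT_WINDOW, NUM_PHONEMES) one-hot phoneme labels
        out = self.net(window.flatten(start_dim=1))
        return out.view(-1, OUTPUT_WINDOW, PARAM_DIM)

def predict_sequence(model, phoneme_ids):
    """Slide the predictor over a frame-level phoneme sequence and average
    overlapping output windows into one mouth-shape parameter trajectory."""
    seq = F.one_hot(phoneme_ids, NUM_PHONEMES).float()      # (T, NUM_PHONEMES)
    T = seq.shape[0]
    half_in, half_out = INPUT_WINDOW // 2, OUTPUT_WINDOW // 2
    padded = F.pad(seq, (0, 0, half_in, half_in))           # zero-pad in time
    accum = torch.zeros(T, PARAM_DIM)
    counts = torch.zeros(T, 1)
    with torch.no_grad():
        for t in range(T):
            window = padded[t:t + INPUT_WINDOW].unsqueeze(0)
            pred = model(window)[0]                          # (OUTPUT_WINDOW, PARAM_DIM)
            lo, hi = max(0, t - half_out), min(T, t + half_out + 1)
            start = lo - (t - half_out)
            accum[lo:hi] += pred[start:start + (hi - lo)]
            counts[lo:hi] += 1
    return accum / counts                                    # (T, PARAM_DIM)

if __name__ == "__main__":
    model = SlidingWindowPredictor()
    frames = torch.randint(0, NUM_PHONEMES, (120,))          # 120 frames of phoneme IDs
    trajectory = predict_sequence(model, frames)
    print(trajectory.shape)                                   # torch.Size([120, 30])
```

Because a predictor of this kind has no recurrent state, each window can be evaluated independently, which is consistent with the real-time, streaming use described in the abstract.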


Supplemental Material


Published in

ACM Transactions on Graphics, Volume 36, Issue 4 (August 2017), 2155 pages
ISSN: 0730-0301   EISSN: 1557-7368   DOI: 10.1145/3072959

Copyright © 2017 ACM


              Publisher

              Association for Computing Machinery

              New York, NY, United States

