
A deep learning approach for generalized speech animation

Published: 20 July 2017

Abstract

We introduce a simple and effective deep learning approach to automatically generate natural-looking speech animation that synchronizes to input speech. Our approach uses a sliding window predictor that learns arbitrary nonlinear mappings from phoneme label input sequences to mouth movements in a way that accurately captures natural motion and visual coarticulation effects. Our deep learning approach enjoys several attractive properties: it runs in real time, requires minimal parameter tuning, generalizes well to novel input speech sequences, is easily edited to create stylized and emotional speech, and is compatible with existing animation retargeting approaches. One important focus of our work is to develop an effective approach for speech animation that can be easily integrated into existing production pipelines. We provide a detailed description of our end-to-end approach, including machine learning design decisions. Generalized speech animation results are demonstrated over a wide range of animation clips on a variety of characters and voices, including singing and foreign language input. Our approach can also generate on-demand speech animation in real time from user speech input.
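As a rough illustration of the sliding window idea summarized above, the Python sketch below maps a window of one-hot phoneme labels to a window of mouth-shape parameters with a small fully connected network, then blends overlapping output windows by averaging. This is not the authors' implementation: the phoneme-set size, window lengths, hidden-layer sizes, and parameter dimensionality are illustrative assumptions.

```python
# Minimal sketch of a sliding-window phoneme-to-animation predictor (PyTorch).
# All sizes below are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_PHONEMES = 41   # assumed phoneme label set size
INPUT_WINDOW = 11   # assumed number of phoneme frames per input window
OUTPUT_WINDOW = 5   # assumed number of animation frames predicted per window
PARAM_DIM = 30      # assumed dimensionality of the mouth-shape parameters

class SlidingWindowPredictor(nn.Module):
    """Fully connected network applied independently to each input window."""
    def __init__(self, hidden=2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(INPUT_WINDOW * NUM_PHONEMES, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, OUTPUT_WINDOW * PARAM_DIM),
        )

    def forward(self, window):
        # window: (batch, INPUT_WINDOW, NUM_PHONEMES) one-hot phoneme labels
        out = self.net(window.flatten(start_dim=1))
        return out.view(-1, OUTPUT_WINDOW, PARAM_DIM)

def predict_sequence(model, phoneme_ids):
    """Slide the predictor over a frame-level phoneme sequence and average
    overlapping output windows into one mouth-shape parameter trajectory."""
    seq = F.one_hot(phoneme_ids, NUM_PHONEMES).float()      # (T, NUM_PHONEMES)
    T = seq.shape[0]
    half_in, half_out = INPUT_WINDOW // 2, OUTPUT_WINDOW // 2
    padded = F.pad(seq, (0, 0, half_in, half_in))           # zero-pad in time
    accum = torch.zeros(T, PARAM_DIM)
    counts = torch.zeros(T, 1)
    with torch.no_grad():
        for t in range(T):
            window = padded[t:t + INPUT_WINDOW].unsqueeze(0)
            pred = model(window)[0]                          # (OUTPUT_WINDOW, PARAM_DIM)
            lo, hi = max(0, t - half_out), min(T, t + half_out + 1)
            start = lo - (t - half_out)
            accum[lo:hi] += pred[start:start + (hi - lo)]
            counts[lo:hi] += 1
    return accum / counts                                    # (T, PARAM_DIM)

if __name__ == "__main__":
    model = SlidingWindowPredictor()
    frames = torch.randint(0, NUM_PHONEMES, (120,))          # 120 frames of phoneme IDs
    trajectory = predict_sequence(model, frames)
    print(trajectory.shape)                                   # torch.Size([120, 30])
```

Because a predictor of this kind has no recurrent state, each window can be evaluated independently, which is consistent with the real-time, streaming use described in the abstract.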


Supplemental Material


Published in

ACM Transactions on Graphics, Volume 36, Issue 4 (August 2017), 2155 pages
ISSN: 0730-0301   EISSN: 1557-7368   DOI: 10.1145/3072959

Copyright © 2017 ACM


              Publisher

              Association for Computing Machinery

              New York, NY, United States

