Abstract
We introduce a simple and effective deep learning approach to automatically generate natural looking speech animation that synchronizes to input speech. Our approach uses a sliding window predictor that learns arbitrary nonlinear mappings from phoneme label input sequences to mouth movements in a way that accurately captures natural motion and visual coarticulation effects. Our deep learning approach enjoys several attractive properties: it runs in real-time, requires minimal parameter tuning, generalizes well to novel input speech sequences, is easily edited to create stylized and emotional speech, and is compatible with existing animation retargeting approaches. One important focus of our work is to develop an effective approach for speech animation that can be easily integrated into existing production pipelines. We provide a detailed description of our end-to-end approach, including machine learning design decisions. Generalized speech animation results are demonstrated over a wide range of animation clips on a variety of characters and voices, including singing and foreign language input. Our approach can also generate on-demand speech animation in real-time from user speech input.
Supplemental Material
Available for Download
Supplemental files.
- Robert Anderson, Bjorn Stenger, Vincent Wan, and Roberto Cipolla. 2013. Expressive Visual Text-To-Speech Using Active Appearance Models. In Proccedings of the International Conference on Computer Vision and Pattern Recognition. 3382--3389. Google ScholarDigital Library
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).Google Scholar
- Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. 2012. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop. (2012).Google Scholar
- Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman, Robert W Sumner, and Markus Gross. 2011. High-quality passive facial performance capture using anchor frames. ACM Transactions on Graphics 30 (Aug. 2011), 75:1--75:10. Issue 4.Google Scholar
- Matthew Brand. 1999. Voice Puppetry. In Proceedings of SIGGRAPH. ACM, 21--28. Google ScholarDigital Library
- Christoph Bregler, Michele Covell, and Malcolm Slaney. 1997. Video Rewrite: Driving Visual Speech with Audio. In Proceedings of SIGGRAPH. 353--360. Google ScholarDigital Library
- Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. 2015. Real-time high-fidelity facial performance capture. ACM Transactions on Graphics 34, 4 (2015), 46.Google ScholarDigital Library
- Chen Cao, Yanlin Weng, Stephen Lin, and Kun Zhou. 2013. 3D Shape Regression for Real-time Facial Animation. ACM Transactions on Graphics 32, 4 (2013), 41:1--41:10.Google ScholarDigital Library
- Yong Cao, Wen C Tien, Petros Faloutsos, and Frédéric Pighin. 2005. Expressive Speech-Driven Facial Animation. ACM Transactions on Graphics 24, 4 (2005), 1283 -- 1302. Google ScholarDigital Library
- Rich Caruana and Alexandru Niculescu-Mizil. 2006. An empirical comparison of supervised learning algorithms. In International Conference on Machine Learning (ICML). 161--168. Google ScholarDigital Library
- Michael M Cohen, Dominic W Massaro, and others. 1994. Modeling Coarticualtion in Synthetic Visual Speech. In Models and Techniques in Computer Animation, N.M. Thalmann and Thalmann D (Eds.). Springer-Verlag, 141--155.Google Scholar
- Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, Aug (2011), 2493--2537.Google ScholarDigital Library
- Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. 2001. Active Appearance Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 6 (2001), 681--685. Google ScholarDigital Library
- Eric Cosatto and Hans Peter Graf. 2000. Photo-realistic Talking-heads from Image Samples. IEEE Transactions on Multimedia 2, 3 (2000), 152--163. Google ScholarDigital Library
- José Mario De Martino, Léo Pini Magalhães, and Fábio Violaro. 2006. Facial animation based on context-dependent visemes. Journal of Computers and Graphics 30, 6 (2006), 971 -- 980. Google ScholarDigital Library
- Salil Deena, Shaobo Hou, and Aphrodite Galata. 2010. Visual speech synthesis by modelling coarticulation dynamics using a non-parametric switching state-space model. In Proceedings of the International Conference on Multimodal Interfaces. 1--8. Google ScholarDigital Library
- Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. 2016. JALI: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics (TOG) 35, 4 (2016), 127.Google ScholarDigital Library
- Gwenn Englebienne, Timothy F Cootes, and Magnus Rattray. 2007. A Probabilistic Model for Generating Realistic Speech Movements from Speech. In Proceedings of Advances in Natural Information Processing Systems. 401--408.Google Scholar
- Tony Ezzat, Gadi Geiger, and Tomaso Poggio. 2002. Trainable Videorealistic Speech Animation. In ACM Transactions on Graphics. 388--398. Google ScholarDigital Library
- Bo Fan, Lijuan Wang, Frank K Soong, and Lei Xie. 2015. Photo-real Talking Head with Deep Bidirectional LSTM. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. IEEE, 4884--4888.Google ScholarCross Ref
- Shengli Fu, Ricardo Gutierrez-Osuna, Anna Esposito, Praveen K Kakumanu, and Oscar N Garcia. 2005. Audio/visual mapping with cross-modal hidden Markov models. IEEE Transactions on Multimedia 7, 2 (2005), 243--252.Google ScholarDigital Library
- Graham Fyfe, Andrew Jones, Oleg Alexander, Ryosuke Ichikari, and Paul Debevec. 2014. Driving High-Resolution Facial Scans with Video Performance Capture. ACM Transactions on Graphics 34, 1 (2014), 8.Google Scholar
- John S Garofolo, Lori F Lamel, William M Fisher, Jonathon G Fiscus, and David S Pallett. 1993. Darpa Timit Acoustic-Phonetic Continuous Speech Corpus CD-ROM TIMIT. Technical Report 4930. NIST.Google Scholar
- Oxana Govokhina, Gérard Bailly, Gaspard Breton, and Paul Bagshaw. 2006. TDA: A new trainable trajectory formation system for facial animation. In Proceedings of Interspeech. 2474--2477.Google Scholar
- Alex Graves and Navdeep Jaitly. 2014. Towards End-To-End Speech Recognition with Recurrent Neural Networks. In ICML, Vol. 14. 1764--1772.Google ScholarDigital Library
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735--1780. Google ScholarDigital Library
- Haoda Huang, Jinxiang Chai, Xin Tong, and Hsiang-Tao Wu. 2011. Leveraging motion capture and 3d scanning for high-fidelity facial performance acquisition. In ACM Transactions on Graphics, Vol. 30. ACM, 74. Google ScholarDigital Library
- Taehwan Kim, Yisong Yue, Sarah Taylor, and Iain Matthews. 2015. A Decision Tree Framework for Spatiotemporal Sequence Prediction. In ACM Conference on Knowledge Discovery and Data Mining. 577--586. Google ScholarDigital Library
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems. 1097--1105.Google Scholar
- Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. 2013. Realtime Facial Animation with On-the-fly Correctives. ACM Transactions on Graphics 32, 4 (2013), 42--1. Google ScholarDigital Library
- Kang Liu and Joern Ostermann. 2012. Evaluation of an image-based talking head with realistic facial expression and head motion. Multimodal User Interfaces 5 (2012), 37--44. Google ScholarCross Ref
- Changwei Luo, Jun Yu, Xian Li, and Zengfu Wang. 2014. Realtime speech-driven facial animation using Gaussian Mixture Models. In IEEE Conference on Multimedia and Expo Workshops. 1--6.Google Scholar
- Jiyong Ma, Ron Cole, Bryan Pellom, Wayne Ward, and Barbara Wise. 2006. Accurate Visible Speech Synthesis Based on Concatenating Variable Length Motion Capture Data. IEEE Transactions on Visualization and Computer Graphics 12, 2 (2006), 266--276. Google ScholarDigital Library
- Iain Matthews and Simon Baker. 2004. Active Appearance Models Revisited. International Journal of Computer Vision 60, 2 (2004), 135--164. Google ScholarDigital Library
- Wesley Mattheyses, Lukas Latacz, and Werner Verhelst. 2013. Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis. Speech Communication 55, 7--8 (2013), 857--876.Google Scholar
- Harry McGurk and John MacDonald. 1976. Hearing lips and seeing voices. Nature 264 (Dec. 1976), 746--748. Google ScholarCross Ref
- Thomas Merritt and Simon King. 2013. Investigating the shortcomings of HMM synthesis. In ISCA Workshop on Speech Synthesis. 185--190.Google Scholar
- Julian James Odell. 1995. The Use of Context in Large Vocabulary Speech Recognition. Ph.D. Dissertation. Cambridge University.Google Scholar
- Dietmar Schabus, Michael Pucher, and Gregor Hofer. 2011. Simultaneous Speech and Animation Synthesis. In ACM SIGGRAPH Posters. 8:1--8:1. Google ScholarDigital Library
- Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR) 15, 1 (2014), 1929--1958.Google ScholarDigital Library
- Robert W Sumner and Jovan Popović. 2004. Deformation transfer for triangle meshes. ACM Transactions on Graphics 23, 3 (2004), 399--405. Google ScholarDigital Library
- Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Neural Information Processing Systemsw. 3104--3112.Google Scholar
- Sarah L Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. 2012. Dynamic Units of Visual Speech. In Proceedings of ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association, 275--284.Google Scholar
- Barry-John Theobald and Iain Matthews. 2012. Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers. IEEE Transactions on Audio, Speech and Language Processing 20, 8 (2012), 2378.Google ScholarDigital Library
- Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).Google Scholar
- Lijuan Wang, Wei Han, and Frank K Soong. 2012. High quality lip-sync animation for 3D photo-realistic talking head. In IEEE Conference on Acoustics, Speech and Signal Processing. IEEE, 4529--4532. Google ScholarCross Ref
- Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. 2011. Realtime Performance-based Facial Animation. In ACM Transactions on Graphics (TOG), Vol. 30. 77:1--77:10. Google ScholarDigital Library
- Yanlin Weng, Chen Cao, Qiming Hou, and Kun Zhou. 2014. Real-time facial animation on mobile devices. Graphical Models 76, 3 (2014), 172--179. Google ScholarCross Ref
- Lei Xie and Zhi-Qiang Liu. 2007. A coupled HMM approach to video-realistic speech animation. Pattern Recognition 40, 8 (2007), 2325--2340. Google ScholarDigital Library
- Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044 2, 3 (2015), 5.Google Scholar
- Yuyu Xu, Andrew W Feng, Stacy Marsella, and Ari Shapiro. 2013. A Practical and Configurable Lip Sync Method for Games. In Proc. ACM SIGGRAPH Motion in Games. 131--140. Google ScholarDigital Library
- Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, and others. 2006. The HTK Book. Cambridge University.Google Scholar
- Jiahong Yuan and Mark Liberman. 2008. Speaker Identification on the SCOTUS Corpus. Journal of the Acoustical Society of America 123, 5 (2008). Google ScholarCross Ref
- Heiga Zen, Takashi Nose, Junichi Yamagishi, Shinji Sako, Takashi Masuko, Alan Black, and Keiichi Tokuda. 2007. The HMM-based speech synthesis system version 2.0. In Proceedings of the Speech Synthesis Workshop. 294--299.Google Scholar
- Li Zhang, Noah Snavely, Brian Curless, and Steven M Seitz. 2004. Spacetime Faces: High Resolution Capture for Modeling and Animation. In ACM Transactions on Graphics. 548--558.Google ScholarDigital Library
Index Terms
- A deep learning approach for generalized speech animation
Recommendations
Phoneme reduction in automated speech for computer animation
SCCG '04: Proceedings of the 20th Spring Conference on Computer GraphicsAccurate facial animation is rapidly becoming a key feature in computer animation, from movie production to software agents. A lot of practical work in computer animation presently is done by 3ds max software. An advantage of this system is that one of ...
A coupled HMM approach to video-realistic speech animation
We propose a coupled hidden Markov model (CHMM) approach to video-realistic speech animation, which realizes realistic facial animations driven by speaker independent continuous speech. Different from hidden Markov model (HMM)-based animation approaches ...
Lip-synced character speech animation with dominated animeme models
SA '12: SIGGRAPH Asia 2012 Technical BriefsOne of the holy grails of computer graphics is the generation of photorealistic images with motion data. To re-generate convincing human animations might not be the most challenging part, but it is definitely one of ultimate goals for computer graphics. ...
Comments