ABSTRACT
In this paper we describe an early version of our system, which synthesizes 3D visual speech, including tongue and teeth, from frontal facial image sequences. The system is developed for 3D Visual Speech Animation (VSA) using images generated by an existing state-of-the-art image-based VSA system. The prime motivation is to obtain a 3D VSA system from a limited amount of training data compared to that required by a conventional corpus-based 3D VSA system. The system consists of two modules. The first module iteratively estimates the 3D shape of the external facial surface for each image in the input sequence. The second module complements the external face with a 3D tongue and teeth to complete the perceptually crucial visual speech information. This provides the added advantages of 3D visual speech: the face can be rendered in different poses and under different illumination conditions, and the visual information of the tongue and teeth is enhanced. The first module, for 3D shape estimation, is based on the detection of facial landmarks in images and uses a prior 3D Morphable Model (3D-MM) trained on 3D facial data. For the time being, the system is person-specific: the 3D-MM and the 2D facial landmark detector are trained using the data of a single person and tested on data of the same person. The estimated 3D shape sequences are provided as input to the second module, along with the phonetic segmentation. For any particular 3D shape, tongue and teeth information is generated by rotating the lower jaw based on a few skin points on the jaw and by animating a rigid 3D tongue through keyframe interpolation.
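The core of the first module, fitting 3D-MM shape coefficients to detected 2D facial landmarks, can be illustrated as a regularized linear least-squares problem. The sketch below is an assumption-laden simplification (scaled-orthographic camera, known pose, landmark-to-vertex correspondences already established; all function and variable names are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical sketch: recover 3D-MM shape coefficients from 2D landmarks
# under a scaled-orthographic projection with known (identity) pose.
def fit_shape_coefficients(landmarks_2d, mean_shape, basis, scale=1.0, reg=1e-6):
    """Regularized least-squares fit of shape coefficients.

    landmarks_2d : (L, 2) detected 2D landmark positions
    mean_shape   : (L, 3) mean 3D positions of the corresponding model vertices
    basis        : (K, L, 3) principal shape components at those vertices
    """
    K = basis.shape[0]
    # Orthographic projection keeps only the x and y coordinates.
    A = scale * basis[:, :, :2].reshape(K, -1).T            # (2L, K)
    b = (landmarks_2d - scale * mean_shape[:, :2]).ravel()  # (2L,)
    # Tikhonov regularization keeps coefficients close to the model prior.
    return np.linalg.solve(A.T @ A + reg * np.eye(K), A.T @ b)

# Toy example: 5 landmarks, 2 shape components, synthetic ground truth.
rng = np.random.default_rng(0)
mean = rng.standard_normal((5, 3))
basis = rng.standard_normal((2, 5, 3))
true_coeffs = np.array([0.5, -1.0])
shape = mean + np.tensordot(true_coeffs, basis, axes=1)
coeffs = fit_shape_coefficients(shape[:, :2], mean, basis)
print(np.round(coeffs, 2))
```

In the actual system the fit would be iterated jointly with pose estimation (e.g. via a PnP step, as cited in the references), rather than solved in one closed-form pass as here.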