DOI: 10.1145/2683483.2683530

3D Visual Speech Animation from Image Sequences

Published: 14 December 2014

ABSTRACT

In this paper we describe an early version of our system, which synthesizes 3D visual speech, including the tongue and teeth, from frontal facial image sequences. The system performs 3D Visual Speech Animation (VSA) using images generated by an existing state-of-the-art image-based VSA system. Its prime motivation is to obtain a 3D VSA system from a limited amount of training data, compared with the amount required to build a conventional corpus-based 3D VSA system. It consists of two modules. The first module iteratively estimates the 3D shape of the external facial surface for each image in the input sequence. The second module complements the external face with a 3D tongue and teeth to complete the perceptually crucial visual speech information. This yields the added advantages of 3D visual speech: the face can be rendered under different poses and illumination conditions, and the visual information of the tongue and teeth is enhanced. The first module, for 3D shape estimation, is based on the detection of facial landmarks in images and uses a prior 3D Morphable Model (3D-MM) trained on 3D facial data. For the time being, the system is person-specific, i.e., the 3D-MM and the 2D facial landmark detector are trained on data from a single person and tested on data from that same person. The estimated 3D shape sequences are provided as input to the second module along with the phonetic segmentation. For any particular 3D shape, tongue and teeth information is generated by rotating the lower jaw based on a few skin points on the jaw and by animating a rigid 3D tongue through keyframe interpolation.
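The landmark-based 3D-MM fitting in the first module can be illustrated with a minimal sketch, not the authors' implementation. Assuming a PCA-based morphable model and a known scaled-orthographic camera, the shape coefficients that best explain the detected 2D landmarks have a regularised least-squares solution; all names and array shapes below (`mean_shape`, `basis`, etc.) are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): estimating 3D-MM shape
# coefficients from detected 2D facial landmarks, assuming a PCA-based
# morphable model and a known scaled-orthographic camera.
import numpy as np

def fit_shape_coeffs(landmarks_2d, mean_shape, basis, scale=1.0, reg=1e-2):
    """Least-squares fit of shape coefficients so that the projected
    model landmarks match the detected 2D landmarks.

    landmarks_2d : (L, 2) detected image landmarks
    mean_shape   : (L, 3) 3D positions of the model's landmark vertices
    basis        : (L, 3, K) PCA basis restricted to the landmark vertices
    """
    L, _, K = basis.shape
    # Scaled-orthographic projection keeps only the x and y coordinates.
    P = np.array([[scale, 0.0, 0.0],
                  [0.0, scale, 0.0]])                       # (2, 3)
    # Project each basis vector: (L, 2, K) -> stacked system (2L, K).
    A = np.einsum('ij,ljk->lik', P, basis).reshape(2 * L, K)
    # Residual between detected landmarks and the projected mean shape.
    b = (landmarks_2d - mean_shape @ P.T).reshape(2 * L)
    # Tikhonov-regularised normal equations keep coefficients plausible.
    coeffs = np.linalg.solve(A.T @ A + reg * np.eye(K), A.T @ b)
    return coeffs
```

In a full pipeline one would alternate this step with a pose (rotation/scale) update, which is consistent with the iterative estimation the abstract describes.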
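The second module's rigid-tongue animation can likewise be sketched as keyframe interpolation over the phonetic segmentation. The segment timings, per-phoneme keyframes, and pose parametrisation below are hypothetical placeholders, not the paper's data.

```python
# Illustrative sketch (not the paper's implementation): animating a rigid
# 3D tongue by linearly interpolating per-phoneme keyframe poses over the
# phonetic segmentation. Each keyframe pose is assumed to be reached at
# the midpoint of its phone's segment.
import numpy as np

def tongue_pose_at(t, segments, keyframes):
    """Return the interpolated tongue pose vector at time t.

    segments  : list of (start, end, phoneme) covering the utterance
    keyframes : dict mapping phoneme -> pose vector (e.g. a translation
                plus a jaw-rotation angle)
    """
    mids = [((s + e) / 2.0, keyframes[p]) for s, e, p in segments]
    if t <= mids[0][0]:
        return np.asarray(mids[0][1], dtype=float)
    if t >= mids[-1][0]:
        return np.asarray(mids[-1][1], dtype=float)
    for (t0, p0), (t1, p1) in zip(mids, mids[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            return (1 - w) * np.asarray(p0, float) + w * np.asarray(p1, float)
```

Linear interpolation is the simplest choice here; smoother splines or a coarticulation model (e.g. Cohen-Massaro dominance functions) could be substituted without changing the interface.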


Published in
      ICVGIP '14: Proceedings of the 2014 Indian Conference on Computer Vision, Graphics and Image Processing
      December 2014
      692 pages
      ISBN: 9781450330619
      DOI: 10.1145/2683483

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall acceptance rate: 95 of 286 submissions (33%)
