ABSTRACT
Visual information from captured video is valuable for speaker identification in noisy conditions involving background noise or cross-talk among speakers. In this paper, we propose local spatiotemporal descriptors to represent and recognize speakers based solely on visual features. Spatiotemporal dynamic-texture features based on local binary patterns, extracted from localized mouth regions, describe the motion information in utterances and capture spatial and temporal transition characteristics. Structural edge-map features are extracted from the image frames to represent appearance characteristics. Combining dynamic-texture and structural features takes both motion and appearance into account, providing a description of the spatiotemporal development of speech. In our experiments on the BANCA and XM2VTS databases, the proposed method obtained promising recognition results compared with other features.
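To make the descriptor idea concrete, the sketch below computes, for a single mouth-region frame, a plain radius-1 local binary pattern histogram (texture cue) concatenated with a gradient-magnitude histogram (a crude stand-in for the structural edge-map cue). This is an illustrative assumption-laden sketch, not the paper's exact LBP-TOP/edge-map pipeline; the helper names `lbp_image`, `edge_histogram`, and `frame_descriptor` are hypothetical, and the full method would additionally aggregate such features over time.

```python
import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour local binary pattern (radius 1) for a 2-D uint8 array.
    Each interior pixel is coded by thresholding its 8 neighbours against the
    centre value and packing the resulting bits into one byte."""
    c = gray[1:-1, 1:-1]
    # neighbour offsets, clockwise from the top-left pixel
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(shifts):
        nb = gray[1 + dy:gray.shape[0] - 1 + dy,
                  1 + dx:gray.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.uint8) << bit
    return code

def edge_histogram(gray, bins=16):
    """Gradient-magnitude histogram as a simple appearance descriptor
    (a stand-in for the structural edge-map features)."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(mag, bins=bins, range=(0.0, mag.max() + 1e-9))
    return hist / hist.sum()

def frame_descriptor(gray):
    """Concatenate an LBP histogram (texture cue) with an edge histogram
    (appearance cue) for one mouth-region frame."""
    codes = lbp_image(gray)
    lbp_hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    lbp_hist = lbp_hist / lbp_hist.sum()
    return np.concatenate([lbp_hist, edge_histogram(gray)])

# Demo on a random 32x48 "mouth region" patch.
rng = np.random.default_rng(0)
mouth = rng.integers(0, 256, size=(32, 48), dtype=np.uint8)
desc = frame_descriptor(mouth)
print(desc.shape)  # 256 LBP bins + 16 edge bins = (272,)
```

In a spatiotemporal extension such as LBP-TOP, similar histograms would be computed on three orthogonal planes of the frame volume and concatenated per space-time block, so that both motion and appearance enter the final descriptor.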