DOI: 10.1145/1877972.1877996
poster

Combining dynamic texture and structural features for speaker identification

Published: 29 October 2010

ABSTRACT

Visual information from captured video is important for speaker identification under noisy acoustic conditions, such as background noise or cross-talk among speakers. In this paper, we propose local spatiotemporal descriptors to represent and recognize speakers based solely on visual features. Spatiotemporal dynamic texture features, computed as local binary patterns extracted from localized mouth regions, describe the motion information in utterances and capture spatial and temporal transition characteristics. Structural edge map features are extracted from the image frames to represent appearance characteristics. Combining the dynamic texture and structural features takes both motion and appearance into account, providing a description of the spatiotemporal development of speech. In experiments on the BANCA and XM2VTS databases, the proposed method obtained promising recognition results compared to other features.
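As a rough illustration of the feature combination described above, the sketch below (Python, not the authors' implementation) substitutes per-frame uniform LBP histograms for the full spatiotemporal dynamic texture descriptor and Sobel edge-magnitude histograms for the structural edge map, averages them over an utterance, and matches speakers with a simple nearest-neighbour rule. It assumes mouth regions have already been localized and cropped; the helper names, parameters, and classifier choice are illustrative assumptions rather than part of the paper.

    # Minimal sketch, not the authors' pipeline: per-frame uniform LBP histograms
    # stand in for the spatiotemporal dynamic texture descriptor, and Sobel
    # edge-magnitude histograms stand in for the structural edge map. Descriptors
    # are averaged over an utterance and matched with a nearest-neighbour rule.
    import numpy as np
    from skimage.feature import local_binary_pattern
    from skimage.filters import sobel


    def lbp_histogram(frame, points=8, radius=1):
        # Uniform LBP codes take values 0..points+1, hence points+2 histogram bins.
        codes = local_binary_pattern(frame, points, radius, method="uniform")
        n_bins = points + 2
        hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
        return hist


    def edge_histogram(frame, n_bins=16):
        # Histogram of Sobel gradient magnitudes as a crude structural descriptor.
        edges = sobel(frame.astype(float))
        hist, _ = np.histogram(edges, bins=n_bins, range=(0, edges.max() + 1e-8),
                               density=True)
        return hist


    def utterance_descriptor(frames):
        # Average texture and structure histograms over all mouth-region frames,
        # then concatenate them into one combined feature vector.
        texture = np.mean([lbp_histogram(f) for f in frames], axis=0)
        structure = np.mean([edge_histogram(f) for f in frames], axis=0)
        return np.concatenate([texture, structure])


    def identify(probe_frames, gallery):
        # gallery: speaker id -> descriptor computed from an enrollment utterance.
        probe = utterance_descriptor(probe_frames)
        return min(gallery, key=lambda spk: np.linalg.norm(probe - gallery[spk]))


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Synthetic stand-ins for cropped 8-bit grayscale mouth-region sequences.
        enroll = {spk: [rng.integers(0, 256, (40, 60), dtype=np.uint8)
                        for _ in range(30)] for spk in ("A", "B")}
        gallery = {spk: utterance_descriptor(fr) for spk, fr in enroll.items()}
        probe = [rng.integers(0, 256, (40, 60), dtype=np.uint8) for _ in range(30)]
        print("Identified speaker:", identify(probe, gallery))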


Published in

MiFor '10: Proceedings of the 2nd ACM workshop on Multimedia in forensics, security and intelligence
October 2010
134 pages
ISBN: 9781450301572
DOI: 10.1145/1877972

Copyright © 2010 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

