ABSTRACT
Visual information from captured video is valuable for speaker identification in noisy conditions involving background noise or cross-talk among speakers. In this paper, we propose local spatiotemporal descriptors to represent and recognize speakers based solely on visual features. Spatiotemporal dynamic-texture features based on local binary patterns, extracted from localized mouth regions, describe the motion information in utterances and capture spatial and temporal transition characteristics. Structural edge-map features are extracted from the image frames to represent appearance characteristics. Combining dynamic-texture and structural features takes both motion and appearance into account, providing a description of the spatiotemporal development of speech. In our experiments on the BANCA and XM2VTS databases, the proposed method obtained promising recognition results compared with other features.
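To make the descriptor idea concrete, the sketch below computes, for a single mouth-region frame, a plain radius-1 local binary pattern histogram (texture cue) concatenated with a gradient-magnitude histogram (a crude stand-in for the structural edge-map cue). This is an illustrative assumption-laden sketch, not the paper's exact LBP-TOP/edge-map pipeline; the helper names `lbp_image`, `edge_histogram`, and `frame_descriptor` are hypothetical, and the full method would additionally aggregate such features over time.

```python
import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour local binary pattern (radius 1) for a 2-D uint8 array.
    Each interior pixel is coded by thresholding its 8 neighbours against the
    centre value and packing the resulting bits into one byte."""
    c = gray[1:-1, 1:-1]
    # neighbour offsets, clockwise from the top-left pixel
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(shifts):
        nb = gray[1 + dy:gray.shape[0] - 1 + dy,
                  1 + dx:gray.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.uint8) << bit
    return code

def edge_histogram(gray, bins=16):
    """Gradient-magnitude histogram as a simple appearance descriptor
    (a stand-in for the structural edge-map features)."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(mag, bins=bins, range=(0.0, mag.max() + 1e-9))
    return hist / hist.sum()

def frame_descriptor(gray):
    """Concatenate an LBP histogram (texture cue) with an edge histogram
    (appearance cue) for one mouth-region frame."""
    codes = lbp_image(gray)
    lbp_hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    lbp_hist = lbp_hist / lbp_hist.sum()
    return np.concatenate([lbp_hist, edge_histogram(gray)])

# Demo on a random 32x48 "mouth region" patch.
rng = np.random.default_rng(0)
mouth = rng.integers(0, 256, size=(32, 48), dtype=np.uint8)
desc = frame_descriptor(mouth)
print(desc.shape)  # 256 LBP bins + 16 edge bins = (272,)
```

In a spatiotemporal extension such as LBP-TOP, similar histograms would be computed on three orthogonal planes of the frame volume and concatenated per space-time block, so that both motion and appearance enter the final descriptor.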