Automatic lipreading to enhance speech recognition (speech reading)
Publisher:
  • University of Illinois at Urbana-Champaign
  • Champaign, IL
  • United States
Order Number: AAI8502266
Pages: 261
Abstract

Automatic recognition of the acoustic speech signal alone is inaccurate and computationally expensive. Additional sources of speech information, such as lipreading (or speechreading), should enhance automatic speech recognition, just as lipreading is used by humans to enhance speech recognition when the acoustic signal is degraded. This paper describes an automatic lipreading system that has been developed. A commercial device performs the acoustic speech recognition independently of the lipreading system. The recognition domain is restricted to isolated utterances and speaker-dependent recognition. The speaker faces a solid-state camera which sends digitized video to a minicomputer system with custom video processing hardware. The video data is sampled during an utterance and then reduced to a template consisting of visual speech parameter time sequences. The distances between the incoming template and all of the trained templates for each utterance in the vocabulary are computed, and a visual recognition candidate is obtained. The combination of the acoustic and visual recognition candidates is shown to yield a final recognition accuracy which greatly exceeds the acoustic recognition accuracy alone. Practical considerations and the possible enhancement of speaker-independent and continuous speech recognition systems are also discussed.
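The template-matching step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the dissertation's actual implementation: it assumes each utterance template is a (time × parameter) array of visual speech parameters (e.g. mouth height, width, and area per frame), normalizes duration by linear resampling, and picks the vocabulary word whose trained template is nearest in Euclidean distance. All function and variable names here are hypothetical.

```python
import numpy as np

def resample(template: np.ndarray, length: int = 20) -> np.ndarray:
    """Linearly resample a (time, params) template to a fixed frame count,
    so utterances of different durations can be compared directly."""
    t_old = np.linspace(0.0, 1.0, len(template))
    t_new = np.linspace(0.0, 1.0, length)
    return np.stack(
        [np.interp(t_new, t_old, template[:, p]) for p in range(template.shape[1])],
        axis=1,
    )

def template_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two time-normalized templates."""
    return float(np.linalg.norm(resample(a) - resample(b)))

def visual_candidate(incoming: np.ndarray, trained: dict[str, np.ndarray]) -> str:
    """Return the vocabulary word whose trained template is closest
    to the incoming utterance template."""
    return min(trained, key=lambda word: template_distance(incoming, trained[word]))
```

In a combined system, the visual candidate (or the full ranked distance list) would then be fused with the acoustic recognizer's candidate list to produce the final decision; the abstract reports that this fusion greatly exceeds acoustic-only accuracy.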

Cited By

  1. Li Y, Ren J, Wang Y, Wang G, Li X and Liu H (2023). Audio–visual keyword transformer for unconstrained sentence‐level keyword spotting, CAAI Transactions on Intelligence Technology, 9:1, (142-152), Online publication date: 13-Feb-2024.
  2. Su Z, Zhang X, Kimura N and Rekimoto J Gaze+Lip: Rapid, Precise and Expressive Interactions Combining Gaze Input and Silent Speech Commands for Hands-free Smart TV Control ACM Symposium on Eye Tracking Research and Applications, (1-6)
  3. Shang D, Zhang X, Xu X and Peng X Speaker Recognition Based on Lip-reading Proceedings of the 2018 VII International Conference on Network, Communication and Computing, (247-251)
  4. Jain A and Rathna G Lip Reading using Simple Dynamic Features and a Novel ROI for Feature Extraction Proceedings of the 2018 International Conference on Signal Processing and Machine Learning, (73-77)
  5. Wand M, Schmidhuber J and Vu N Investigations on End-to-End Audiovisual Fusion 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (3041-3045)
  6. Harte N and Gillen E (2015). TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech, IEEE Transactions on Multimedia, 17:5, (603-615), Online publication date: 1-May-2015.
  7. Zhang Y, Liu Q, Li Y and Li Z Intelligent wheelchair multi-modal human-machine interfaces in lip contour extraction based on PMM Proceedings of the 2009 international conference on Robotics and biomimetics, (2108-2113)
  8. Li M and Cheung Y (2009). Automatic lip localization under face illumination with shadow consideration, Signal Processing, 89:12, (2425-2434), Online publication date: 1-Dec-2009.
  9. Patel P and Ouazzane K Comparison of fixed and variable weight approaches for viseme classification Proceedings of the Ninth IASTED International Conference on Signal and Image Processing, (119-122)
  10. Hong X, Yao H, Liu Q and Chen R An information acquiring channel - lip movement Proceedings of the First international conference on Affective Computing and Intelligent Interaction, (232-238)
  11. Dong L, Foo S and Lian Y (2005). A two-channel training algorithm for hidden Markov model and its application to lip reading, EURASIP Journal on Advances in Signal Processing, 2005, (1382-1399), Online publication date: 1-Jan-2005.
  12. Oviatt S (2003). Advances in Robust Multimodal Interface Design, IEEE Computer Graphics and Applications, 23:5, (62-68), Online publication date: 1-Sep-2003.
  13. Matthews I, Cootes T, Bangham J, Cox S and Harvey R (2002). Extraction of Visual Features for Lipreading, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:2, (198-213), Online publication date: 1-Feb-2002.
  14. Oviatt S Multimodal interfaces The human-computer interaction handbook, (286-304)
  15. Zhang X, Broun C, Mersereau R and Clements M (2002). Automatic speechreading with applications to human-computer interfaces, EURASIP Journal on Advances in Signal Processing, 2002:1, (1228-1247), Online publication date: 1-Jan-2002.
  16. Petajan E, Bischoff B, Bodoff D and Brooke N An improved automatic lipreading system to enhance speech recognition Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, (19-25)
  17. Nishida S Speech recognition enhancement by lip information Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, (198-204)
  18. Nishida S (1986). Speech recognition enhancement by lip information, ACM SIGCHI Bulletin, 17:4, (198-204), Online publication date: 1-Apr-1986.
Contributors
  • Nokia Bell Labs