DOI: 10.1145/2993148.2993172
short-paper

Active speaker detection with audio-visual co-training

Published: 31 October 2016

ABSTRACT

In this work, we show how to co-train a classifier for active speaker detection using audio-visual data. First, audio Voice Activity Detection (VAD) is used to train a personalized video-based active speaker classifier in a weakly supervised fashion. The video classifier is in turn used to train a voice model for each person, and the individual voice models are then used to detect active speakers. There is no manual supervision: audio weakly supervises video classification, and the co-training loop is completed by using the trained video classifier to supervise the training of a personalized audio voice classifier.
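The loop described in the abstract can be sketched very schematically on synthetic data, with a toy energy-threshold VAD and a nearest-centroid classifier standing in for the paper's actual models (the real system uses a proper VAD front end, video-based person classifiers, and per-person voice models; every function, threshold, and variable name below is an illustrative assumption, not the authors' implementation):

```python
import numpy as np

def vad(audio_frames, energy_thresh):
    # Toy voice activity detector: label a frame "speech" when its mean
    # energy exceeds a threshold (stand-in for a real VAD front end).
    return (audio_frames ** 2).mean(axis=1) > energy_thresh

def train_centroid_model(features, labels):
    # Minimal stand-in for a trained classifier: one centroid per class.
    return features[labels].mean(axis=0), features[~labels].mean(axis=0)

def predict(model, features):
    # Assign each frame to the nearer class centroid.
    c_pos, c_neg = model
    return (np.linalg.norm(features - c_pos, axis=1)
            < np.linalg.norm(features - c_neg, axis=1))

rng = np.random.default_rng(0)
n = 200
speaking = rng.random(n) > 0.5                                   # hidden ground truth
audio = rng.normal(0.0, 1.0, (n, 16)) + speaking[:, None]        # louder when speaking
video = rng.normal(0.0, 1.0, (n, 8)) + 2.0 * speaking[:, None]   # visual motion cue

# Step 1: audio VAD provides weak labels for the video classifier.
weak_labels = vad(audio, energy_thresh=1.5)
video_model = train_centroid_model(video, weak_labels)

# Step 2: video predictions supervise a personalized voice model.
video_pred = predict(video_model, video)
voice_model = train_centroid_model(audio, video_pred)

# Step 3: the voice model detects the active speaker from audio alone.
audio_pred = predict(voice_model, audio)
agreement = float((audio_pred == speaking).mean())
print(f"agreement with ground truth: {agreement:.2f}")
```

The point of the sketch is the supervision flow, not the models: no ground-truth labels are ever used, yet each modality bootstraps a classifier for the other.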


Supplemental Material

p312-chakravarty-s.mp4 (MP4, 29.3 MB)


Published in

ICMI '16: Proceedings of the 18th ACM International Conference on Multimodal Interaction
October 2016, 605 pages
ISBN: 9781450345569
DOI: 10.1145/2993148
Copyright © 2016 ACM


          Publisher

          Association for Computing Machinery

          New York, NY, United States



Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions, 42%
