ABSTRACT
In this work, we show how to co-train a classifier for active speaker detection using audio-visual data. First, audio Voice Activity Detection (VAD) is used to train a personalized video-based active speaker classifier in a weakly supervised fashion. The video classifier is in turn used to train a voice model for each person, and the individual voice models are then used to detect active speakers. No manual supervision is required: audio weakly supervises the video classifier, and the co-training loop is closed by using the trained video classifier to supervise the training of a personalized audio voice classifier.
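The cross-modal co-training loop described above can be sketched in a few lines. The following is a minimal, self-contained illustration on synthetic one-dimensional features, not the authors' pipeline: `vad`, `video_feat`, and `voice_feat` are hypothetical stand-ins for a VAD output, a mouth-motion score, and a voice embedding, and the "classifiers" are simple midpoint thresholds between class means.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 2000

# Ground truth (unknown to the method): is the person speaking in each frame?
speaking = rng.random(n_frames) < 0.5

# Noisy audio VAD output: speech/non-speech, ~10% of frames mislabeled.
vad = speaking ^ (rng.random(n_frames) < 0.1)

# Hypothetical video feature (e.g., mouth-motion energy): higher when speaking.
video_feat = speaking.astype(float) + rng.normal(0.0, 0.4, n_frames)

# Hypothetical 1-D voice feature: person-specific distribution when speaking.
voice_feat = np.where(speaking,
                      rng.normal(2.0, 0.5, n_frames),
                      rng.normal(0.0, 0.5, n_frames))

# Step 1: VAD weakly supervises the video classifier
# (threshold at the midpoint between VAD-labeled class means).
thr_video = (video_feat[vad].mean() + video_feat[~vad].mean()) / 2
video_pred = video_feat > thr_video

# Step 2: the trained video classifier supervises a personalized voice model,
# closing the co-training loop without any manual labels.
thr_voice = (voice_feat[video_pred].mean() + voice_feat[~video_pred].mean()) / 2
audio_pred = voice_feat > thr_voice

acc = (audio_pred == speaking).mean()
```

Even with a noisy VAD signal, the audio-side classifier trained purely from the video classifier's predictions recovers the speaking/non-speaking distinction well above chance, which is the essence of the co-training argument.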
Index Terms
- Active speaker detection with audio-visual co-training