DOI: 10.1145/2964284.2967211
Short paper

Learning Multimodal Temporal Representation for Dubbing Detection in Broadcast Media

Published: 01 October 2016

ABSTRACT

Person discovery in the absence of prior identity knowledge requires accurate association of visual and auditory cues. In broadcast data, multimodal analysis faces additional challenges due to voices narrated over muted scenes or dubbing into different languages. To address these challenges, we define and analyze the problem of dubbing detection in broadcast data, which has not been explored before. We propose a method to represent the temporal relationship between the auditory and visual streams. The method combines canonical correlation analysis, which learns a joint multimodal space, with long short-term memory (LSTM) networks, which model cross-modality temporal dependencies. Our contributions also include a newly acquired dataset of face-speech segments from TV data, which we have made publicly available. The proposed method achieves promising performance on this real-world dataset compared with several baselines.
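To make the described pipeline more concrete, the sketch below shows one plausible way to combine canonical correlation analysis with an LSTM classifier for dubbing detection. It is a minimal illustration only: the feature types, dimensionalities, layer sizes, and variable names are assumptions made for the example, not the configuration used in the paper.

```python
# Hypothetical sketch of a CCA + LSTM dubbing detector; all dimensions and
# hyperparameters below are illustrative assumptions, not the authors' setup.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cross_decomposition import CCA

# Toy stand-ins for per-frame audio features (e.g. MFCCs) and visual features
# (e.g. lip-motion descriptors), each shaped (num_frames, feature_dim).
rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((5000, 60))
visual_feats = rng.standard_normal((5000, 100))

# Step 1: canonical correlation analysis learns a joint multimodal space by
# projecting both streams onto maximally correlated components.
cca = CCA(n_components=20)
cca.fit(audio_feats, visual_feats)
audio_proj, visual_proj = cca.transform(audio_feats, visual_feats)

# Step 2: an LSTM reads the concatenated projected streams over time and
# classifies each face-speech segment as synchronous (original) or dubbed.
class DubbingLSTM(nn.Module):
    def __init__(self, input_dim=40, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)  # logits for {original, dubbed}

    def forward(self, x):                    # x: (batch, time, input_dim)
        _, (h_n, _) = self.lstm(x)           # final hidden state summarizes the segment
        return self.fc(h_n[-1])

# Example: score one 75-frame segment represented in the 2 x 20-dim joint space.
segment = np.concatenate([audio_proj[:75], visual_proj[:75]], axis=1)
model = DubbingLSTM()
logits = model(torch.tensor(segment, dtype=torch.float32).unsqueeze(0))
print(logits.shape)  # torch.Size([1, 2])
```

Training such a classifier would require segment-level synchrony labels, which is the kind of supervision the face-speech dataset introduced in the paper is intended to provide.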

Published in

MM '16: Proceedings of the 24th ACM International Conference on Multimedia
October 2016, 1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284
Copyright © 2016 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



Acceptance Rates

MM '16 paper acceptance rate: 52 of 237 submissions, 22%. Overall acceptance rate: 995 of 4,171 submissions, 24%.
