ABSTRACT
Person discovery in the absence of prior identity knowledge requires accurate association of visual and auditory cues. In broadcast data, multimodal analysis faces additional challenges, such as narrated voices over muted scenes or dubbing into different languages. To address these challenges, we define and analyze the problem of dubbing detection in broadcast data, which has not been explored before. We propose a method to represent the temporal relationship between the auditory and visual streams. The method combines canonical correlation analysis (CCA), which learns a joint multimodal space, with long short-term memory (LSTM) networks, which model cross-modality temporal dependencies. Our contributions also include a newly acquired dataset of face-speech segments from TV data, which we have made publicly available. The proposed method achieves promising performance on this real-world dataset compared to several baselines.
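To make the two-stage pipeline concrete, the following is a minimal sketch, not the authors' implementation: it uses scikit-learn's CCA and a PyTorch LSTM, and all feature dimensions, the synthetic stand-in data, and the `DubbingDetector` class are illustrative assumptions.

```python
# Minimal sketch of the described pipeline (assumed details, not the paper's code):
# CCA projects per-frame visual and audio features into a joint space, then an
# LSTM over the joint sequence scores whether a face-speech segment is dubbed.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cross_decomposition import CCA

# Hypothetical dimensions: per-frame visual (e.g., lip motion) and audio
# (e.g., MFCC) features over a segment of T aligned frames.
T, D_VIS, D_AUD, D_JOINT = 50, 40, 13, 8

# Synthetic stand-ins for one aligned face/speech feature sequence.
rng = np.random.default_rng(0)
vis_seq = rng.standard_normal((T, D_VIS))
aud_seq = rng.standard_normal((T, D_AUD))

# Stage 1: CCA learns maximally correlated projections of the two modalities,
# yielding a joint multimodal space. (Fit here on a single segment for
# illustration; in practice it would be fit over many training segments.)
cca = CCA(n_components=D_JOINT)
cca.fit(vis_seq, aud_seq)
vis_c, aud_c = cca.transform(vis_seq, aud_seq)

# Stage 2: an LSTM models temporal dependencies across the projected streams;
# its final hidden state feeds a binary dubbed-vs-original classifier.
class DubbingDetector(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):              # x: (batch, T, in_dim)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])      # one logit per segment

joint = np.concatenate([vis_c, aud_c], axis=1)      # (T, 2 * D_JOINT)
x = torch.tensor(joint, dtype=torch.float32)[None]  # add batch dimension
logit = DubbingDetector(in_dim=2 * D_JOINT)(x)
print(torch.sigmoid(logit))  # probability the segment is dubbed (untrained)
```

In this sketch the joint features of both streams are concatenated per frame before the LSTM; how the paper fuses the CCA projections is one of the design choices this example only approximates.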