ABSTRACT
Person discovery in the absence of prior identity knowledge requires accurate association of visual and auditory cues. In broadcast data, multimodal analysis faces additional challenges, such as narrated voices over muted scenes or dubbing into different languages. To address these challenges, we define and analyze the problem of dubbing detection in broadcast data, which has not been explored before. We propose a method to represent the temporal relationship between the auditory and visual streams. The method combines canonical correlation analysis (CCA), which learns a joint multimodal space, with long short-term memory (LSTM) networks, which model cross-modality temporal dependencies. Our contributions also include a newly acquired dataset of face-speech segments from TV data, which we have made publicly available. The proposed method achieves promising performance on this real-world dataset compared to several baselines.
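To make the two-stage pipeline concrete, the following is a minimal sketch, not the authors' implementation: it uses scikit-learn's CCA and a PyTorch LSTM, and all feature dimensions, the synthetic stand-in data, and the `DubbingDetector` class are illustrative assumptions.

```python
# Minimal sketch of the described pipeline (assumed details, not the paper's code):
# CCA projects per-frame visual and audio features into a joint space, then an
# LSTM over the joint sequence scores whether a face-speech segment is dubbed.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cross_decomposition import CCA

# Hypothetical dimensions: per-frame visual (e.g., lip motion) and audio
# (e.g., MFCC) features over a segment of T aligned frames.
T, D_VIS, D_AUD, D_JOINT = 50, 40, 13, 8

# Synthetic stand-ins for one aligned face/speech feature sequence.
rng = np.random.default_rng(0)
vis_seq = rng.standard_normal((T, D_VIS))
aud_seq = rng.standard_normal((T, D_AUD))

# Stage 1: CCA learns maximally correlated projections of the two modalities,
# yielding a joint multimodal space. (Fit here on a single segment for
# illustration; in practice it would be fit over many training segments.)
cca = CCA(n_components=D_JOINT)
cca.fit(vis_seq, aud_seq)
vis_c, aud_c = cca.transform(vis_seq, aud_seq)

# Stage 2: an LSTM models temporal dependencies across the projected streams;
# its final hidden state feeds a binary dubbed-vs-original classifier.
class DubbingDetector(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):              # x: (batch, T, in_dim)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])      # one logit per segment

joint = np.concatenate([vis_c, aud_c], axis=1)      # (T, 2 * D_JOINT)
x = torch.tensor(joint, dtype=torch.float32)[None]  # add batch dimension
logit = DubbingDetector(in_dim=2 * D_JOINT)(x)
print(torch.sigmoid(logit))  # probability the segment is dubbed (untrained)
```

In this sketch the joint features of both streams are concatenated per frame before the LSTM; how the paper fuses the CCA projections is one of the design choices this example only approximates.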