Comprehensive Distance-Preserving Autoencoders for Cross-Modal Retrieval

ABSTRACT
In this paper, we propose a novel method with comprehensive distance-preserving autoencoders (CDPAE) to address the problem of unsupervised cross-modal retrieval. Previous unsupervised methods rely primarily on pairwise distances between representations that co-occur in different media spaces and belong to the same objects. Beyond these pairwise distances, the CDPAE also considers heterogeneous distances between cross-media representations of different objects, as well as homogeneous distances between representations of different objects within a single media space. The CDPAE consists of four components. First, denoising autoencoders are used to retain the information in the representations while reducing the negative influence of redundant noise. Second, a comprehensive distance-preserving common space is proposed to explore the correlations among different representations; it aims to preserve the respective distances between the representations within the common space so that they are consistent with the distances in their original media spaces. Third, a novel joint loss function is defined to simultaneously calculate the reconstruction loss of the denoising autoencoders and the correlation loss of the comprehensive distance-preserving common space. Finally, an unsupervised cross-modal similarity measurement is proposed to further improve retrieval performance; it calculates the marginal probability of two media objects based on a kNN classifier. The CDPAE is tested on four public datasets with two cross-modal retrieval tasks: "query images by texts" and "query texts by images". Compared with eight state-of-the-art cross-modal retrieval methods, the experimental results demonstrate that the CDPAE outperforms all the unsupervised methods and performs competitively with the supervised methods.
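To make the distance-preserving idea concrete, the sketch below implements one plausible version of the correlation loss described above: a pairwise term (co-occurring image/text embeddings of the same object should coincide), a homogeneous term (within-modality common-space distances should match the original-space distances), and a heterogeneous term (cross-modal distances between different objects). This is a minimal illustration, not the paper's exact formulation: the `cdpae_style_correlation_loss` name is invented here, all distance matrices are normalized to [0, 1] for comparability, and the heterogeneous target is assumed to be the mean of the two within-modality original distances.

```python
import numpy as np

def pdist(x):
    """Pairwise Euclidean distance matrix, normalized to [0, 1]."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d / (d.max() + 1e-12)

def cdpae_style_correlation_loss(img_orig, txt_orig, img_emb, txt_emb):
    """Illustrative correlation loss over a distance-preserving common space.

    img_orig, txt_orig: (n, d_img), (n, d_txt) original features; row i of
    each matrix describes the same object.
    img_emb, txt_emb:   (n, k) common-space embeddings of those objects.
    """
    n = img_emb.shape[0]
    di, dt = pdist(img_orig), pdist(txt_orig)  # original-space distances
    ei, et = pdist(img_emb), pdist(txt_emb)    # common-space, per modality
    # Cross-modal distances in the common space, normalized like pdist.
    ex = np.linalg.norm(img_emb[:, None, :] - txt_emb[None, :, :], axis=-1)
    ex = ex / (ex.max() + 1e-12)

    # Pairwise term: same object across modalities should map to one point.
    loss = np.sum(np.diag(ex) ** 2)
    # Homogeneous terms: preserve within-modality distance structure.
    loss += np.sum((ei - di) ** 2) + np.sum((et - dt) ** 2)
    # Heterogeneous term: cross-modal distances between different objects,
    # with an assumed target of the mean of the two original distances.
    off = ~np.eye(n, dtype=bool)
    target = (di + dt) / 2.0
    loss += np.sum((ex[off] - target[off]) ** 2)
    return loss / n
```

In a full model this term would be summed with the denoising autoencoders' reconstruction losses to form the joint objective, and minimized over the encoder parameters that produce `img_emb` and `txt_emb`.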