Comprehensive Distance-Preserving Autoencoders for Cross-Modal Retrieval

Research article · Public Access
Published: 15 October 2018 · DOI: 10.1145/3240508.3240607

ABSTRACT

In this paper, we propose a novel method with comprehensive distance-preserving autoencoders (CDPAE) to address the problem of unsupervised cross-modal retrieval. Previous unsupervised methods rely primarily on pairwise distances between representations, extracted from different media spaces, that co-occur and belong to the same objects. Beyond these pairwise distances, the CDPAE also considers heterogeneous distances between representations extracted from different media spaces, as well as homogeneous distances between representations, extracted from single media spaces, that belong to different objects. The CDPAE consists of four components. First, denoising autoencoders are used to retain the information in the representations and to reduce the negative influence of redundant noise. Second, a comprehensive distance-preserving common space is proposed to explore the correlations among the different representations; it aims to preserve the distances between representations within the common space so that they are consistent with the distances in their original media spaces. Third, a novel joint loss function is defined to simultaneously calculate the reconstruction loss of the denoising autoencoders and the correlation loss of the comprehensive distance-preserving common space. Finally, an unsupervised cross-modal similarity measure is proposed to further improve the retrieval performance; it calculates, based on a kNN classifier, the marginal probability that two media objects match. The CDPAE is tested on four public datasets with two cross-modal retrieval tasks: "query images by texts" and "query texts by images". Experimental results comparing the CDPAE with eight state-of-the-art cross-modal retrieval methods demonstrate that it outperforms all the unsupervised methods and performs competitively with the supervised ones.
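As a concrete illustration of the objective described above, the following is a minimal sketch of a joint loss combining denoising-autoencoder reconstruction with distance preservation, written in PyTorch. Everything in it beyond what the abstract states is an assumption made for illustration: the layer sizes, the Gaussian corruption, the Euclidean distance matrices, the heterogeneous-distance target, and the weighting coefficient alpha. It is not the authors' implementation.

```python
# Illustrative sketch only: the architectures, noise scheme, and the
# heterogeneous-distance target below are assumptions, not taken from
# the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingAE(nn.Module):
    """A small denoising autoencoder for one modality (sizes assumed)."""
    def __init__(self, in_dim, code_dim, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, code_dim))
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, in_dim))

    def forward(self, x, noise_std=0.1):
        # Corrupt the input with Gaussian noise, then reconstruct the
        # clean input from the common-space code.
        code = self.encoder(x + noise_std * torch.randn_like(x))
        return code, self.decoder(code)

def dist(a, b):
    # Euclidean distance matrix between the rows of a and the rows of b;
    # the clamp keeps the gradient finite at zero distance.
    sq = (a.pow(2).sum(1, keepdim=True) - 2.0 * a @ b.t()
          + b.pow(2).sum(1).unsqueeze(0))
    return torch.sqrt(sq.clamp_min(1e-12))

def cdpae_loss(img, txt, img_ae, txt_ae, alpha=1.0):
    zi, img_rec = img_ae(img)   # image codes and reconstructions
    zt, txt_rec = txt_ae(txt)   # text codes and reconstructions
    # (1) Reconstruction loss of the two denoising autoencoders.
    recon = F.mse_loss(img_rec, img) + F.mse_loss(txt_rec, txt)
    # (2) Homogeneous terms: distances among the codes of one modality
    # should match distances in that modality's original feature space.
    homo = (F.mse_loss(dist(zi, zi), dist(img, img))
            + F.mse_loss(dist(zt, zt), dist(txt, txt)))
    # (3) Heterogeneous term: cross-modal code distances. The target here,
    # the mean of the two original distance matrices (whose zero diagonal
    # pulls co-occurring pairs together), is an assumption of this sketch.
    hetero = F.mse_loss(dist(zi, zt),
                        0.5 * (dist(img, img) + dist(txt, txt)))
    return recon + alpha * (homo + hetero)
```

With, say, 4096-dimensional image features and 300-dimensional text features (both hypothetical sizes), `cdpae_loss(torch.randn(32, 4096), torch.randn(32, 300), DenoisingAE(4096, 128), DenoisingAE(300, 128))` runs as written and can be minimized with any standard optimizer.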


Published in

MM '18: Proceedings of the 26th ACM International Conference on Multimedia
October 2018 · 2167 pages
ISBN: 9781450356657
DOI: 10.1145/3240508

        Copyright © 2018 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


        Acceptance Rates

MM '18 paper acceptance rate: 209 of 757 submissions (28%). Overall acceptance rate: 995 of 4,171 submissions (24%).
