DOI: 10.1145/2939672.2939812
Research article · Public Access

Deep Visual-Semantic Hashing for Cross-Modal Retrieval

Published: 13 August 2016

ABSTRACT

Due to its storage and retrieval efficiency, hashing has been widely applied to approximate nearest neighbor search for large-scale multimedia retrieval. Cross-modal hashing, which enables efficient retrieval of images in response to text queries and vice versa, has received increasing attention recently. Most existing work on cross-modal hashing does not capture the spatial dependencies of images or the temporal dynamics of text sentences, which are needed to learn powerful feature representations and cross-modal embeddings that mitigate the heterogeneity of different modalities. This paper presents a new Deep Visual-Semantic Hashing (DVSH) model that generates compact hash codes of images and sentences in an end-to-end deep learning architecture, capturing the intrinsic cross-modal correspondences between visual data and natural language. DVSH is a hybrid deep architecture that comprises a visual-semantic fusion network for learning a joint embedding space of images and text sentences, and two modality-specific hashing networks for learning the hash functions that generate compact binary codes. Our architecture effectively unifies joint multimodal embedding and cross-modal hashing; it is based on a novel combination of Convolutional Neural Networks over images, Recurrent Neural Networks over sentences, and a structured max-margin objective that integrates these components to enable learning of similarity-preserving, high-quality hash codes. Extensive empirical evidence shows that our DVSH approach yields state-of-the-art results in cross-modal retrieval experiments on image-sentence datasets, i.e., the standard IAPR TC-12 and the large-scale Microsoft COCO.
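Since the abstract describes the architecture only at a high level, the following is a minimal, hypothetical PyTorch sketch of the overall idea, not the authors' implementation: a CNN encodes images, an LSTM encodes sentences, each modality-specific network emits tanh-relaxed binary codes, and a similarity-preserving max-margin loss aligns the two modalities. The toy CNN, all layer sizes, the vocabulary size, and the exact pairwise loss form are illustrative assumptions, and the visual-semantic fusion network is collapsed into the shared code space for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CODE_BITS = 32  # hash code length in bits (illustrative choice)

class ImageHashNet(nn.Module):
    """Toy CNN image encoder plus hashing layer (stand-in for a deep CNN)."""
    def __init__(self, bits=CODE_BITS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.hash = nn.Linear(32 * 4 * 4, bits)

    def forward(self, images):                  # images: (B, 3, H, W)
        h = self.features(images).flatten(1)
        return torch.tanh(self.hash(h))         # relaxed codes in (-1, 1)

class SentenceHashNet(nn.Module):
    """Word embedding + LSTM sentence encoder plus hashing layer."""
    def __init__(self, vocab=10000, emb=128, hidden=256, bits=CODE_BITS):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.hash = nn.Linear(hidden, bits)

    def forward(self, tokens):                  # tokens: (B, T) word ids
        _, (h_n, _) = self.lstm(self.embed(tokens))
        return torch.tanh(self.hash(h_n[-1]))   # last hidden state -> code

def max_margin_loss(img_codes, txt_codes, sim, margin=1.0):
    """Similarity-preserving max-margin loss over all cross-modal pairs.

    sim[i, j] = 1 if image i and sentence j describe the same content,
    else 0 (assumption). Matched pairs are pulled together; mismatched
    pairs are pushed beyond the margin.
    """
    dist = torch.cdist(img_codes, txt_codes)    # pairwise Euclidean distances
    return (sim * dist + (1 - sim) * F.relu(margin - dist)).mean()

if __name__ == "__main__":
    imgs = torch.randn(4, 3, 64, 64)
    sents = torch.randint(0, 10000, (4, 12))
    sim = torch.eye(4)                          # i-th image matches i-th sentence
    loss = max_margin_loss(ImageHashNet()(imgs), SentenceHashNet()(sents), sim)
    loss.backward()
    print(float(loss))
```

At retrieval time, one would binarize the relaxed codes with sign(·) and rank items of the other modality by Hamming distance, which is what makes the search efficient at scale.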


Published in

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016, 2176 pages
ISBN: 9781450342322
DOI: 10.1145/2939672

        Copyright © 2016 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States



Acceptance Rates

KDD '16 paper acceptance rate: 66 of 1,115 submissions, 6%. Overall acceptance rate: 1,133 of 8,635 submissions, 13%.
