ABSTRACT
Due to its storage and retrieval efficiency, hashing has been widely applied to approximate nearest neighbor search for large-scale multimedia retrieval. Cross-modal hashing, which enables efficient retrieval of images in response to text queries and vice versa, has received increasing attention recently. Most existing work on cross-modal hashing does not capture the spatial dependencies of images or the temporal dynamics of text sentences, which are needed to learn powerful feature representations and cross-modal embeddings that mitigate the heterogeneity of different modalities. This paper presents a new Deep Visual-Semantic Hashing (DVSH) model that generates compact hash codes of images and sentences in an end-to-end deep learning architecture, capturing the intrinsic cross-modal correspondences between visual data and natural language. DVSH is a hybrid deep architecture consisting of a visual-semantic fusion network that learns a joint embedding space of images and text sentences, and two modality-specific hashing networks that learn the hash functions generating compact binary codes. Our architecture effectively unifies joint multimodal embedding and cross-modal hashing through a novel combination of Convolutional Neural Networks over images, Recurrent Neural Networks over sentences, and a structured max-margin objective that ties these components together to enable learning of similarity-preserving, high-quality hash codes. Extensive empirical evidence shows that our DVSH approach yields state-of-the-art results in cross-modal retrieval experiments on image-sentence datasets, i.e., the standard IAPR TC-12 and the large-scale Microsoft COCO.
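To make the described architecture concrete, below is a minimal PyTorch sketch of a DVSH-style pipeline: a hashing network over CNN image features (assumed precomputed here rather than trained end-to-end as in the paper), an LSTM hashing network over sentences, a tanh relaxation that approximates binary codes, and a standard bidirectional max-margin ranking loss standing in for the paper's structured objective. All layer sizes, the margin value, and the module names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a DVSH-style model (not the paper's exact configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageHashNet(nn.Module):
    """CNN image features -> continuous codes in (-1, 1) via a tanh relaxation."""
    def __init__(self, feat_dim=4096, code_bits=32):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, code_bits), nn.Tanh())

    def forward(self, feats):
        return self.fc(feats)

class SentenceHashNet(nn.Module):
    """Word embeddings -> LSTM -> continuous codes in (-1, 1)."""
    def __init__(self, vocab_size=10000, emb_dim=300, hid_dim=512, code_bits=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hid_dim, code_bits), nn.Tanh())

    def forward(self, tokens):
        _, (h, _) = self.lstm(self.emb(tokens))  # final hidden state summarizes the sentence
        return self.fc(h[-1])

def structured_max_margin(img_codes, txt_codes, margin=0.5):
    """Bidirectional ranking hinge: row i of each batch is a matched
    image/sentence pair; all other rows act as negatives."""
    s = F.normalize(img_codes, dim=1) @ F.normalize(txt_codes, dim=1).t()
    pos = s.diag().unsqueeze(1)                               # matched-pair scores
    mask = 1.0 - torch.eye(s.size(0), device=s.device)        # exclude the positives
    loss_i2t = (F.relu(margin + s - pos) * mask).mean()       # image -> text ranking
    loss_t2i = (F.relu(margin + s.t() - pos) * mask).mean()   # text -> image ranking
    return loss_i2t + loss_t2i

if __name__ == "__main__":
    img_net, txt_net = ImageHashNet(), SentenceHashNet()
    feats = torch.randn(8, 4096)               # e.g. precomputed CNN fc7 features
    tokens = torch.randint(0, 10000, (8, 12))  # padded word-id sequences
    loss = structured_max_margin(img_net(feats), txt_net(tokens))
    binary = torch.sign(img_net(feats))        # binarize codes at retrieval time
```

At retrieval time the continuous codes would be binarized with `torch.sign` and compared by Hamming distance, which is what gives hashing methods their storage and query efficiency.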