research-article

Scalable similarity search with optimized kernel hashing

Authors:
Junfeng He

Columbia University, New York, NY, USA

Columbia University, New York, NY, USA
View Profile

,
Wei Liu

Columbia University, New York, NY, USA

Columbia University, New York, NY, USA
View Profile

,
Shih-Fu Chang

Columbia University, New York, NY, USA

Columbia University, New York, NY, USA
View Profile

KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data miningJuly 2010Pages 1129–1138https://doi.org/10.1145/1835804.1835946

Published:25 July 2010Publication History

KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 1129–1138

ABSTRACT

Scalable similarity search is the core of many large scale learning or data mining applications. Recently, many research results demonstrate that one promising approach is creating compact and efficient hash codes that preserve data similarity. By efficient, we refer to the low correlation (and thus low redundancy) among generated codes. However, most existing hash methods are designed only for vector data. In this paper, we develop a new hashing algorithm to create efficient codes for large scale data of general formats with any kernel function, including kernels on vectors, graphs, sequences, sets and so on. Starting with the idea analogous to spectral hashing, novel formulations and solutions are proposed such that a kernel based hash function can be explicitly represented and optimized, and directly applied to compute compact hash codes for new samples of general formats. Moreover, we incorporate efficient techniques, such as Nystrom approximation, to further reduce time and space complexity for indexing and search, making our algorithm scalable to huge data sets. Another important advantage of our method is the ability to handle diverse types of similarities according to actual task requirements, including both feature similarities and semantic similarities like label consistency. We evaluate our method using both vector and non-vector data sets at a large scale up to 1 million samples. Our comprehensive results show the proposed method outperforms several state-of-the-art approaches for all the tasks, with a significant gain for most tasks.

Supplemental Material

kdd2010_he_sss_01.mov

mov

121.2 MB

Download

References

J. Freidman, J. Bentley, and A. Finkel. An Algorithm for Finding BestMatches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, page 209--226, 1977. Google ScholarDigital Library
J. Uhlmann. Satisfying general proximity / similarity queries with metric trees. Information Processing Letters, page 175--179, 1991.Google Scholar
P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of 30th Symposium on Theory of Computing (STOC), 1998. Google ScholarDigital Library
A. Gionis, P. Indyk, and R.Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, 1999. Google ScholarDigital Library
M. Charikar. Similarity search in high dimensions via hashing. In Proceedings of the ACM Symposium on Theory of Computing, 2002.Google Scholar
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality sensitive hashing scheme based on p-Stable distributions. In Proceedings of Symposium on Computational Geometry (SOCG), 2004. Google ScholarDigital Library
K. Grauman and T. Darrell. Pyramid match hashing: sub-Linear time indexing over partial correspondences. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.Google Scholar
P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.Google Scholar
R. Salakhutdinov and G. Hinton. Semantic hashing. In Proceedings of ACM SIGIR Special Interest Group on Information Retrieval, 2007.Google Scholar
R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of AI and Statistics, 2007.Google Scholar
A. Torralba, R. Fergus, and Y.Weiss. Small Codes and Large Image Databases for Recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.Google ScholarCross Ref
Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proceedings of Advances in Neural Information Processing Systems, 2008.Google Scholar
Brian Kulis and Kristen Grauman. Kernelized Locality-Sensitive Hashing for Scalable Image Search. In Proceedings of 12th International Conference on Computer Vision, 2009.Google Scholar
Brian Kulis and Trevor Darrell. Learning to hash with binary reconstructive embeddings. In Proceedings of Advances in Neural Information Processing Systems, 2009.Google Scholar
M. Raginsky and S. Lazebnik. Locality sensitive binary codes from shift-invariant kernels. In Proceedings of Advances in Neural Information Processing Systems, 2009.Google Scholar
Kave Eshghi and Shyamsundar Rajaram. Locality sensitive hash functions based on concomitant rank order statistics. In Proceedings of ACM SIGKDD conference on Knowledge Discovery and Data Mining, 2008. Google ScholarDigital Library
Xiaofei He, and Partha Niyogi. Locality Preserving Projections. In Proceedings of Advances in Neural Information Processing Systems, 2003.Google Scholar
A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal on Computer Vision, 2001. Google ScholarDigital Library
D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal on Computer Vision, 2004. Google ScholarDigital Library
Vladimir N Vapnik. The Nature of Statistical Learning Theory, 1995. Google ScholarDigital Library
Christopher Williams, Matthias Seeger. Using the nystr¨om method to speed up kernel machines. In Proceedings of Advances in Neural Information Processing Systems, 2001.Google Scholar
Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Proceedings of Advances in Neural Information Processing Systems, 2001.Google Scholar
L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In Proceedings of CVPR Workshop of Generative Model Based Vision, 2004. Google ScholarDigital Library
S. Lazebnik, C. Schmid and J. Ponce. Beyond bags of features, spatial pyramid matching for recognizing natural scene categories. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006. Google ScholarDigital Library
N. Snavely, S. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. In Proceedings of SIGGRAPH, 2006. Google ScholarDigital Library
Nino Shervashidze, Karsten M. Borgwardt. Fast subtree kernels on graphs. In Proceedings of Advances in Neural Information Processing Systems, 2009.Google Scholar
N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. In Proceedings of ICDM, 2006 Google ScholarDigital Library

Index Terms

Scalable similarity search with optimized kernel hashing
1. Information systems
  1. Information retrieval

Recommendations

Weighted hashing for fast large scale similarity search
CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management

Similarity search, or finding approximate nearest neighbors, is an important technique for many applications. Many recent research demonstrate that hashing methods can achieve promising results for large scale similarity search due to its computational ...
Read More
An Examination of Multivariate Time Series Hashing with Applications to Health Care
ICDM '14: Proceedings of the 2014 IEEE International Conference on Data Mining

As large-scale multivariate time series data become increasingly common in application domains, such as health care and traffic analysis, researchers are challenged to build efficient tools to analyze it and provide useful insights. Similarity search, ...
Read More
Distributed KNN-graph approximation via hashing
ICMR '12: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval

Efficiently constructing the K-Nearest Neighbor Graph (K-NNG) of large and high dimensional datasets is crucial for many applications with feature-rich objects, such as images or other multimedia content. In this paper we investigate the use of high ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
July 2010
1240 pages
ISBN:9781450300551
DOI:10.1145/1835804
General Chairs:
Bharat Rao
Siemens
,
Balaji Krishnapuram
Siemens
,
Program Chairs:
Andrew Tomkins
Google Inc.
,
Qiang Yang
Hong Kong University of Science and Technology
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 25 July 2010
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
hashing
indexing
kernel method
nearest neighbor
scalable
search
structure data
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,133of8,635submissions,13%
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 104
  Total Citations
  View Citations
- 748
  Total Downloads
- Downloads (Last 12 months)7
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Scalable similarity search with optimized kernel hashing

KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

Weighted hashing for fast large scale similarity search

An Examination of Multivariate Time Series Hashing with Applications to Health Care

Distributed KNN-graph approximation via hashing