ABSTRACT
Scalable similarity search is the core of many large scale learning or data mining applications. Recently, many research results demonstrate that one promising approach is creating compact and efficient hash codes that preserve data similarity. By efficient, we refer to the low correlation (and thus low redundancy) among generated codes. However, most existing hash methods are designed only for vector data. In this paper, we develop a new hashing algorithm to create efficient codes for large scale data of general formats with any kernel function, including kernels on vectors, graphs, sequences, sets and so on. Starting with the idea analogous to spectral hashing, novel formulations and solutions are proposed such that a kernel based hash function can be explicitly represented and optimized, and directly applied to compute compact hash codes for new samples of general formats. Moreover, we incorporate efficient techniques, such as Nystrom approximation, to further reduce time and space complexity for indexing and search, making our algorithm scalable to huge data sets. Another important advantage of our method is the ability to handle diverse types of similarities according to actual task requirements, including both feature similarities and semantic similarities like label consistency. We evaluate our method using both vector and non-vector data sets at a large scale up to 1 million samples. Our comprehensive results show the proposed method outperforms several state-of-the-art approaches for all the tasks, with a significant gain for most tasks.
Supplemental Material
- J. Freidman, J. Bentley, and A. Finkel. An Algorithm for Finding BestMatches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, page 209--226, 1977. Google ScholarDigital Library
- J. Uhlmann. Satisfying general proximity / similarity queries with metric trees. Information Processing Letters, page 175--179, 1991.Google Scholar
- P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of 30th Symposium on Theory of Computing (STOC), 1998. Google ScholarDigital Library
- A. Gionis, P. Indyk, and R.Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, 1999. Google ScholarDigital Library
- M. Charikar. Similarity search in high dimensions via hashing. In Proceedings of the ACM Symposium on Theory of Computing, 2002.Google Scholar
- M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality sensitive hashing scheme based on p-Stable distributions. In Proceedings of Symposium on Computational Geometry (SOCG), 2004. Google ScholarDigital Library
- K. Grauman and T. Darrell. Pyramid match hashing: sub-Linear time indexing over partial correspondences. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.Google Scholar
- P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.Google Scholar
- R. Salakhutdinov and G. Hinton. Semantic hashing. In Proceedings of ACM SIGIR Special Interest Group on Information Retrieval, 2007.Google Scholar
- R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of AI and Statistics, 2007.Google Scholar
- A. Torralba, R. Fergus, and Y.Weiss. Small Codes and Large Image Databases for Recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.Google ScholarCross Ref
- Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proceedings of Advances in Neural Information Processing Systems, 2008.Google Scholar
- Brian Kulis and Kristen Grauman. Kernelized Locality-Sensitive Hashing for Scalable Image Search. In Proceedings of 12th International Conference on Computer Vision, 2009.Google Scholar
- Brian Kulis and Trevor Darrell. Learning to hash with binary reconstructive embeddings. In Proceedings of Advances in Neural Information Processing Systems, 2009.Google Scholar
- M. Raginsky and S. Lazebnik. Locality sensitive binary codes from shift-invariant kernels. In Proceedings of Advances in Neural Information Processing Systems, 2009.Google Scholar
- Kave Eshghi and Shyamsundar Rajaram. Locality sensitive hash functions based on concomitant rank order statistics. In Proceedings of ACM SIGKDD conference on Knowledge Discovery and Data Mining, 2008. Google ScholarDigital Library
- Xiaofei He, and Partha Niyogi. Locality Preserving Projections. In Proceedings of Advances in Neural Information Processing Systems, 2003.Google Scholar
- A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal on Computer Vision, 2001. Google ScholarDigital Library
- D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal on Computer Vision, 2004. Google ScholarDigital Library
- Vladimir N Vapnik. The Nature of Statistical Learning Theory, 1995. Google ScholarDigital Library
- Christopher Williams, Matthias Seeger. Using the nystr¨om method to speed up kernel machines. In Proceedings of Advances in Neural Information Processing Systems, 2001.Google Scholar
- Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Proceedings of Advances in Neural Information Processing Systems, 2001.Google Scholar
- L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In Proceedings of CVPR Workshop of Generative Model Based Vision, 2004. Google ScholarDigital Library
- S. Lazebnik, C. Schmid and J. Ponce. Beyond bags of features, spatial pyramid matching for recognizing natural scene categories. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006. Google ScholarDigital Library
- N. Snavely, S. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. In Proceedings of SIGGRAPH, 2006. Google ScholarDigital Library
- Nino Shervashidze, Karsten M. Borgwardt. Fast subtree kernels on graphs. In Proceedings of Advances in Neural Information Processing Systems, 2009.Google Scholar
- N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. In Proceedings of ICDM, 2006 Google ScholarDigital Library
Index Terms
- Scalable similarity search with optimized kernel hashing
Recommendations
Weighted hashing for fast large scale similarity search
CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge ManagementSimilarity search, or finding approximate nearest neighbors, is an important technique for many applications. Many recent research demonstrate that hashing methods can achieve promising results for large scale similarity search due to its computational ...
An Examination of Multivariate Time Series Hashing with Applications to Health Care
ICDM '14: Proceedings of the 2014 IEEE International Conference on Data MiningAs large-scale multivariate time series data become increasingly common in application domains, such as health care and traffic analysis, researchers are challenged to build efficient tools to analyze it and provide useful insights. Similarity search, ...
Distributed KNN-graph approximation via hashing
ICMR '12: Proceedings of the 2nd ACM International Conference on Multimedia RetrievalEfficiently constructing the K-Nearest Neighbor Graph (K-NNG) of large and high dimensional datasets is crucial for many applications with feature-rich objects, such as images or other multimedia content. In this paper we investigate the use of high ...
Comments