skip to main content
10.1145/1835804.1835946acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Scalable similarity search with optimized kernel hashing

Published:25 July 2010Publication History

ABSTRACT

Scalable similarity search is the core of many large scale learning or data mining applications. Recently, many research results demonstrate that one promising approach is creating compact and efficient hash codes that preserve data similarity. By efficient, we refer to the low correlation (and thus low redundancy) among generated codes. However, most existing hash methods are designed only for vector data. In this paper, we develop a new hashing algorithm to create efficient codes for large scale data of general formats with any kernel function, including kernels on vectors, graphs, sequences, sets and so on. Starting with the idea analogous to spectral hashing, novel formulations and solutions are proposed such that a kernel based hash function can be explicitly represented and optimized, and directly applied to compute compact hash codes for new samples of general formats. Moreover, we incorporate efficient techniques, such as Nystrom approximation, to further reduce time and space complexity for indexing and search, making our algorithm scalable to huge data sets. Another important advantage of our method is the ability to handle diverse types of similarities according to actual task requirements, including both feature similarities and semantic similarities like label consistency. We evaluate our method using both vector and non-vector data sets at a large scale up to 1 million samples. Our comprehensive results show the proposed method outperforms several state-of-the-art approaches for all the tasks, with a significant gain for most tasks.

Skip Supplemental Material Section

Supplemental Material

kdd2010_he_sss_01.mov

mov

121.2 MB

References

  1. J. Freidman, J. Bentley, and A. Finkel. An Algorithm for Finding BestMatches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, page 209--226, 1977. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. J. Uhlmann. Satisfying general proximity / similarity queries with metric trees. Information Processing Letters, page 175--179, 1991.Google ScholarGoogle Scholar
  3. P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of 30th Symposium on Theory of Computing (STOC), 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Gionis, P. Indyk, and R.Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. M. Charikar. Similarity search in high dimensions via hashing. In Proceedings of the ACM Symposium on Theory of Computing, 2002.Google ScholarGoogle Scholar
  6. M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality sensitive hashing scheme based on p-Stable distributions. In Proceedings of Symposium on Computational Geometry (SOCG), 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. K. Grauman and T. Darrell. Pyramid match hashing: sub-Linear time indexing over partial correspondences. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.Google ScholarGoogle Scholar
  8. P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.Google ScholarGoogle Scholar
  9. R. Salakhutdinov and G. Hinton. Semantic hashing. In Proceedings of ACM SIGIR Special Interest Group on Information Retrieval, 2007.Google ScholarGoogle Scholar
  10. R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of AI and Statistics, 2007.Google ScholarGoogle Scholar
  11. A. Torralba, R. Fergus, and Y.Weiss. Small Codes and Large Image Databases for Recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  12. Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proceedings of Advances in Neural Information Processing Systems, 2008.Google ScholarGoogle Scholar
  13. Brian Kulis and Kristen Grauman. Kernelized Locality-Sensitive Hashing for Scalable Image Search. In Proceedings of 12th International Conference on Computer Vision, 2009.Google ScholarGoogle Scholar
  14. Brian Kulis and Trevor Darrell. Learning to hash with binary reconstructive embeddings. In Proceedings of Advances in Neural Information Processing Systems, 2009.Google ScholarGoogle Scholar
  15. M. Raginsky and S. Lazebnik. Locality sensitive binary codes from shift-invariant kernels. In Proceedings of Advances in Neural Information Processing Systems, 2009.Google ScholarGoogle Scholar
  16. Kave Eshghi and Shyamsundar Rajaram. Locality sensitive hash functions based on concomitant rank order statistics. In Proceedings of ACM SIGKDD conference on Knowledge Discovery and Data Mining, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Xiaofei He, and Partha Niyogi. Locality Preserving Projections. In Proceedings of Advances in Neural Information Processing Systems, 2003.Google ScholarGoogle Scholar
  18. A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal on Computer Vision, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal on Computer Vision, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Vladimir N Vapnik. The Nature of Statistical Learning Theory, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Christopher Williams, Matthias Seeger. Using the nystr¨om method to speed up kernel machines. In Proceedings of Advances in Neural Information Processing Systems, 2001.Google ScholarGoogle Scholar
  22. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Proceedings of Advances in Neural Information Processing Systems, 2001.Google ScholarGoogle Scholar
  23. L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In Proceedings of CVPR Workshop of Generative Model Based Vision, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. S. Lazebnik, C. Schmid and J. Ponce. Beyond bags of features, spatial pyramid matching for recognizing natural scene categories. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. N. Snavely, S. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. In Proceedings of SIGGRAPH, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Nino Shervashidze, Karsten M. Borgwardt. Fast subtree kernels on graphs. In Proceedings of Advances in Neural Information Processing Systems, 2009.Google ScholarGoogle Scholar
  27. N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. In Proceedings of ICDM, 2006 Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Scalable similarity search with optimized kernel hashing

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
      July 2010
      1240 pages
      ISBN:9781450300551
      DOI:10.1145/1835804

      Copyright © 2010 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 25 July 2010

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate1,133of8,635submissions,13%

      Upcoming Conference

      KDD '24

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader