research-article

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

Authors:
Anshumali Shrivastava

Cornell University, Ithaca, NY, USA

Cornell University, Ithaca, NY, USA
View Profile

,
Ping Li

Rutgers University, Piscataway, NJ, USA

Rutgers University, Piscataway, NJ, USA
View Profile

WWW '15: Proceedings of the 24th International Conference on World Wide WebMay 2015Pages 981–991https://doi.org/10.1145/2736277.2741285

Published:18 May 2015Publication History

WWW '15: Proceedings of the 24th International Conference on World Wide Web

Pages 981–991

ABSTRACT

Minwise hashing (Minhash) is a widely popular indexing scheme in practice. Minhash is designed for estimating set resemblance and is known to be suboptimal in many applications where the desired measure is set overlap (i.e., inner product between binary vectors) or set containment. Minhash has inherent bias towards smaller sets, which adversely affects its performance in applications where such a penalization is not desirable. In this paper, we propose asymmetric minwise hashing ({\em MH-ALSH}), to provide a solution to this well-known problem. The new scheme utilizes asymmetric transformations to cancel the bias of traditional minhash towards smaller sets, making the final ``collision probability'' monotonic in the inner product. Our theoretical comparisons show that, for the task of retrieving with binary inner products, asymmetric minhash is provably better than traditional minhash and other recently proposed hashing algorithms for general inner products. Thus, we obtain an algorithmic improvement over existing approaches in the literature. Experimental evaluations on four publicly available high-dimensional datasets validate our claims. The proposed scheme outperforms, often significantly, other hashing algorithms on the task of near neighbor retrieval with set containment. Our proposal is simple and easy to implement in practice.

References

P. Agrawal, A. Arasu, and R. Kaushik. On indexing error-tolerant set containment. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 927--938. ACM, 2010. Google ScholarDigital Library
A. Andoni and P. Indyk. E2lsh: Exact euclidean locality sensitive hashing. Technical report, 2004.Google Scholar
Y. Bachrach, Y. Finkelstein, R. Gilad-Bachrach, L. Katzir, N. Koenigstein, N. Nice, and U. Paquet. Speeding up the xbox recommender system using a euclidean transformation for inner-product spaces. In RecSys, 2014. Google ScholarDigital Library
Y. Bachrach, E. Porat, and J. S. Rosenschein. Sketching techniques for collaborative filtering. In Proceedings of the 21st International Jont Conference on Artifical Intelligence, IJCAI'09, 2009. Google ScholarDigital Library
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007. Google ScholarDigital Library
A. Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21--29, Positano, Italy, 1997. Google ScholarDigital Library
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In STOC, pages 327--336, Dallas, TX, 1998. Google ScholarDigital Library
T. Chandra, E. Ie, K. Goldman, T. L. Llinares, J. McFadden, F. Pereira, J. Redstone, T. Shaked, and Y. Singer. Sibyl: a system for large scale machine learning.Google Scholar
O. Chapelle, P. Haffner, and V. N. Vapnik. Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks, 10(5):1055--1064, 1999. Google ScholarDigital Library
M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380--388, Montreal, Quebec, Canada, 2002. Google ScholarDigital Library
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operatior for similarity joins in data cleaning. In ICDE, 2006. Google ScholarDigital Library
S. Chaudhuri, V. Ganti, and D. Xin. Mining document collections to facilitate accurate approximate entity matching. Proceedings of the VLDB Endowment, 2(1):395--406, 2009. Google ScholarDigital Library
F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On compressing social networks. In KDD, pages 219--228, Paris, France, 2009. Google ScholarDigital Library
G. Cormode and S. Muthukrishnan. Space efficient mining of multigraph streams. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 271--282. ACM, 2005. Google ScholarDigital Library
A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th international conference on World Wide Web, pages 271--280. ACM, 2007. Google ScholarDigital Library
M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokn. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253 -- 262, Brooklyn, NY, 2004. Google ScholarDigital Library
S. Geva and C. M. De Vries. Topsig: Topology preserving document signatures. In CIKM, pages 333--338, 2011. Google ScholarDigital Library
M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115--1145, 1995. Google ScholarDigital Library
B. Haeupler, M. Manasse, and K. Talwar. Consistent weighted sampling made fast, small, and easy. Technical report, arXiv:1410.4266, 2014.Google Scholar
M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In AISTATS, pages 136--143, Barbados, 2005.Google Scholar
M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, pages 284--291, 2006. Google ScholarDigital Library
P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604--613, Dallas, TX, 1998. Google ScholarDigital Library
S. Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, pages 246--255, Sydney, AU, 2010. Google ScholarDigital Library
Y. Jiang, C. Ngo, and J. Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494--501, Amsterdam, Netherlands, 2007. Google ScholarDigital Library
N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 802--803. ACM, 2006. Google ScholarDigital Library
P. Li. Min-max kernels. Technical report, arXiv:1503. 0173, 2015.Google Scholar
P. Li and A. C. König. Theory and applications b-bit minwise hashing. Commun. ACM, 2011. Google ScholarDigital Library
P. Li, M. Mitzenmacher, and A. Shrivastava. Coding for random projections and approximate near neighbor search. Technical report, arXiv:1403.8144, 2014.Google Scholar
P. Li, A. B. Owen, and C.-H. Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012.Google ScholarDigital Library
M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.Google Scholar
S. Melnik and H. Garcia-Molina. Adaptive algorithms for set containment joins. ACM Transactions on Database Systems (TODS), 28(1):56--99, 2003. Google ScholarDigital Library
B. Neyshabur and N. Srebro. A simpler and better lsh for maximum inner product search (mips). Technical report, arXiv:1410.5518, 2014.Google Scholar
A. Rajaraman and J. Ullman. Mining of Massive Datasets. http://i.stanford.edu/ ullman/mmds.html. Google ScholarDigital Library
K. Ramasamy, J. F. Naughton, and R. Kaushik. Set containment joins: The good, the bad and the ugly.Google Scholar
A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (mips). In NIPS, Montreal, CA, 2014.Google ScholarDigital Library
A. Shrivastava and P. Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, Beijing, China, 2014.Google ScholarDigital Library
A. Shrivastava and P. Li. Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). arXiv:1410.5410 (submitted to AISTATS), 2014.Google Scholar
A. Shrivastava and P. Li. Improved densification of one permutation hashing. In UAI, Quebec City, CA, 2014.Google ScholarDigital Library
A. Shrivastava and P. Li. In defense of minhash over simhash. In AISTATS, 2014.Google Scholar
R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194--205, 1998. Google ScholarDigital Library

Index Terms

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment
1. Information systems
  1. Information retrieval

Recommendations

Accurate and Fast Asymmetric Locality-Sensitive Hashing Scheme for Maximum Inner Product Search
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

The problem of Approximate Maximum Inner Product (AMIP) search has received increasing attention due to its wide applications. Interestingly, based on asymmetric transformation, the problem can be reduced to the Approximate Nearest Neighbor (ANN) search,...
Read More
Learning Sparse Binary Code for Maximum Inner Product Search
CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management

Maximum inner product search (MIPS), combined with the hashing method, has become a standard solution to similarity search problems. It often achieves an order of magnitude speedup over nearest neighbor search (NNS) under similar settings. Motivated by ...
Read More
GPU-based minwise hashing: GPU-based minwise hashing
WWW '12 Companion: Proceedings of the 21st International Conference on World Wide Web

Minwise hashing is a standard technique for efficient set similarity estimation in the context of search. The recent work of b-bit minwise hashing provided a substantial improvement by storing only the lowest b bits of each hashed value. Both minwise ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WWW '15: Proceedings of the 24th International Conference on World Wide Web
May 2015
1460 pages
ISBN:9781450334693
General Chairs:
Aldo Gangemi
National Research Council, Italy & Paris 13 University-CNRS, France
,
Stefano Leonardi
Sapienza University of Rome, Italy
,
Alessandro Panconesi
Sapienza University of Rome, Italy
Copyright © 2015 Copyright is held by the International World Wide Web Conference Committee (IW3C2)
Sponsors
In-Cooperation
Publisher
International World Wide Web Conferences Steering Committee
Republic and Canton of Geneva, Switzerland
Publication History
- Published: 18 May 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
asymmetric minwise hashing
indexing
maximum inner product search
Qualifiers
- research-article
Conference

Acceptance Rates
WWW '15 Paper Acceptance Rate131of929submissions,14%Overall Acceptance Rate1,899of8,196submissions,23%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 47
  Total Citations
  View Citations
- 370
  Total Downloads
- Downloads (Last 12 months)19
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

WWW '15: Proceedings of the 24th International Conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Accurate and Fast Asymmetric Locality-Sensitive Hashing Scheme for Maximum Inner Product Search

Learning Sparse Binary Code for Maximum Inner Product Search

GPU-based minwise hashing: GPU-based minwise hashing