ABSTRACT
Minwise hashing (Minhash) is a widely popular indexing scheme in practice. Minhash is designed for estimating set resemblance and is known to be suboptimal in many applications where the desired measure is set overlap (i.e., inner product between binary vectors) or set containment. Minhash has inherent bias towards smaller sets, which adversely affects its performance in applications where such a penalization is not desirable. In this paper, we propose asymmetric minwise hashing ({\em MH-ALSH}), to provide a solution to this well-known problem. The new scheme utilizes asymmetric transformations to cancel the bias of traditional minhash towards smaller sets, making the final ``collision probability'' monotonic in the inner product. Our theoretical comparisons show that, for the task of retrieving with binary inner products, asymmetric minhash is provably better than traditional minhash and other recently proposed hashing algorithms for general inner products. Thus, we obtain an algorithmic improvement over existing approaches in the literature. Experimental evaluations on four publicly available high-dimensional datasets validate our claims. The proposed scheme outperforms, often significantly, other hashing algorithms on the task of near neighbor retrieval with set containment. Our proposal is simple and easy to implement in practice.
- P. Agrawal, A. Arasu, and R. Kaushik. On indexing error-tolerant set containment. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 927--938. ACM, 2010. Google ScholarDigital Library
- A. Andoni and P. Indyk. E2lsh: Exact euclidean locality sensitive hashing. Technical report, 2004.Google Scholar
- Y. Bachrach, Y. Finkelstein, R. Gilad-Bachrach, L. Katzir, N. Koenigstein, N. Nice, and U. Paquet. Speeding up the xbox recommender system using a euclidean transformation for inner-product spaces. In RecSys, 2014. Google ScholarDigital Library
- Y. Bachrach, E. Porat, and J. S. Rosenschein. Sketching techniques for collaborative filtering. In Proceedings of the 21st International Jont Conference on Artifical Intelligence, IJCAI'09, 2009. Google ScholarDigital Library
- R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007. Google ScholarDigital Library
- A. Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21--29, Positano, Italy, 1997. Google ScholarDigital Library
- A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In STOC, pages 327--336, Dallas, TX, 1998. Google ScholarDigital Library
- T. Chandra, E. Ie, K. Goldman, T. L. Llinares, J. McFadden, F. Pereira, J. Redstone, T. Shaked, and Y. Singer. Sibyl: a system for large scale machine learning.Google Scholar
- O. Chapelle, P. Haffner, and V. N. Vapnik. Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks, 10(5):1055--1064, 1999. Google ScholarDigital Library
- M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380--388, Montreal, Quebec, Canada, 2002. Google ScholarDigital Library
- S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operatior for similarity joins in data cleaning. In ICDE, 2006. Google ScholarDigital Library
- S. Chaudhuri, V. Ganti, and D. Xin. Mining document collections to facilitate accurate approximate entity matching. Proceedings of the VLDB Endowment, 2(1):395--406, 2009. Google ScholarDigital Library
- F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On compressing social networks. In KDD, pages 219--228, Paris, France, 2009. Google ScholarDigital Library
- G. Cormode and S. Muthukrishnan. Space efficient mining of multigraph streams. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 271--282. ACM, 2005. Google ScholarDigital Library
- A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th international conference on World Wide Web, pages 271--280. ACM, 2007. Google ScholarDigital Library
- M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokn. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253 -- 262, Brooklyn, NY, 2004. Google ScholarDigital Library
- S. Geva and C. M. De Vries. Topsig: Topology preserving document signatures. In CIKM, pages 333--338, 2011. Google ScholarDigital Library
- M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115--1145, 1995. Google ScholarDigital Library
- B. Haeupler, M. Manasse, and K. Talwar. Consistent weighted sampling made fast, small, and easy. Technical report, arXiv:1410.4266, 2014.Google Scholar
- M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In AISTATS, pages 136--143, Barbados, 2005.Google Scholar
- M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, pages 284--291, 2006. Google ScholarDigital Library
- P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604--613, Dallas, TX, 1998. Google ScholarDigital Library
- S. Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, pages 246--255, Sydney, AU, 2010. Google ScholarDigital Library
- Y. Jiang, C. Ngo, and J. Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494--501, Amsterdam, Netherlands, 2007. Google ScholarDigital Library
- N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 802--803. ACM, 2006. Google ScholarDigital Library
- P. Li. Min-max kernels. Technical report, arXiv:1503. 0173, 2015.Google Scholar
- P. Li and A. C. König. Theory and applications b-bit minwise hashing. Commun. ACM, 2011. Google ScholarDigital Library
- P. Li, M. Mitzenmacher, and A. Shrivastava. Coding for random projections and approximate near neighbor search. Technical report, arXiv:1403.8144, 2014.Google Scholar
- P. Li, A. B. Owen, and C.-H. Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012.Google ScholarDigital Library
- M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.Google Scholar
- S. Melnik and H. Garcia-Molina. Adaptive algorithms for set containment joins. ACM Transactions on Database Systems (TODS), 28(1):56--99, 2003. Google ScholarDigital Library
- B. Neyshabur and N. Srebro. A simpler and better lsh for maximum inner product search (mips). Technical report, arXiv:1410.5518, 2014.Google Scholar
- A. Rajaraman and J. Ullman. Mining of Massive Datasets. http://i.stanford.edu/ ullman/mmds.html. Google ScholarDigital Library
- K. Ramasamy, J. F. Naughton, and R. Kaushik. Set containment joins: The good, the bad and the ugly.Google Scholar
- A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (mips). In NIPS, Montreal, CA, 2014.Google ScholarDigital Library
- A. Shrivastava and P. Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, Beijing, China, 2014.Google ScholarDigital Library
- A. Shrivastava and P. Li. Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). arXiv:1410.5410 (submitted to AISTATS), 2014.Google Scholar
- A. Shrivastava and P. Li. Improved densification of one permutation hashing. In UAI, Quebec City, CA, 2014.Google ScholarDigital Library
- A. Shrivastava and P. Li. In defense of minhash over simhash. In AISTATS, 2014.Google Scholar
- R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194--205, 1998. Google ScholarDigital Library
Index Terms
- Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment
Recommendations
Accurate and Fast Asymmetric Locality-Sensitive Hashing Scheme for Maximum Inner Product Search
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data MiningThe problem of Approximate Maximum Inner Product (AMIP) search has received increasing attention due to its wide applications. Interestingly, based on asymmetric transformation, the problem can be reduced to the Approximate Nearest Neighbor (ANN) search,...
Learning Sparse Binary Code for Maximum Inner Product Search
CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge ManagementMaximum inner product search (MIPS), combined with the hashing method, has become a standard solution to similarity search problems. It often achieves an order of magnitude speedup over nearest neighbor search (NNS) under similar settings. Motivated by ...
GPU-based minwise hashing: GPU-based minwise hashing
WWW '12 Companion: Proceedings of the 21st International Conference on World Wide WebMinwise hashing is a standard technique for efficient set similarity estimation in the context of search. The recent work of b-bit minwise hashing provided a substantial improvement by storing only the lowest b bits of each hashed value. Both minwise ...
Comments