research-article

Theory and applications of b-bit minwise hashing

Authors:
Ping Li

Cornell University, Ithaca, NY

Arnd Christian König

Microsoft Research, Microsoft Corporation, Redmond, WA

Communications of the ACM, Volume 54, Issue 8, August 2011, pp. 101–109. https://doi.org/10.1145/1978542.1978566

Published: 01 August 2011

Index Terms

Theory and applications of b-bit minwise hashing
1. Information systems
  1. Information systems applications

Published in

Communications of the ACM Volume 54, Issue 8
August 2011
129 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/1978542
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 August 2011
Qualifiers
- research-article
- Popular
- Refereed