DOI: 10.1145/3035918.3035963

A General-Purpose Counting Filter: Making Every Bit Count

Published: 9 May 2017

ABSTRACT

Approximate Membership Query (AMQ) data structures, such as the Bloom filter, quotient filter, and cuckoo filter, have found numerous applications in databases, storage systems, networks, computational biology, and other domains. However, many applications must work around limitations in the capabilities or performance of current AMQs, making these applications more complex and less performant. For example, many current AMQs cannot delete or count the number of occurrences of each input item, take up large amounts of space, are slow, cannot be resized or merged, or have poor locality of reference and hence perform poorly when stored on SSD or disk. This paper proposes a new general-purpose AMQ, the counting quotient filter (CQF). The CQF supports approximate membership testing and counting the occurrences of items in a data set. This general-purpose AMQ is small and fast, has good locality of reference, scales out of RAM to SSD, and supports deletions, counting (even on skewed data sets), resizing, merging, and highly concurrent access. The paper reports on the structure's performance on both manufactured and application-generated data sets.
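The abstract does not spell out the quotienting idea that quotient-filter-style AMQs are built on, so a minimal sketch may help: a single hash value is split into a quotient, which selects a slot, and a remainder, which is the fingerprint actually stored. The bit widths and function name below are illustrative assumptions, not the paper's API.

```python
def split_hash(h, q_bits=8, r_bits=8):
    """Split a hash value into (quotient, remainder).

    The quotient (the high q_bits) picks the home slot; only the
    remainder (the low r_bits) needs to be stored in that slot,
    which is what makes quotient filters compact.
    """
    remainder = h & ((1 << r_bits) - 1)
    quotient = (h >> r_bits) & ((1 << q_bits) - 1)
    return quotient, remainder
```

Because the quotient is implied by the slot position, a filter with 2^q_bits slots recovers the full q_bits + r_bits fingerprint while storing only r_bits per item.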

In our experiments, the CQF performs in-memory inserts and queries up to an order of magnitude faster than the original quotient filter, several times faster than a Bloom filter, and similarly to the cuckoo filter, even though none of these other data structures supports counting. On SSD, the CQF outperforms all structures by a factor of at least 2 because the CQF has good data locality.

The CQF achieves these performance gains by restructuring the metadata bits of the quotient filter to obtain fast lookups at high load factors (i.e., even when the data structure is almost full). As a result, the CQF offers good lookup performance even up to a load factor of 95%. Counting is essentially free in the CQF, in the sense that the structure is as space-efficient as, or more space-efficient than, even non-counting data structures (e.g., Bloom, quotient, and cuckoo filters).

The paper also shows how to speed up CQF operations by using new x86 bit-manipulation instructions introduced in Intel's Haswell line of processors. The restructured metadata transforms many quotient filter metadata operations into rank-and-select bit-vector operations. Thus, our efficient implementations of rank and select may be useful for other rank-and-select-based data structures.
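The rank and select primitives mentioned above can be sketched in a few lines. The portable definitions below are illustrative, not code from the paper; the comments note the hardware shortcuts that, to our reading, the Haswell bit-manipulation instructions enable (POPCNT for rank; PDEP plus TZCNT for a branch-free select).

```python
def rank(x, i):
    """rank(x, i): number of 1 bits in x among bit positions 0..i inclusive.

    On x86 this is a mask followed by a single POPCNT instruction.
    """
    return bin(x & ((1 << (i + 1)) - 1)).count("1")


def select(x, k):
    """select(x, k): position of the k-th 1 bit in x (k counted from 1),
    or -1 if x has fewer than k set bits.

    A branch-free x86 variant deposits a single bit at the k-th set
    position with PDEP and locates it with TZCNT:
        tzcnt(pdep(1 << (k - 1), x))
    """
    for i in range(64):
        if x & (1 << i):
            k -= 1
            if k == 0:
                return i
    return -1
```

The two operations are inverses in the usual sense: select(x, rank(x, i)) returns the position of the last set bit at or before i, which is the kind of metadata navigation a rank-and-select-based filter performs on every lookup.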


                • Published in

                  cover image ACM Conferences
                  SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
                  May 2017
                  1810 pages
                  ISBN:9781450341974
                  DOI:10.1145/3035918

                  Copyright © 2017 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Publication History

                  • Published: 9 May 2017

Overall acceptance rate: 785 of 4,003 submissions, 20%
