skip to main content
10.1145/1807085.1807094acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
research-article

An optimal algorithm for the distinct elements problem

Published:06 June 2010Publication History

ABSTRACT

We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has applications to query optimization, Internet routing, network topology, and data mining. For a stream of indices in {1,...,n}, our algorithm computes a (1 ± ε)-approximation using an optimal O(1/ε-2 + log(n)) bits of space with 2/3 success probability, where 0<ε<1 is given. This probability can be amplified by independent repetition. Furthermore, our algorithm processes each stream update in O(1) worst-case time, and can report an estimate at any point midstream in O(1) worst-case time, thus settling both the space and time complexities simultaneously.

We also give an algorithm to estimate the Hamming norm of a stream, a generalization of the number of distinct elements, which is useful in data cleaning, packet tracing, and database auditing. Our algorithm uses nearly optimal space, and has optimal O(1) update and reporting times.

References

  1. S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua approximate query answering system. In SIGMOD, pages 574--576, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Akella, A. Bharambe, M. Reiter, and S. Seshan. Detecting DDoS attacks on ISP networks. In Proc. MPDS, 2003.Google ScholarGoogle Scholar
  3. N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci., 58(1):137--147, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proc. RANDOM, pages 1--10, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proc. SODA, pages 623--632, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. K. S. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. In SIGMOD, pages 199--210, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. D. K. Blandford and G. E. Blelloch. Compact dictionaries for variable-length keys and data with applications. ACM Trans. Alg., 4(2), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Brodnik. Computation of the least significant set bit. In Proc. ERK, 1993.Google ScholarGoogle Scholar
  9. J. Brody and A. Chakrabarti. A multi-round communication lower bound for gap hamming and some consequences. In Proc. CCC, pages 358--368, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P. Brown, P. J. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y. Sismanis. Toward automated large-scale information integration and discovery. In Data Management in a Connected World, pages 161--180, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. L. Carter and M. N. Wegman. Universal classes of hash functions. J. Comput. Syst. Sci., 18(2):143--154, 1979.Google ScholarGoogle ScholarCross RefCross Ref
  12. E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci., 55(3):441--453, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan. Comparing data streams using hamming norms (how to zero in). IEEE Trans. Knowl. Data Eng., 15(3):529--540, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In SIGMOD, pages 240--251, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. D. P. Dubhashi and D. Ranjan. Balls and bins: A study in negative dependence. Random Struct. Algorithms, 13(2):99--124, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Durand and P. Flajolet. Loglog counting of large cardinalities (extended abstract). In Proc. ESA, pages 605--617, 2003.Google ScholarGoogle Scholar
  17. C. Estan, G. Varghese, and M. E. Fisk. Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Trans. Netw., 14(5):925--937, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. S. J. Finkelstein, M. Schkolnick, and P. Tiberio. Physical database design for relational databases. ACM Trans. Database Syst., 13(1):91--128, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. Disc. Math. and Theor. Comp. Sci., AH:127--146, 2007.Google ScholarGoogle Scholar
  20. P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182--209, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. M. L. Fredman and D. E. Willard. Surpassing the information theoretic bound with fusion trees. J. Comput. Syst. Sci., 47(3):424--436, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. S. Ganguly. Counting distinct items over update streams. Theor. Comput. Sci., 378(3):211--222, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB, pages 541--550, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. P. B. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. SPAA, pages 281--291, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. P. Indyk. Algorithms for dynamic geometric problems over data streams. In Proc. STOC, pages 373--380, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. P. Indyk and D. P. Woodruff. Tight lower bounds for the distinct elements problem. In Proc. FOCS, pages 283---, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. D. M. Kane, J. Nelson, and D. P. Woodruff. On the exact space complexity of sketching and streaming small norms. In Proc. SODA, pages 1161--1178, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995. Google ScholarGoogle ScholarCross RefCross Ref
  29. M. H. Overmars. The Design of Dynamic Data Structures. Springer, 1983. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. S. Padmanabhan, B. Bhattacharjee, T. Malkemus, L. Cranston, and M. Huras. Multi-dimensional clustering: A new data layout scheme in db2. In SIGMOD, pages 637--641, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. A. Pagh and R. Pagh. Uniform hashing in constant time and optimal space. SIAM J. Comput., 38(1):85--96, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. C. R. Palmer, G. Siganos, M. Faloutsos, and C. Faloutsos. The connectivity and fault-tolerance of the internet topology. In NRDM Workshop, 2001.Google ScholarGoogle Scholar
  33. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In SIGMOD, pages 23--34, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. A. Shukla, P. Deshpande, J. F. Naughton, and K. Ramasamy. Storage estimation for multidimensional aggregates in the presence of hierarchies. In Proc. VLDB, pages 522--531, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. A. Siegel. On universal classes of extremely random constant-time hash functions. SIAM J. Computing, 33(3):505--543, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. D. P. Woodruff. Optimal space lower bounds for all frequency moments. In Proc. SODA, pages 167--175, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. An optimal algorithm for the distinct elements problem

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        PODS '10: Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
        June 2010
        350 pages
        ISBN:9781450300339
        DOI:10.1145/1807085

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 6 June 2010

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        PODS '10 Paper Acceptance Rate27of113submissions,24%Overall Acceptance Rate642of2,707submissions,24%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader