ABSTRACT
We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has applications to query optimization, Internet routing, network topology, and data mining. For a stream of indices in {1,...,n}, our algorithm computes a (1 ± ε)-approximation using an optimal O(1/ε-2 + log(n)) bits of space with 2/3 success probability, where 0<ε<1 is given. This probability can be amplified by independent repetition. Furthermore, our algorithm processes each stream update in O(1) worst-case time, and can report an estimate at any point midstream in O(1) worst-case time, thus settling both the space and time complexities simultaneously.
We also give an algorithm to estimate the Hamming norm of a stream, a generalization of the number of distinct elements, which is useful in data cleaning, packet tracing, and database auditing. Our algorithm uses nearly optimal space, and has optimal O(1) update and reporting times.
- S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua approximate query answering system. In SIGMOD, pages 574--576, 1999. Google ScholarDigital Library
- A. Akella, A. Bharambe, M. Reiter, and S. Seshan. Detecting DDoS attacks on ISP networks. In Proc. MPDS, 2003.Google Scholar
- N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci., 58(1):137--147, 1999. Google ScholarDigital Library
- Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proc. RANDOM, pages 1--10, 2002. Google ScholarDigital Library
- Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proc. SODA, pages 623--632, 2002. Google ScholarDigital Library
- K. S. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. In SIGMOD, pages 199--210, 2007. Google ScholarDigital Library
- D. K. Blandford and G. E. Blelloch. Compact dictionaries for variable-length keys and data with applications. ACM Trans. Alg., 4(2), 2008. Google ScholarDigital Library
- A. Brodnik. Computation of the least significant set bit. In Proc. ERK, 1993.Google Scholar
- J. Brody and A. Chakrabarti. A multi-round communication lower bound for gap hamming and some consequences. In Proc. CCC, pages 358--368, 2009. Google ScholarDigital Library
- P. Brown, P. J. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y. Sismanis. Toward automated large-scale information integration and discovery. In Data Management in a Connected World, pages 161--180, 2005. Google ScholarDigital Library
- L. Carter and M. N. Wegman. Universal classes of hash functions. J. Comput. Syst. Sci., 18(2):143--154, 1979.Google ScholarCross Ref
- E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci., 55(3):441--453, 1997. Google ScholarDigital Library
- G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan. Comparing data streams using hamming norms (how to zero in). IEEE Trans. Knowl. Data Eng., 15(3):529--540, 2003. Google ScholarDigital Library
- T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In SIGMOD, pages 240--251, 2002. Google ScholarDigital Library
- D. P. Dubhashi and D. Ranjan. Balls and bins: A study in negative dependence. Random Struct. Algorithms, 13(2):99--124, 1998. Google ScholarDigital Library
- M. Durand and P. Flajolet. Loglog counting of large cardinalities (extended abstract). In Proc. ESA, pages 605--617, 2003.Google Scholar
- C. Estan, G. Varghese, and M. E. Fisk. Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Trans. Netw., 14(5):925--937, 2006. Google ScholarDigital Library
- S. J. Finkelstein, M. Schkolnick, and P. Tiberio. Physical database design for relational databases. ACM Trans. Database Syst., 13(1):91--128, 1988. Google ScholarDigital Library
- P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. Disc. Math. and Theor. Comp. Sci., AH:127--146, 2007.Google Scholar
- P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182--209, 1985. Google ScholarDigital Library
- M. L. Fredman and D. E. Willard. Surpassing the information theoretic bound with fusion trees. J. Comput. Syst. Sci., 47(3):424--436, 1993. Google ScholarDigital Library
- S. Ganguly. Counting distinct items over update streams. Theor. Comput. Sci., 378(3):211--222, 2007. Google ScholarDigital Library
- P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB, pages 541--550, 2001. Google ScholarDigital Library
- P. B. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. SPAA, pages 281--291, 2001. Google ScholarDigital Library
- P. Indyk. Algorithms for dynamic geometric problems over data streams. In Proc. STOC, pages 373--380, 2004. Google ScholarDigital Library
- P. Indyk and D. P. Woodruff. Tight lower bounds for the distinct elements problem. In Proc. FOCS, pages 283---, 2003. Google ScholarDigital Library
- D. M. Kane, J. Nelson, and D. P. Woodruff. On the exact space complexity of sketching and streaming small norms. In Proc. SODA, pages 1161--1178, 2010. Google ScholarDigital Library
- R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995. Google ScholarCross Ref
- M. H. Overmars. The Design of Dynamic Data Structures. Springer, 1983. Google ScholarDigital Library
- S. Padmanabhan, B. Bhattacharjee, T. Malkemus, L. Cranston, and M. Huras. Multi-dimensional clustering: A new data layout scheme in db2. In SIGMOD, pages 637--641, 2003. Google ScholarDigital Library
- A. Pagh and R. Pagh. Uniform hashing in constant time and optimal space. SIAM J. Comput., 38(1):85--96, 2008. Google ScholarDigital Library
- C. R. Palmer, G. Siganos, M. Faloutsos, and C. Faloutsos. The connectivity and fault-tolerance of the internet topology. In NRDM Workshop, 2001.Google Scholar
- P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In SIGMOD, pages 23--34, 1979. Google ScholarDigital Library
- A. Shukla, P. Deshpande, J. F. Naughton, and K. Ramasamy. Storage estimation for multidimensional aggregates in the presence of hierarchies. In Proc. VLDB, pages 522--531, 1996. Google ScholarDigital Library
- A. Siegel. On universal classes of extremely random constant-time hash functions. SIAM J. Computing, 33(3):505--543, 2004. Google ScholarDigital Library
- D. P. Woodruff. Optimal space lower bounds for all frequency moments. In Proc. SODA, pages 167--175, 2004. Google ScholarDigital Library
Index Terms
- An optimal algorithm for the distinct elements problem
Recommendations
Optimal Streaming and Tracking Distinct Elements with High Probability
Special Issue on Soda'18 and Regular PapersThe distinct elements problem is one of the fundamental problems in streaming algorithms—given a stream of integers in the range { 1,… ,n}, we wish to provide a (1+ε) approximation to the number of distinct elements in the input. After a long line of ...
Range-Efficient Counting of Distinct Elements in a Massive Data Stream
Efficient one-pass estimation of $F_0$, the number of distinct elements in a data stream, is a fundamental problem arising in various contexts in databases and networking. We consider range-efficient estimation of $F_0$: estimation of the number of ...
Optimal Bounds for Johnson-Lindenstrauss Transforms and Streaming Problems with Subconstant Error
Special Issue on SODA'11The Johnson-Lindenstrauss transform is a dimensionality reduction technique with a wide range of applications to theoretical computer science. It is specified by a distribution over projection matrices from Rn → Rk where k n and states that k = O(ε−2 ...
Comments