Abstract
The distinct elements problem is one of the fundamental problems in streaming algorithms: given a stream of integers in the range $\{1, \ldots, n\}$, we wish to provide a $(1+\varepsilon)$-approximation to the number of distinct elements in the input. After a long line of research, an optimal solution for this problem with constant probability of success, using $O(1/\varepsilon^2 + \lg n)$ bits of space, was given by Kane, Nelson, and Woodruff in 2010.
The standard approach used to achieve low failure probability $\delta$ is to take the median of $\lg \delta^{-1}$ parallel repetitions of the original algorithm. We show that such a multiplicative space blow-up is unnecessary: we provide an optimal algorithm using $O(\lg \delta^{-1}/\varepsilon^2 + \lg n)$ bits of space, matching known lower bounds for this problem. That is, the $\lg \delta^{-1}$ factor does not multiply the $\lg n$ term. This completely settles the space complexity of the distinct elements problem with respect to all standard parameters.
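The median-of-repetitions baseline described above can be sketched as follows. This is only an illustration of the standard approach, not the algorithm of this paper: the base estimator is a simple k-minimum-values (KMV) sketch used as a stand-in, and the names `KMVSketch`, `median_estimate`, `repetitions_for`, and the constant `c` are assumptions introduced for the example.

```python
import hashlib
import heapq
import math
import statistics


def _hash01(item, seed):
    """Map (item, seed) to a pseudo-random float in [0, 1)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


class KMVSketch:
    """k-minimum-values estimator (a stand-in base algorithm, hypothetical)."""

    def __init__(self, k, seed):
        self.k = k
        self.seed = seed
        self._heap = []        # negated hashes: a max-heap over the k smallest
        self._members = set()  # hashes currently stored, to skip duplicates

    def add(self, item):
        h = _hash01(item, self.seed)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Insert the new hash and evict the current largest stored hash.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            # Fewer than k distinct hashes seen: the count is exact.
            return float(len(self._heap))
        # The k-th smallest hash is about k / (number of distinct items).
        return (self.k - 1) / (-self._heap[0])


def repetitions_for(delta, c=1.0):
    """Number of parallel copies, of order lg(1/delta); c is an assumed constant."""
    return max(1, math.ceil(c * math.log2(1 / delta)))


def median_estimate(stream, k, repetitions):
    """Run independent sketches in parallel and return the median estimate."""
    sketches = [KMVSketch(k, seed) for seed in range(repetitions)]
    for item in stream:
        for sketch in sketches:
            sketch.add(item)
    return statistics.median(sketch.estimate() for sketch in sketches)
```

Every copy here stores its own full state, so the total space is the single-copy cost multiplied by the number of repetitions. That multiplicative blow-up is what the abstract says can be avoided.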
We also consider the strong tracking (or continuous monitoring) variant of the distinct elements problem, where we want an algorithm that provides an approximation of the number of distinct elements seen so far at all times of the stream. We show that this variant can be solved using $O((\lg \lg n + \lg \delta^{-1})/\varepsilon^2 + \lg n)$ bits of space, which we show to be optimal.
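To make the strong tracking requirement concrete, the following minimal sketch checks whether an estimator stays within a $(1 \pm \varepsilon)$ factor of the true distinct count after every stream update, rather than only at the end. The helper `tracks_strongly` and the two toy estimators are hypothetical names introduced for illustration; neither estimator is the algorithm of this paper.

```python
class ExactCounter:
    """Toy estimator that stores every item (exact, but uses linear space)."""

    def __init__(self):
        self._seen = set()

    def add(self, item):
        self._seen.add(item)

    def estimate(self):
        return float(len(self._seen))


class PowerOfTwoCounter:
    """Toy estimator that rounds the exact count down to a power of two."""

    def __init__(self):
        self._seen = set()

    def add(self, item):
        self._seen.add(item)

    def estimate(self):
        n = len(self._seen)
        return float(1 << (n.bit_length() - 1)) if n else 0.0


def tracks_strongly(stream, estimator, eps):
    """Return True if the estimate is within eps * truth after every update."""
    seen = set()
    for item in stream:
        seen.add(item)
        estimator.add(item)
        truth = len(seen)
        if abs(estimator.estimate() - truth) > eps * truth:
            return False  # the guarantee failed at some point of the stream
    return True
```

For example, `tracks_strongly(range(1, 101), PowerOfTwoCounter(), 0.5)` holds, while the same estimator with `eps=0.1` already fails once three distinct items have arrived. A randomized algorithm solves strong tracking if this check passes with probability at least $1 - \delta$ over its internal randomness.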
- Noga Alon, Yossi Matias, and Mario Szegedy. 1996. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing. ACM, 20--29. DOI:https://doi.org/10.1145/237814.237823
- Joshua Brody and Amit Chakrabarti. 2009. A multi-round communication lower bound for gap hamming and some consequences. In Proceedings of the Electronic Colloquium on Computational Complexity (ECCC’09). 15.
- Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P. Woodruff. 2017. BPTree: An ℓ2 heavy hitters algorithm using constant memory. In Proceedings of the 36th SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’17).
- Vladimir Braverman, Stephen R. Chestnut, David P. Woodruff, and Lin F. Yang. 2016. Streaming space complexity of nearly all functions of one variable on frequency vectors. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS’16). 261--276.
- Jarosław Błasiok, Jian Ding, and Jelani Nelson. 2017. Continuous monitoring of ℓp norms in data streams. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM’17). DOI:https://doi.org/10.4230/LIPIcs.APPROX-RANDOM.2017.32
- Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. 2002. Counting distinct elements in a data stream. In Proceedings of the 6th International Workshop on Randomization and Approximation Techniques (RANDOM’02), José D. P. Rolim and Salil P. Vadhan (Eds.), Lecture Notes in Computer Science, Vol. 2483. Springer, 1--10. DOI:https://doi.org/10.1007/3-540-45726-7_1
- Mihir Bellare and John Rompel. 1994. Randomness-efficient oblivious sampling. In Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science (FOCS’94). 276--287.
- Marianne Durand and Philippe Flajolet. 2003. Loglog Counting of Large Cardinalities. Springer, Berlin, 605--617. DOI:https://doi.org/10.1007/978-3-540-39658-1_55
- Cristian Estan, George Varghese, and Michael E. Fisk. 2006. Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Trans. Netw. 14, 5 (2006), 925--937. DOI:https://doi.org/10.1145/1217709
- Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In AofA: Analysis of Algorithms. Discrete Mathematics and Theoretical Computer Science, 137--156.
- Philippe Flajolet and G. Nigel Martin. 1983. Probabilistic counting. In Proceedings of the 24th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, 76--82. DOI:https://doi.org/10.1109/SFCS.1983.46
- D. J. H. Garling. 2007. Inequalities: A Journey into Linear Analysis. Cambridge University Press.
- Ofer Gabber and Zvi Galil. 1981. Explicit constructions of linear-sized superconcentrators. J. Comput. Syst. Sci. 22, 3 (1981), 407--420.
- Phillip B. Gibbons. 2001. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB’01). Morgan Kaufmann, 541--550.
- David Gillman. 1998. A Chernoff bound for random walks on expander graphs. SIAM J. Comput. 27, 4 (Aug. 1998), 1203--1220. DOI:https://doi.org/10.1137/S0097539794268765
- Phillip B. Gibbons and Srikanta Tirthapura. 2001. Estimating simple functions on the union of data streams. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’01). 281--291. DOI:https://doi.org/10.1145/378580.378687
- Venkatesan Guruswami, Christopher Umans, and Salil P. Vadhan. 2009. Unbalanced expanders and randomness extractors from Parvaresh-Vardy codes. J. ACM 56, 4 (2009), 20:1--20:34. DOI:https://doi.org/10.1145/1538902.1538904
- Zengfeng Huang, Wai Ming Tai, and Ke Yi. 2014. Tracking the frequency moments at all times. CoRR abs/1412.1763 (2014).
- T. S. Jayram and David P. Woodruff. 2013. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Trans. Algor. 9, 3 (2013), 26:1--26:17. DOI:https://doi.org/10.1145/2483699.2483706
- Daniel M. Kane, Jelani Nelson, and David P. Woodruff. 2010. An optimal algorithm for the distinct elements problem. In Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’10), Jan Paredaens and Dirk Van Gucht (Eds.). ACM, 41--52. DOI:https://doi.org/10.1145/1807085.1807094
- Raghu Meka. 2017. Explicit resilient functions matching Ajtai-Linial. In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’17), Philip N. Klein (Ed.). SIAM, 1132--1148. DOI:https://doi.org/10.1137/1.9781611974782.73
- Shravas Rao and Oded Regev. 2017. A sharp tail bound for the expander random sampler. CoRR abs/1703.10205 (2017). http://arxiv.org/abs/1703.10205
- Salil P. Vadhan. 2012. Pseudorandomness. Found. Trends Theor. Comput. Sci. 7, 1--3 (2012), 1--336. DOI:https://doi.org/10.1561/0400000010
- Roy Wagner. 2008. Tail estimates for sums of variables sampled by a random walk. Combin. Probab. Comput. 17, 2 (2008), 307--316. DOI:https://doi.org/10.1017/S0963548307008772
- David Woodruff. 2004. Optimal space lower bounds for all frequency moments. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’04). Society for Industrial and Applied Mathematics, Philadelphia, PA, 167--175. http://dl.acm.org/citation.cfm?id=982792.982817
- David Zuckerman. 1997. Randomness-optimal oblivious sampling. Rand. Struct. Algor. 11, 4 (1997), 345--367. DOI:https://doi.org/10.1002/(SICI)1098-2418(199712)11:4<345::AID-RSA4>3.0.CO;2-Z