Optimal Streaming and Tracking Distinct Elements with High Probability

Published: 05 December 2019

Abstract

The distinct elements problem is one of the fundamental problems in streaming algorithms: given a stream of integers in the range $\{1, \ldots, n\}$, we wish to provide a $(1+\varepsilon)$-approximation to the number of distinct elements in the input. After a long line of research, an optimal solution for this problem with constant probability of success, using $O(1/\varepsilon^2 + \lg n)$ bits of space, was given by Kane, Nelson, and Woodruff in 2010.

The standard approach used to achieve low failure probability $\delta$ is to take the median of $\lg \delta^{-1}$ parallel repetitions of the original algorithm. We show that such a multiplicative space blow-up is unnecessary: we provide an optimal algorithm using $O(\lg \delta^{-1}/\varepsilon^2 + \lg n)$ bits of space, matching known lower bounds for this problem. That is, the $\lg \delta^{-1}$ factor does not multiply the $\lg n$ term. This completely settles the space complexity of the distinct elements problem with respect to all standard parameters.
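The median-of-repetitions approach described above can be sketched as follows. This is only an illustration of the standard amplification technique, not the algorithm of this paper or of Kane, Nelson, and Woodruff: the base estimator is a simple k-minimum-values sketch chosen as a stand-in, and the names `KMVSketch` and `median_of_repetitions`, the hash construction, and the constants in `k` and `reps` are all assumptions made for the example.

```python
import hashlib
import heapq
import math
import statistics


def _hash01(item, seed):
    """Map (seed, item) to a pseudo-random float in [0, 1)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


class KMVSketch:
    """Stand-in base estimator: keeps the k smallest distinct hash values."""

    def __init__(self, k, seed):
        self.k = k
        self.seed = seed
        self._heap = []        # max-heap of kept hashes, stored negated
        self._members = set()  # kept hashes, for duplicate detection

    def add(self, item):
        h = _hash01(item, self.seed)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Insert h and evict the current largest kept hash.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes seen
        return (self.k - 1) / (-self._heap[0])


def median_of_repetitions(stream, eps, delta):
    """Run independent sketches in parallel and return the median estimate."""
    k = math.ceil(4 / eps**2)                        # illustrative constant
    reps = 2 * math.ceil(math.log2(1 / delta)) + 1   # odd, about lg(1/delta)
    sketches = [KMVSketch(k, seed) for seed in range(reps)]
    for item in stream:
        for sketch in sketches:
            sketch.add(item)
    return statistics.median(sketch.estimate() for sketch in sketches)
```

Note that every repetition stores its own full sketch, so the total space grows by the factor `reps`. That multiplicative overhead is exactly what the abstract says can be avoided.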

We also consider the strong tracking (or continuous monitoring) variant of the distinct elements problem, where we want an algorithm that provides an approximation of the number of distinct elements seen so far, at all times of the stream. We show that this variant can be solved using $O((\lg \lg n + \lg \delta^{-1})/\varepsilon^2 + \lg n)$ bits of space, which we show to be optimal.
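One way to make the difference between the two guarantees explicit is shown below. The notation is assumed here for illustration and does not come from the abstract: $m$ is the stream length, $N_t$ is the number of distinct elements among the first $t$ stream items, and $\tilde{N}_t$ is the algorithm's estimate after those $t$ items.

```latex
% Single-query guarantee: the estimate is accurate at any one fixed time t.
\forall t \in \{1, \ldots, m\}: \quad
  \Pr\left[\, |\tilde{N}_t - N_t| \le \varepsilon N_t \,\right] \ge 1 - \delta

% Strong tracking guarantee: the estimate is accurate at all times simultaneously.
\Pr\left[\, \forall t \in \{1, \ldots, m\}: \; |\tilde{N}_t - N_t| \le \varepsilon N_t \,\right] \ge 1 - \delta
```

The only change is the position of the quantifier over $t$: in the strong tracking variant a single failure event of probability at most $\delta$ must cover every prefix of the stream.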

References

  1. Noga Alon, Yossi Matias, and Mario Szegedy. 1996. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing. ACM, 20--29. DOI:https://doi.org/10.1145/237814.237823
  2. Joshua Brody and Amit Chakrabarti. 2009. A multi-round communication lower bound for gap hamming and some consequences. In Proceedings of the Electronic Colloquium on Computational Complexity (ECCC’09). 15.
  3. Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P. Woodruff. 2017. BPTree: An ℓ2 heavy hitters algorithm using constant memory. In Proceedings of the 36th SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’17).
  4. Vladimir Braverman, Stephen R. Chestnut, David P. Woodruff, and Lin F. Yang. 2016. Streaming space complexity of nearly all functions of one variable on frequency vectors. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS’16). 261--276.
  5. Jarosław Błasiok, Jian Ding, and Jelani Nelson. 2017. Continuous monitoring of ℓp norms in data streams. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM’17). DOI:https://doi.org/10.4230/LIPIcs.APPROX-RANDOM.2017.32
  6. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. 2002. Counting distinct elements in a data stream. In Proceedings of the 6th International Workshop on Randomization and Approximation Techniques (RANDOM’02), José D. P. Rolim and Salil P. Vadhan (Eds.), Lecture Notes in Computer Science, Vol. 2483. Springer, 1--10. DOI:https://doi.org/10.1007/3-540-45726-7_1
  7. Mihir Bellare and John Rompel. 1994. Randomness-efficient oblivious sampling. In Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science (FOCS’94). 276--287.
  8. Marianne Durand and Philippe Flajolet. 2003. Loglog Counting of Large Cardinalities. Springer, Berlin, 605--617. DOI:https://doi.org/10.1007/978-3-540-39658-1_55
  9. Cristian Estan, George Varghese, and Michael E. Fisk. 2006. Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Trans. Netw. 14, 5 (2006), 925--937. DOI:https://doi.org/10.1145/1217709
  10. Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In AofA: Analysis of Algorithms. Discrete Mathematics and Theoretical Computer Science, 137--156.
  11. Philippe Flajolet and G. Nigel Martin. 1983. Probabilistic counting. In Proceedings of the 24th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, 76--82. DOI:https://doi.org/10.1109/SFCS.1983.46
  12. D. J. H. Garling. 2007. Inequalities: A Journey into Linear Analysis. Cambridge University Press.
  13. Ofer Gabber and Zvi Galil. 1981. Explicit constructions of linear-sized superconcentrators. J. Comput. Syst. Sci. 22, 3 (1981), 407--420.
  14. Phillip B. Gibbons. 2001. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB’01). Morgan Kaufmann, 541--550.
  15. David Gillman. 1998. A Chernoff bound for random walks on expander graphs. SIAM J. Comput. 27, 4 (Aug. 1998), 1203--1220. DOI:https://doi.org/10.1137/S0097539794268765
  16. Phillip B. Gibbons and Srikanta Tirthapura. 2001. Estimating simple functions on the union of data streams. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’01). 281--291. DOI:https://doi.org/10.1145/378580.378687
  17. Venkatesan Guruswami, Christopher Umans, and Salil P. Vadhan. 2009. Unbalanced expanders and randomness extractors from Parvaresh-Vardy codes. J. ACM 56, 4 (2009), 20:1--20:34. DOI:https://doi.org/10.1145/1538902.1538904
  18. Zengfeng Huang, Wai Ming Tai, and Ke Yi. 2014. Tracking the frequency moments at all times. CoRR abs/1412.1763 (2014).
  19. T. S. Jayram and David P. Woodruff. 2013. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Trans. Algor. 9, 3 (2013), 26:1--26:17. DOI:https://doi.org/10.1145/2483699.2483706
  20. Daniel M. Kane, Jelani Nelson, and David P. Woodruff. 2010. An optimal algorithm for the distinct elements problem. In Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’10), Jan Paredaens and Dirk Van Gucht (Eds.). ACM, 41--52. DOI:https://doi.org/10.1145/1807085.1807094
  21. Raghu Meka. 2017. Explicit resilient functions matching Ajtai-Linial. In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’17), Philip N. Klein (Ed.). SIAM, 1132--1148. DOI:https://doi.org/10.1137/1.9781611974782.73
  22. Shravas Rao and Oded Regev. 2017. A sharp tail bound for the expander random sampler. CoRR abs/1703.10205 (2017). http://arxiv.org/abs/1703.10205
  23. Salil P. Vadhan. 2012. Pseudorandomness. Found. Trends Theor. Comput. Sci. 7, 1--3 (2012), 1--336. DOI:https://doi.org/10.1561/0400000010
  24. Roy Wagner. 2008. Tail estimates for sums of variables sampled by a random walk. Combin. Probab. Comput. 17, 2 (2008), 307--316. DOI:https://doi.org/10.1017/S0963548307008772
  25. David Woodruff. 2004. Optimal space lower bounds for all frequency moments. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’04). Society for Industrial and Applied Mathematics, Philadelphia, PA, 167--175. http://dl.acm.org/citation.cfm?id=982792.982817
  26. David Zuckerman. 1997. Randomness-optimal oblivious sampling. Rand. Struct. Algor. 11, 4 (1997), 345--367. DOI:https://doi.org/10.1002/(SICI)1098-2418(199712)11:4<345::AID-RSA4>3.0.CO;2-Z

      • Published in

        ACM Transactions on Algorithms, Volume 16, Issue 1
        Special Issue on SODA'18 and Regular Papers
        January 2020
        369 pages
        ISSN: 1549-6325
        EISSN: 1549-6333
        DOI: 10.1145/3372373

        Copyright © 2019 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 5 December 2019
        • Revised: 1 January 2019
        • Accepted: 1 January 2019
        • Received: 1 April 2018

        Qualifiers

        • research-article
        • Research
        • Refereed
