skip to main content
10.1145/3097983.3098020acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article
Open Access

HyperLogLog Hyperextended: Sketches for Concave Sublinear Frequency Statistics

Published:04 August 2017Publication History

ABSTRACT

One of the most common statistics computed over data elements is the number of distinct keys. A thread of research pioneered by Flajolet and Martin three decades ago culminated in the design of optimal approximate counting sketches, which have size that is double logarithmic in the number of distinct keys and provide estimates with a small relative error. Moreover, the sketches are composable, and thus suitable for streamed, parallel, or distributed computation.

We consider here all statistics of the frequency distribution of keys, where a contribution of a key to the aggregate is concave and grows (sub)linearly with its frequency. These fundamental aggregations are very common in text, graphs, and logs analysis and include logarithms, low frequency moments, and cap statistics.

We design composable sketches of double-logarithmic size for all concave sublinear statistics. Our design combines theoretical optimality and practical simplicity. In a nutshell, we specify tailored mapping functions of data elements to output elements so that our target statistics on the data elements is approximated by the (max-) distinct statistics of the output elements, which can be approximated using off-the-shelf sketches. Our key insight is relating these target statistics to the complement Laplace transform of the input frequencies.

Skip Supplemental Material Section

Supplemental Material

cohen_hyperloglog.mp4

mp4

471.2 MB

References

  1. N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58:137--147, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. V. Braverman and R. Ostrovsky. Zero-one frequency laws. In STOC. ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441--453, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. E. Cohen. All-distances sketches, revisited: HIP estimators for massive graphs analysis. TKDE, 2015.Google ScholarGoogle Scholar
  5. E. Cohen. Multi-objective weighted sampling. In HotWeb. IEEE, 2015. full version: http://arxiv.org/abs/1509.07445.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. E. Cohen. Stream sampling for frequency cap statistics. In KDD. ACM, 2015. full version: http://arxiv.org/abs/1502.05955.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup. Algorithms and estimators for accurate summarization of unaggregated data streams. J. Comput. System Sci., 80, 2014.Google ScholarGoogle Scholar
  8. E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In ACM PODC, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. E. Cohen and H. Kaplan. Tighter estimation using bottom-k sketches. In Proceedings of the 34th VLDB Conference, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. E. Cohen, H. Kaplan, and S. Sen. Coordinated weighted sampling for estimating aggregates over multiple weight assignments. VLDB, 2(1--2), 2009. full: http://arxiv.org/abs/0906.4560.Google ScholarGoogle Scholar
  11. N. Duffield, M. Thorup, and C. Lund. Priority sampling for estimating arbitrary subset sums. J. Assoc. Comput. Mach., 54(6), 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. C. Estan and G. Varghese. New directions in traffic measurement and accounting. In SIGCOMM. ACM, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Analysis of Algorithms (AofA). DMTCS, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  14. P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. System Sci., 31:182--209, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. P. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In SIGMOD. ACM, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Google. Frequency capping: AdWords help, December 2014. https://support.google.com/adwords/answer/117579.Google ScholarGoogle Scholar
  17. S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT, 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. 41st IEEE Annual Symposium on Foundations of Computer Science, pages 189--197. IEEE, 2001.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Math., 26, 1984. Google ScholarGoogle ScholarCross RefCross Ref
  20. D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. D. E. Knuth. The Art of Computer Programming, Vol 2, Seminumerical Algorithms. Addison-Wesley, 1st edition, 1968.Google ScholarGoogle Scholar
  22. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. R. Morris. Counting large numbers of events in small registers. Comm. ACM, 21, 1977.Google ScholarGoogle Scholar
  24. S. Nandanwar and N. N. Murty. Structural neighborhood based classification of nodes in a network. In KDD. ACM, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. E. Ohlsson. Sequential poisson sampling. J. Official Statistics, 14(2):149--162, 1998.Google ScholarGoogle Scholar
  26. M. Osborne. Facebook Reach and Frequency Buying, October 2014. http://citizennet.com/blog/2014/10/01/facebook-reach-and-frequency-buying/.Google ScholarGoogle Scholar
  27. J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  28. B. Rosén. Asymptotic theory for successive sampling with varying probabilities without replacement, I. The Annals of Mathematical Statistics, 43(2):373--397, 1972. Google ScholarGoogle ScholarCross RefCross Ref
  29. B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135--158, 1997. Google ScholarGoogle ScholarCross RefCross Ref
  30. G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513--523, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. J.S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37--57, 1985.endthebibliography Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. HyperLogLog Hyperextended: Sketches for Concave Sublinear Frequency Statistics

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in
            • Published in

              cover image ACM Conferences
              KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
              August 2017
              2240 pages
              ISBN:9781450348874
              DOI:10.1145/3097983

              Copyright © 2017 Owner/Author

              Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 4 August 2017

              Check for updates

              Qualifiers

              • research-article

              Acceptance Rates

              KDD '17 Paper Acceptance Rate64of748submissions,9%Overall Acceptance Rate1,133of8,635submissions,13%

              Upcoming Conference

              KDD '24

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader