research-article

Open Access

HyperLogLog Hyperextended: Sketches for Concave Sublinear Frequency Statistics

Author:
Edith Cohen

Google Research, Mountain View, CA, USA

Google Research, Mountain View, CA, USA
View Profile

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data MiningAugust 2017Pages 105–114https://doi.org/10.1145/3097983.3098020

Published:04 August 2017Publication History

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 105–114

ABSTRACT

One of the most common statistics computed over data elements is the number of distinct keys. A thread of research pioneered by Flajolet and Martin three decades ago culminated in the design of optimal approximate counting sketches, which have size that is double logarithmic in the number of distinct keys and provide estimates with a small relative error. Moreover, the sketches are composable, and thus suitable for streamed, parallel, or distributed computation.

We consider here all statistics of the frequency distribution of keys, where a contribution of a key to the aggregate is concave and grows (sub)linearly with its frequency. These fundamental aggregations are very common in text, graphs, and logs analysis and include logarithms, low frequency moments, and cap statistics.

We design composable sketches of double-logarithmic size for all concave sublinear statistics. Our design combines theoretical optimality and practical simplicity. In a nutshell, we specify tailored mapping functions of data elements to output elements so that our target statistics on the data elements is approximated by the (max-) distinct statistics of the output elements, which can be approximated using off-the-shelf sketches. Our key insight is relating these target statistics to the complement Laplace transform of the input frequencies.

Supplemental Material

cohen_hyperloglog.mp4

mp4

471.2 MB

Download

References

N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58:137--147, 1999. Google ScholarDigital Library
V. Braverman and R. Ostrovsky. Zero-one frequency laws. In STOC. ACM, 2010. Google ScholarDigital Library
E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441--453, 1997. Google ScholarDigital Library
E. Cohen. All-distances sketches, revisited: HIP estimators for massive graphs analysis. TKDE, 2015.Google Scholar
E. Cohen. Multi-objective weighted sampling. In HotWeb. IEEE, 2015. full version: http://arxiv.org/abs/1509.07445.Google ScholarDigital Library
E. Cohen. Stream sampling for frequency cap statistics. In KDD. ACM, 2015. full version: http://arxiv.org/abs/1502.05955.Google ScholarDigital Library
E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup. Algorithms and estimators for accurate summarization of unaggregated data streams. J. Comput. System Sci., 80, 2014.Google Scholar
E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In ACM PODC, 2007. Google ScholarDigital Library
E. Cohen and H. Kaplan. Tighter estimation using bottom-k sketches. In Proceedings of the 34th VLDB Conference, 2008. Google ScholarDigital Library
E. Cohen, H. Kaplan, and S. Sen. Coordinated weighted sampling for estimating aggregates over multiple weight assignments. VLDB, 2(1--2), 2009. full: http://arxiv.org/abs/0906.4560.Google Scholar
N. Duffield, M. Thorup, and C. Lund. Priority sampling for estimating arbitrary subset sums. J. Assoc. Comput. Mach., 54(6), 2007. Google ScholarDigital Library
C. Estan and G. Varghese. New directions in traffic measurement and accounting. In SIGCOMM. ACM, 2002. Google ScholarDigital Library
P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Analysis of Algorithms (AofA). DMTCS, 2007.Google ScholarCross Ref
P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. System Sci., 31:182--209, 1985. Google ScholarDigital Library
P. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In SIGMOD. ACM, 1998. Google ScholarDigital Library
Google. Frequency capping: AdWords help, December 2014. https://support.google.com/adwords/answer/117579.Google Scholar
S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT, 2013.Google ScholarDigital Library
P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. 41st IEEE Annual Symposium on Foundations of Computer Science, pages 189--197. IEEE, 2001.Google ScholarDigital Library
W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Math., 26, 1984. Google ScholarCross Ref
D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS, 2010. Google ScholarDigital Library
D. E. Knuth. The Art of Computer Programming, Vol 2, Seminumerical Algorithms. Addison-Wesley, 1st edition, 1968.Google Scholar
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.Google ScholarDigital Library
R. Morris. Counting large numbers of events in small registers. Comm. ACM, 21, 1977.Google Scholar
S. Nandanwar and N. N. Murty. Structural neighborhood based classification of nodes in a network. In KDD. ACM, 2016. Google ScholarDigital Library
E. Ohlsson. Sequential poisson sampling. J. Official Statistics, 14(2):149--162, 1998.Google Scholar
M. Osborne. Facebook Reach and Frequency Buying, October 2014. http://citizennet.com/blog/2014/10/01/facebook-reach-and-frequency-buying/.Google Scholar
J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.Google ScholarCross Ref
B. Rosén. Asymptotic theory for successive sampling with varying probabilities without replacement, I. The Annals of Mathematical Statistics, 43(2):373--397, 1972. Google ScholarCross Ref
B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135--158, 1997. Google ScholarCross Ref
G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513--523, 1988. Google ScholarDigital Library
J.S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37--57, 1985.endthebibliography Google ScholarDigital Library

Index Terms

HyperLogLog Hyperextended: Sketches for Concave Sublinear Frequency Statistics
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Stream management
  2. Information systems applications

Recommendations

Stream Sampling Framework and Application for Frequency Cap Statistics

Unaggregated data, in a streamed or distributed form, are prevalent and come from diverse sources such as interactions of users with web services and IP traffic. Data elements have keys (cookies, users, queries), and elements with different keys ...
Read More
Perfect $L_p$ Sampling in a Data Stream

In this paper, we resolve the one-pass space complexity of perfect $L_p$ sampling for $p \in (0,2)$ in a stream. Given a stream of updates (insertions and deletions) to the coordinates of an underlying vector $f \in \mathbb{R}^n$, a perfect $L_p$ sampler ...
Read More
Single Pass Spectral Sparsification in Dynamic Streams

We present the first single pass algorithm for computing spectral sparsifiers for graphs in the dynamic semi-streaming model. Given a single pass over a stream containing insertions and deletions of edges to a graph $G$, our algorithm maintains a ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN:9781450348874
DOI:10.1145/3097983
General Chairs:
Stan Matwin
Dalhousie University
,
Shipeng Yu
LinkedIn
,
Faisal Farooq
IBM
Copyright © 2017 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 August 2017
Check for updates
Author Tags
concave sublinear
distributed aggregation
frequency statistics
laplace transform
streaming
Qualifiers
- research-article
Conference

Acceptance Rates
KDD '17 Paper Acceptance Rate64of748submissions,9%Overall Acceptance Rate1,133of8,635submissions,13%
More
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 5
  Total Citations
  View Citations
- 959
  Total Downloads
- Downloads (Last 12 months)66
- Downloads (Last 6 weeks)11
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HyperLogLog Hyperextended: Sketches for Concave Sublinear Frequency Statistics

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

Stream Sampling Framework and Application for Frequency Cap Statistics

Perfect $L_p$ Sampling in a Data Stream

Single Pass Spectral Sparsification in Dynamic Streams