ABSTRACT
One of the most common statistics computed over data elements is the number of distinct keys. A thread of research pioneered by Flajolet and Martin three decades ago culminated in the design of optimal approximate counting sketches, which have size that is double logarithmic in the number of distinct keys and provide estimates with a small relative error. Moreover, the sketches are composable, and thus suitable for streamed, parallel, or distributed computation.
We consider here all statistics of the frequency distribution of keys, where a contribution of a key to the aggregate is concave and grows (sub)linearly with its frequency. These fundamental aggregations are very common in text, graphs, and logs analysis and include logarithms, low frequency moments, and cap statistics.
We design composable sketches of double-logarithmic size for all concave sublinear statistics. Our design combines theoretical optimality and practical simplicity. In a nutshell, we specify tailored mapping functions of data elements to output elements so that our target statistics on the data elements is approximated by the (max-) distinct statistics of the output elements, which can be approximated using off-the-shelf sketches. Our key insight is relating these target statistics to the complement Laplace transform of the input frequencies.
Supplemental Material
- N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58:137--147, 1999. Google ScholarDigital Library
- V. Braverman and R. Ostrovsky. Zero-one frequency laws. In STOC. ACM, 2010. Google ScholarDigital Library
- E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441--453, 1997. Google ScholarDigital Library
- E. Cohen. All-distances sketches, revisited: HIP estimators for massive graphs analysis. TKDE, 2015.Google Scholar
- E. Cohen. Multi-objective weighted sampling. In HotWeb. IEEE, 2015. full version: http://arxiv.org/abs/1509.07445.Google ScholarDigital Library
- E. Cohen. Stream sampling for frequency cap statistics. In KDD. ACM, 2015. full version: http://arxiv.org/abs/1502.05955.Google ScholarDigital Library
- E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup. Algorithms and estimators for accurate summarization of unaggregated data streams. J. Comput. System Sci., 80, 2014.Google Scholar
- E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In ACM PODC, 2007. Google ScholarDigital Library
- E. Cohen and H. Kaplan. Tighter estimation using bottom-k sketches. In Proceedings of the 34th VLDB Conference, 2008. Google ScholarDigital Library
- E. Cohen, H. Kaplan, and S. Sen. Coordinated weighted sampling for estimating aggregates over multiple weight assignments. VLDB, 2(1--2), 2009. full: http://arxiv.org/abs/0906.4560.Google Scholar
- N. Duffield, M. Thorup, and C. Lund. Priority sampling for estimating arbitrary subset sums. J. Assoc. Comput. Mach., 54(6), 2007. Google ScholarDigital Library
- C. Estan and G. Varghese. New directions in traffic measurement and accounting. In SIGCOMM. ACM, 2002. Google ScholarDigital Library
- P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Analysis of Algorithms (AofA). DMTCS, 2007.Google ScholarCross Ref
- P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. System Sci., 31:182--209, 1985. Google ScholarDigital Library
- P. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In SIGMOD. ACM, 1998. Google ScholarDigital Library
- Google. Frequency capping: AdWords help, December 2014. https://support.google.com/adwords/answer/117579.Google Scholar
- S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT, 2013.Google ScholarDigital Library
- P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. 41st IEEE Annual Symposium on Foundations of Computer Science, pages 189--197. IEEE, 2001.Google ScholarDigital Library
- W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Math., 26, 1984. Google ScholarCross Ref
- D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS, 2010. Google ScholarDigital Library
- D. E. Knuth. The Art of Computer Programming, Vol 2, Seminumerical Algorithms. Addison-Wesley, 1st edition, 1968.Google Scholar
- T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.Google ScholarDigital Library
- R. Morris. Counting large numbers of events in small registers. Comm. ACM, 21, 1977.Google Scholar
- S. Nandanwar and N. N. Murty. Structural neighborhood based classification of nodes in a network. In KDD. ACM, 2016. Google ScholarDigital Library
- E. Ohlsson. Sequential poisson sampling. J. Official Statistics, 14(2):149--162, 1998.Google Scholar
- M. Osborne. Facebook Reach and Frequency Buying, October 2014. http://citizennet.com/blog/2014/10/01/facebook-reach-and-frequency-buying/.Google Scholar
- J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.Google ScholarCross Ref
- B. Rosén. Asymptotic theory for successive sampling with varying probabilities without replacement, I. The Annals of Mathematical Statistics, 43(2):373--397, 1972. Google ScholarCross Ref
- B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135--158, 1997. Google ScholarCross Ref
- G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513--523, 1988. Google ScholarDigital Library
- J.S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37--57, 1985.endthebibliography Google ScholarDigital Library
Index Terms
- HyperLogLog Hyperextended: Sketches for Concave Sublinear Frequency Statistics
Recommendations
Stream Sampling Framework and Application for Frequency Cap Statistics
Unaggregated data, in a streamed or distributed form, are prevalent and come from diverse sources such as interactions of users with web services and IP traffic. Data elements have keys (cookies, users, queries), and elements with different keys ...
Perfect $L_p$ Sampling in a Data Stream
In this paper, we resolve the one-pass space complexity of perfect $L_p$ sampling for $p \in (0,2)$ in a stream. Given a stream of updates (insertions and deletions) to the coordinates of an underlying vector $f \in \mathbb{R}^n$, a perfect $L_p$ sampler ...
Single Pass Spectral Sparsification in Dynamic Streams
We present the first single pass algorithm for computing spectral sparsifiers for graphs in the dynamic semi-streaming model. Given a single pass over a stream containing insertions and deletions of edges to a graph $G$, our algorithm maintains a ...
Comments