Stream Sampling Framework and Application for Frequency Cap Statistics

Author:
Edith Cohen

Google AI, CA, USA and Tel Aviv University, Israel


ACM Transactions on Algorithms, Volume 14, Issue 4, Article No. 52, pp. 1–40. https://doi.org/10.1145/3234338

Published: 24 September 2018

Abstract

Unaggregated data, in streamed or distributed form, are prevalent and come from diverse sources such as interactions of users with web services and IP traffic. Data elements have keys (cookies, users, queries), and elements with different keys interleave. Analytics on such data typically relies on statistics expressed as the sum, over the keys in a specified segment, of a function f applied to the frequency (the total number of occurrences) of each key. In particular, Distinct is the number of active keys in the segment, Sum is the sum of their frequencies, and both are special cases of frequency cap statistics, which cap the frequency at a parameter T.
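To make the definition concrete, the following is a minimal sketch that computes a frequency cap statistic exactly by first aggregating the data. The function name `cap_statistic`, the `segment` predicate, and the toy stream are illustrative assumptions, not part of the article; the sketch only shows that T = 1 yields Distinct and an unbounded T yields Sum.

```python
from collections import Counter
import math


def cap_statistic(elements, T, segment=lambda key: True):
    """Exact cap statistic: sum of min(T, frequency) over keys in the segment.

    `elements` is unaggregated data: one entry (a key) per occurrence.
    `segment` is a hypothetical predicate selecting the keys of interest.
    """
    freq = Counter(elements)  # aggregation: one counter per active key
    return sum(min(T, w) for key, w in freq.items() if segment(key))


# Toy stream with interleaved keys (frequencies: u1=3, u2=2, u3=1).
stream = ["u1", "u2", "u1", "u3", "u1", "u2"]

distinct = cap_statistic(stream, T=1)        # every active key counts once
total = cap_statistic(stream, T=math.inf)    # no cap: sum of frequencies
capped = cap_statistic(stream, T=2)          # each key contributes at most 2
```

Note that this exact computation keeps one counter per active key, so its memory use grows with the number of distinct keys.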

Random samples can be very effective for quick and efficient estimation of statistics at query time. Ideally, to estimate statistics for a given function f, our sample would include a key with frequency w with probability roughly proportional to f(w). The challenge is that while such “gold-standard” samples can be easily computed after aggregating the data (computing the set of key-frequency pairs), this aggregation is costly: it requires a structure whose size is proportional to the number of active keys, which can be very large.
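As an illustration of sampling with probability proportional to f(w) from already aggregated data, here is a minimal sketch using independent (Poisson) sampling with an inverse-probability estimator. This is one standard scheme chosen for brevity and is an assumption here; the article may use a different sampling scheme. The names `pps_sample`, `estimate`, and the threshold `tau` are hypothetical.

```python
import random
from collections import Counter


def pps_sample(freq, f, tau, rng):
    """Include each key independently with probability min(1, f(w) / tau).

    `freq` maps key -> frequency (aggregated data); a larger `tau`
    gives a smaller expected sample size.
    """
    sample = {}
    for key, w in freq.items():
        p = min(1.0, f(w) / tau)
        if rng.random() < p:
            sample[key] = (w, p)  # keep the inclusion probability for estimation
    return sample


def estimate(sample, f, segment=lambda key: True):
    """Inverse-probability estimate of the sum of f(w) over keys in the segment."""
    return sum(f(w) / p for key, (w, p) in sample.items() if segment(key))


def cap2(w):
    """Cap function with T = 2."""
    return min(2, w)


# Toy aggregated data: 100 keys with frequencies 1, 2, 3 repeating.
freq = Counter({f"k{i}": 1 + (i % 3) for i in range(100)})
truth = sum(cap2(w) for w in freq.values())

rng = random.Random(0)
one_estimate = estimate(pps_sample(freq, cap2, tau=4.0, rng=rng), cap2)
```

Each sampled key contributes f(w)/p, which equals tau whenever p < 1, so the estimate is nonnegative and unbiased for any segment chosen at query time.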

We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and a structure of size proportional to the desired sample size. Our design unifies classic solutions for Distinct and Sum. Specifically, our ℓ-capped samples provide nonnegative unbiased estimates of any monotone non-decreasing frequency statistic, and statistical guarantees on quality that are close to the gold standard for cap statistics with T = Θ(ℓ). Furthermore, our multi-objective samples provide these statistical guarantees on quality for all concave sub-linear statistics (the nonnegative span of cap functions) while incurring only a logarithmic overhead on sample size.
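The phrase "nonnegative span of cap functions" can be illustrated with a small numerical check. The sketch below assumes integer frequencies bounded by a maximum W (an assumption made only for this example) and uses the discrete identity that a concave, non-decreasing f with f(0) = 0 equals a nonnegative combination of the caps min(T, w) for T = 1..W. The helper names `cap_coefficients` and `from_caps` are hypothetical, and this is not the article's sampling algorithm.

```python
import math


def cap_coefficients(f, W):
    """Return {T: c_T} with c_T >= 0 and f(w) = sum_T c_T * min(T, w)
    for all integers 0 <= w <= W.

    Assumes f(0) = 0 and f is concave and non-decreasing on 0..W, W >= 1.
    """
    # delta[t - 1] = f(t) - f(t - 1): nonnegative and non-increasing in t.
    delta = [f(t) - f(t - 1) for t in range(1, W + 1)]
    coeffs = {}
    for T in range(1, W):
        coeffs[T] = delta[T - 1] - delta[T]  # >= 0 by concavity
    coeffs[W] = delta[W - 1]                 # >= 0 by monotonicity
    return coeffs


def from_caps(coeffs, w):
    """Evaluate the nonnegative combination of cap functions at frequency w."""
    return sum(c * min(T, w) for T, c in coeffs.items())


# Example: f(w) = log(1 + w) is concave, non-decreasing, and f(0) = 0.
coeffs = cap_coefficients(math.log1p, W=50)
max_error = max(abs(from_caps(coeffs, w) - math.log1p(w)) for w in range(51))
```

In this example `max_error` is at the level of floating-point rounding, which is consistent with the claim that guarantees for cap statistics carry over to such functions.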

Index Terms

1. Theory of computation
  1. Design and analysis of algorithms
    1. Streaming, sublinear and near linear time algorithms
      1. Sketching and sampling

Published in
ACM Transactions on Algorithms Volume 14, Issue 4
October 2018
445 pages
ISSN: 1549-6325
EISSN: 1549-6333
DOI: 10.1145/3266298
Editor:
Aravind Srinivasan
University of Maryland, USA
Copyright © 2018 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 September 2018
- Accepted: 1 June 2018
- Revised: 1 December 2017
- Received: 1 March 2017
Author Tags
Frequency statistics
distributed aggregation
stream processing
Qualifiers
- research-article
- Research
- Refereed