Abstract
Unaggregated data, in a streamed or distributed form, are prevalent and come from diverse sources such as interactions of users with web services and IP traffic. Data elements have keys (cookies, users, queries), and elements with different keys interleave. Analytics on such data typically utilizes statistics expressed as a sum over keys in a specified segment of a function f applied to the frequency (the total number of occurrences) of the key. In particular, Distinct is the number of active keys in the segment, Sum is the sum of their frequencies, and both are special cases of frequency cap statistics, which cap the frequency by a parameter T.
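As an illustration of these definitions, the following minimal sketch computes a frequency cap statistic by brute-force aggregation, which is exactly the costly step that sampling is meant to avoid. The function name `cap_statistic` and the toy stream are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter

def cap_statistic(elements, T, segment=None):
    """Sum over keys in the segment of min(frequency, T).

    Aggregates the whole stream first, so memory grows with the
    number of active keys. `segment` is an optional set of keys;
    None means all keys.
    """
    freq = Counter(elements)  # key -> total number of occurrences
    return sum(
        min(w, T)
        for key, w in freq.items()
        if segment is None or key in segment
    )

# Toy unaggregated stream: elements with different keys interleave.
stream = ["a", "b", "a", "c", "a", "b"]

distinct = cap_statistic(stream, T=1)           # cap at 1 counts active keys
total = cap_statistic(stream, T=float("inf"))   # no cap gives the sum of frequencies
cap2 = cap_statistic(stream, T=2)               # frequencies 3, 2, 1 capped at 2
```

With T=1 the statistic reduces to Distinct, and with an unbounded T it reduces to Sum, matching the two special cases named above.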
Random samples can be very effective for quick and efficient estimation of statistics at query time. Ideally, to estimate statistics for a given function f, our sample would include a key with frequency w with probability roughly proportional to f(w). The challenge is that while such “gold-standard” samples can be easily computed after aggregating the data (computing the set of key-frequency pairs), this aggregation is costly: it requires a structure whose size is proportional to the number of active keys, which can be very large.
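To make the idea of a sample drawn with probability proportional to f(w) concrete, here is a minimal sketch that assumes the data have already been aggregated into key-frequency pairs. It uses generic Poisson sampling with inclusion probability min(1, f(w)/tau) and an inverse-probability (Horvitz-Thompson style) estimate. It is not the framework presented in this paper, and the names `pps_sample`, `estimate`, and the threshold `tau` are hypothetical.

```python
import random
from collections import Counter

def pps_sample(freq, f, tau, rng):
    """Include each key independently with probability min(1, f(w) / tau).

    `freq` maps key -> frequency (already aggregated). A larger `tau`
    yields a smaller expected sample size. Returns key -> (w, p).
    """
    sample = {}
    for key, w in freq.items():
        p = min(1.0, f(w) / tau)
        if rng.random() < p:
            sample[key] = (w, p)
    return sample

def estimate(sample, g, segment=None):
    """Inverse-probability estimate of the sum of g(w) over keys in the segment."""
    return sum(
        g(w) / p
        for key, (w, p) in sample.items()
        if segment is None or key in segment
    )

freq = Counter(["a", "b", "a", "c", "a", "b"])

def cap2(w):
    """Cap function with T = 2."""
    return min(w, 2)

# With tau = 1 every key has inclusion probability 1, so the estimate is exact.
full = pps_sample(freq, cap2, tau=1.0, rng=random.Random(0))
exact = estimate(full, cap2)

# With a larger tau only some keys are kept, and the estimate is
# unbiased over the randomness of the sample.
small = pps_sample(freq, cap2, tau=4.0, rng=random.Random(0))
approx = estimate(small, cap2)
```

Each sampled key contributes g(w)/p, so keys that were unlikely to be included are weighted up when they do appear, which is what keeps the estimate unbiased.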
We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and structure size proportional to the desired sample size. Our design unifies classic solutions for Distinct and Sum. Specifically, our ℓ-capped samples provide nonnegative unbiased estimates of any monotone non-decreasing frequency statistic and statistical guarantees on quality that are close to gold standard for cap statistics with T=Θ(ℓ). Furthermore, our multi-objective samples provide these statistical guarantees on quality for all concave sub-linear statistics (the nonnegative span of cap functions) while incurring only a logarithmic overhead on sample size.
Index Terms
- Stream Sampling Framework and Application for Frequency Cap Statistics