research-article
Open Access

Stream Sampling Framework and Application for Frequency Cap Statistics

Published: 24 September 2018

Abstract

Unaggregated data, in streamed or distributed form, are prevalent and come from diverse sources such as user interactions with web services and IP traffic. Data elements have keys (cookies, users, queries), and elements with different keys interleave. Analytics on such data typically uses statistics expressed as a sum, over the keys in a specified segment, of a function f applied to the frequency (the total number of occurrences) of each key. In particular, Distinct is the number of active keys in the segment and Sum is the sum of their frequencies; both are special cases of frequency cap statistics, which cap the frequency at a parameter T (T = 1 gives Distinct, and T = ∞ gives Sum).
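As a concrete reading of this definition, the following minimal sketch computes a frequency cap statistic exactly by first aggregating the elements into key-frequency pairs. The function name `cap_statistic`, the `segment` argument, and the toy stream are illustrative assumptions, not names from the article.

```python
from collections import Counter
import math


def cap_statistic(elements, T, segment=None):
    """Exact sum, over keys in `segment`, of min(frequency, T).

    `elements` is an iterable of keys (one entry per data element).
    `segment` is an optional set of keys; None means all keys.
    """
    # Aggregation step: one counter per active key.
    freq = Counter(elements)
    return sum(
        min(w, T)
        for key, w in freq.items()
        if segment is None or key in segment
    )


# Hypothetical toy stream: key "a" occurs 3 times, "b" twice, "c" once.
stream = ["a", "b", "a", "c", "a", "b"]

distinct = cap_statistic(stream, 1)         # T = 1: number of active keys
total = cap_statistic(stream, math.inf)     # T = infinity: sum of frequencies
cap2 = cap_statistic(stream, 2)             # each key contributes at most 2
```

Note that the `Counter` holds one entry per active key, which is exactly the aggregation cost that a sampling approach tries to avoid.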

Random samples can be very effective for quick and efficient estimation of statistics at query time. Ideally, to estimate statistics for a given function f, our sample would include a key with frequency w with probability roughly proportional to f(w). The challenge is that, while such “gold-standard” samples are easy to compute after aggregating the data (computing the set of key-frequency pairs), the aggregation itself is costly: it requires a structure whose size is proportional to the number of active keys, which can be very large.
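To illustrate what such a sample looks like once the data have already been aggregated, here is a minimal sketch using independent (Poisson) sampling with inclusion probability min(1, f(w)/τ) and an inverse-probability estimate. This is one standard way to realize "probability roughly proportional to f(w)" and is an assumption of this sketch, not necessarily the scheme the article uses; the threshold `tau` and the function names are hypothetical.

```python
import random


def gold_standard_sample(freq, f, tau, rng):
    """Sample keys from aggregated data `freq` (key -> frequency).

    Each key is included independently with probability
    p = min(1, f(w) / tau), so a larger `tau` gives a smaller sample.
    Returns key -> (frequency, inclusion probability).
    """
    sample = {}
    for key, w in freq.items():
        p = min(1.0, f(w) / tau)
        if rng.random() < p:
            sample[key] = (w, p)
    return sample


def estimate(sample, g, segment=None):
    """Inverse-probability estimate of the sum of g(w) over `segment`.

    Each sampled key contributes g(w) / p, which makes the estimate
    unbiased for any g that is zero wherever f is zero.
    """
    return sum(
        g(w) / p
        for key, (w, p) in sample.items()
        if segment is None or key in segment
    )


# Hypothetical aggregated data and a cap function with T = 2.
freq = {"a": 3, "b": 2, "c": 1}
cap2 = lambda w: min(w, 2)

sample = gold_standard_sample(freq, cap2, 4.0, random.Random(0))
approx = estimate(sample, cap2)  # random; equals 5 in expectation
```

The sketch takes the aggregated dictionary `freq` as input, so it inherits the aggregation cost described above.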

We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and a structure whose size is proportional to the desired sample size. Our design unifies classic solutions for Distinct and Sum. Specifically, our ℓ-capped samples provide nonnegative unbiased estimates of any monotone non-decreasing frequency statistic, with statistical guarantees on quality that are close to the gold standard for cap statistics with T = Θ(ℓ). Furthermore, our multi-objective samples provide these statistical guarantees on quality for all concave sub-linear statistics (the nonnegative span of cap functions) while incurring only a logarithmic overhead on sample size.
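The phrase "nonnegative span of cap functions" can be made concrete for integer frequencies. The sketch below assumes a function f that is concave and non-decreasing on the integers 0..W with f(0) = 0, and writes it as a nonnegative combination of the caps min(w, t). It is a discrete illustration only; the article may define this class over real-valued caps, and the names `cap_decomposition` and `evaluate` are hypothetical.

```python
import math


def cap_decomposition(f, W):
    """Return coefficients a[t] with f(w) == sum_t a[t] * min(w, t)
    for every integer 0 <= w <= W (requires W >= 1 and f(0) == 0).

    With delta(t) = f(t) - f(t - 1), concavity makes delta non-increasing,
    so a[t] = delta(t) - delta(t + 1) >= 0 for t < W, and
    a[W] = delta(W) >= 0 because f is non-decreasing.
    """
    delta = [f(t) - f(t - 1) for t in range(1, W + 1)]  # delta[t-1] = delta(t)
    coeffs = {t: delta[t - 1] - delta[t] for t in range(1, W)}
    coeffs[W] = delta[W - 1]
    return coeffs


def evaluate(coeffs, w):
    """Evaluate the combination of cap functions at frequency w."""
    return sum(a * min(w, t) for t, a in coeffs.items())


# Example: the square root is concave, non-decreasing, and zero at 0.
coeffs = cap_decomposition(math.sqrt, 10)
reconstructed = [evaluate(coeffs, w) for w in range(11)]
```

Because every coefficient is nonnegative, a quality guarantee that holds for each cap statistic carries over to the combined statistic, which is the intuition behind covering the whole class at once.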

    • Published in

      ACM Transactions on Algorithms  Volume 14, Issue 4
      October 2018
      445 pages
      ISSN: 1549-6325
      EISSN: 1549-6333
      DOI: 10.1145/3266298

      Copyright © 2018 Owner/Author

      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 24 September 2018
      • Accepted: 1 June 2018
      • Revised: 1 December 2017
      • Received: 1 March 2017
      Qualifiers

      • research-article
      • Research
      • Refereed
