Abstract
A fundamental problem in data management is to draw and maintain a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed sites. The main challenge is to ensure that a sample is drawn uniformly across the union of the data while minimizing the communication needed to run the protocol on the evolving data. At the same time, it is also necessary to make the protocol lightweight, by keeping the space and time costs low for each participant. In this article, we present communication-efficient protocols for continuously maintaining a sample (both with and without replacement) from k distributed streams. These apply to the case when we want a sample from the full streams, and to the sliding window cases of only the W most recent elements, or arrivals within the last w time units. We show that our protocols are optimal (up to logarithmic factors), not just in terms of the communication used, but also the time and space costs for each participant.
- Arackaparambil, C., Brody, J., and Chakrabarti, A. 2009. Functional monitoring without monotonicity. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP). Google ScholarDigital Library
- Babcock, B., Datar, M., and Motwani, R. 2002. Sampling from a moving window over streaming data. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 633--634. Google ScholarDigital Library
- Babcock, B. and Olston, C. 2003. Distributed top-k monitoring. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Braverman, V., Ostrovsky, R., and Zaniolo, C. 2009. Optimal sampling from sliding windows. In Proceedings of the ACM Principles of Database Systems. Google ScholarDigital Library
- Chazelle, B. 2000. The Discrepancy Method. Cambridge University Press. Google ScholarDigital Library
- Cormode, G. and Garofalakis, M. 2005. Sketching streams through the net: Distributed approximate query tracking. In Proceedings of the International Conference on Very Large Data Bases. Google ScholarDigital Library
- Cormode, G., Garofalakis, M., Muthukrishnan, S., and Rastogi, R. 2005. Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Cormode, G., Muthukrishnan, S., and Yi, K. 2008. Algorithms for distributed, functional monitoring. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. Google ScholarDigital Library
- Cormode, G., Muthukrishnan, S., and Zhuang, W. 2006. What's different: Distributed, continuous monitoring of duplicate resilient aggregates on data streams. In Proceedings of the IEEE International Conference on Data Engineering. Google ScholarDigital Library
- Duffield, N., Lund, C., and Thorup, M. 2003. Estimating flow distributions from sampled flow statistics. In Proceedings of ACM SIGCOMM. Google ScholarDigital Library
- Efraimidis, P. S. and Spirakis, P. G. 2006. Weighted random sampling with a reservoir. Inf. Process. Lett. 97, 181--185. Google ScholarDigital Library
- Frahling, G., Indyk, P., and Sohler, C. 2005. Sampling in dynamic data streams and applications. In Proceedings of the Symposium on Computational Geometry. Google ScholarDigital Library
- Gemulla, R. and Lehner, W. 2008. Sampling time-based sliding windows in bounded space. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 379--392. Google ScholarDigital Library
- Gemulla, R., Lehner, W., and Haas, P. J. 2007. Maintaining Bernoulli samples over evolving multisets. In Proceedings of the ACM Principles of Database Systems. 93--102. Google ScholarDigital Library
- Gemulla, R., Lehner, W., and Haas, P. J. 2008. Maintaining bounded-size sample synopses of evolving datasets. VLDB J. 17, 2, 173--202. Google ScholarDigital Library
- Gibbons, P. 2001. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proceedings of the International Conference on Very Large Data Bases. 541--550. Google ScholarDigital Library
- Gibbons, P. and Matias, Y. 1998. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 331--342. Google ScholarDigital Library
- Greenwald, M. and Khanna, S. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Haussler, D. and Welzl, E. 1987. Epsilon-nets and simplex range queries. Disc. Computat. Geom. 2, 127--151.Google ScholarDigital Library
- Huang, L., Nguyen, X., Garofalakis, M., Hellerstein, J., Joseph, A. D., Jordan, M., and Taft, N. 2007. Communication-efficient online detection of network-wide anomalies. In Proceedings of the IEEE INFOCOMM.Google Scholar
- Keralapura, R., Cormode, G., and Ramamirtham, J. 2006. Communication-efficient distributed monitoring of thresholded counts. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Knuth, D. E. 1998. The Art of Computer Programming, Vol. 2, Seminumerical Algorithms 2nd Ed. Addison-Wesley. Google ScholarDigital Library
- Manjhi, A., Shkapenyuk, V., Dhamdhere, K., and Olston, C. 2005. Finding (recently) frequent items in distributed data streams. In Proceedings of the IEEE International Conference on Data Engineering. 767--778. Google ScholarDigital Library
- Metwally, A., Agrawal, D., and Abbadi, A. E. 2006. An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans. Datab. Syst. 31, 3, 1095--1133. Google ScholarDigital Library
- Muthukrishnan, S. 2003. Data streams: Algorithms and applications. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. Google ScholarDigital Library
- Muthukrishnan, S. 2009. Stochastic data streams. In Proceedings of the Symposium on Mathematical Foundations of Computer Science. Google ScholarDigital Library
- Olken, F. 1997. Random sampling from databases. Ph.D. thesis, Berkeley.Google Scholar
- Sharfman, I., Schuster, A., and Keren, D. 2006. A geometric approach to monitoring threshold functions over distributed data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Sharfman, I., Schuster, A., and Keren, D. 2010. Shape sensitive geometric monitoring. In Proceedings of the Symposium on ACM Principles of Database Systems. Google ScholarDigital Library
- Vapnik, V. N. and Chervonenkis, A. Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob. Appl. 16, 264--280.Google ScholarCross Ref
- Vitter, J. S. 1985. Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 1, 37--57. Google ScholarDigital Library
- Yi, K. and Zhang, Q. 2009. Optimal tracking of distributed heavy hitters and quantiles. In Proceedings of the Symposium on ACM Principles of Database Systems. 167--174. Google ScholarDigital Library
Index Terms
- Continuous sampling from distributed streams
Recommendations
Optimal sampling from distributed streams
PODS '10: Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsA fundamental problem in data management is to draw a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is ...
Weighted Reservoir Sampling from Distributed Streams
PODS '19: Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database SystemsWe consider message-efficient continuous random sampling from a distributed stream, where the probability of inclusion of an item in the sample is proportional to a weight associated with the item. The unweighted version, where all weights are equal, is ...
Sampling over Union of Joins
SIGMOD '23: Companion of the 2023 International Conference on Management of DataData scientists often draw on multiple relational data sources for analysis. A standard assumption in learning and approximate query answering is that the data is a uniform and independent sample of the underlying distribution. To avoid the cost of join ...
Comments