ABSTRACT
While traditional database systems optimize for performance on one-shot queries, emerging large-scale monitoring applications require continuous tracking of complex aggregates and data-distribution summaries over collections of physically-distributed streams. Thus, effective solutions have to be simultaneously space efficient (at each remote site), communication efficient (across the underlying communication network), and provide continuous, guaranteed-quality estimates. In this paper, we propose novel algorithmic solutions for the problem of continuously tracking complex holistic aggregates in such a distributed-streams setting --- our primary focus is on approximate quantile summaries, but our approach is more broadly applicable and can handle other holistic-aggregate functions (e.g., "heavy-hitters" queries). We present the first known distributed-tracking schemes for maintaining accurate quantile estimates with provable approximation guarantees, while simultaneously optimizing the storage space at each remote site as well as the communication cost across the network. In a nutshell, our algorithms employ a combination of local tracking at remote sites and simple prediction models for local site behavior in order to produce highly communication- and space-efficient solutions. We perform extensive experiments with real and synthetic data to explore the various tradeoffs and understand the role of prediction models in our schemes. The results clearly validate our approach, revealing significant savings over naive solutions as well as our analytical worst-case guarantees.
- N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. "Tracking Join and Self-Join Sizes in Limited Storage". In Proceedings of ACM PODS, 1999. Google ScholarDigital Library
- N. Alon, Y. Matias, and M. Szegedy. "The Space Complexity of Approximating the Frequency Moments". In Proceedings of ACM STOC, 1996. Google ScholarDigital Library
- B. Babcock and C. Olston. "Distributed Top-K Monitoring". In Proceedings of ACM SIGMOD, 2003. Google ScholarDigital Library
- G. Cormode and S. Muthukrishnan. "An improved data stream summary: The count-min sketch and its applications". In Proceedings of Latin American Informatics, 2004.Google ScholarCross Ref
- G. Cormode and S. Muthukrishnan. "What's Hot and What's Not: Tracking Most Frequent Items Dynamically". In Proceedings of ACM PODS, 2003. Google ScholarDigital Library
- C. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. "Gigascope: A Stream Database for Network Applications". In Proceedings of ACM SIGMOD, 2003. Google ScholarDigital Library
- Dartmouth wireless traces. (http://cmc.cs.dartmouth.-edu/data/dartmouth.html).Google Scholar
- A. Das, S. Ganguly, M. Garofalakis, and R. Rastogi. "Distributed Set-Expression Cardinality Estimation". In Proceedings of VLDB, 2004. Google ScholarDigital Library
- A. Deshpande, C. Guestrin, S. R. Madden, J. M. Hellerstein, and W. Hong. "Model-Driven Data Acquisition in Sensor Networks". In Proceedings of VLDB, 2004. Google ScholarDigital Library
- S. Ganguly, M. Garofalakis, and R. Rastogi. "Processing Set Expressions over Continuous Update Streams". In Proceedings of ACM SIGMOD, 2003. Google ScholarDigital Library
- P. B. Gibbons. "Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports". In Proceedings of VLDB, 2001. Google ScholarDigital Library
- A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. "Surfing Wavelets on Streams: One-pass Summaries for Approximate Aggregate Queries". In Proceedings of VLDB, 2001. Google ScholarDigital Library
- A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. "How to Summarize the Universe: Dynamic Maintenance of Quantiles". In Proceedings of VLDB, 2002. Google ScholarDigital Library
- M. B. Greenwald and S. Khanna. "Space-Efficient Online Computation of Quantile Summaries". In Proceedings of ACM SIGMOD, 2001. Google ScholarDigital Library
- M. B. Greenwald and S. Khanna. "Power-Conserving Computation of Order-Statistics over Sensor Networks". In Proceedings of ACM PODS, 2004. Google ScholarDigital Library
- G. H. Hardy, J. E. Littlewood, and G. Pólya. "Inequalities". Cambridge University Press, 1988. (Second Edition).Google Scholar
- Internet traffic archive. (http://ita.ee.lbl.gov/).Google Scholar
- S. R. Madden, M. J. Franklin, J. M. Hellerstein, and H. Wei. "The Design of an Acquisitional Query Processor for Sensor Networks". In Proceedings of ACM SIGMOD, 2003. Google ScholarDigital Library
- A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. "Finding (Recently) Frequent Items in Distributed Data Streams". In Proceedings of IEEE ICDE, 2005. Google ScholarDigital Library
- G. S. Manku, S. Rajagopalan, and B. G. Lindsey. "Approximate medians and other quantiles in one pass and with limited memory". In Proceedings of ACM SIGMOD, 1998. Google ScholarDigital Library
- G. S. Manku and R. Motwani. "Approximate Frequency Counts over Data Streams". In Proceedings of VLDB, 2002. Google ScholarDigital Library
- C. Olston, J. Jiang, and J. Widom. "Adaptive Filters for Continuous Queries over Distributed Data Streams". In Proceedings of ACM SIGMOD, 2003. Google ScholarDigital Library
- N. Shrivastava, C. Buragohain, D. Agrawal, and S. Suri. "Medians and beyond: New aggregation techniques for sensor networks". In Proceedings of ACM SenSys, 2004. Google ScholarDigital Library
Recommendations
Incremental computation of common windowed holistic aggregates
Windowed aggregates are a SQL 2003 feature for computing aggregates in moving windows. Common examples include cumulative sums, local maxima and moving quantiles. With the advent over the last few years of easy-to-use data analytics tools, these ...
On computing correlated aggregates over continual data streams
In many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such ...
On computing correlated aggregates over continual data streams
SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of dataIn many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such ...
Comments