research-article

Mergeable summaries

Published: 04 December 2013

Abstract

We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two datasets, there is a way to merge the two summaries into a single summary on the two datasets combined together, while preserving the error and size guarantees. This property means that the summaries can be merged in a way akin to other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the datasets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this article, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^{3/2}(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations which permit approximate multidimensional range counting queries. While most of the results in this article are theoretical in nature, some of the algorithms are actually very simple and even perform better than the previously best known algorithms, which we demonstrate through experiments in a simulated sensor network.

We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
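The abstract does not spell out a merge procedure, so the following is only a minimal sketch of how two Misra-Gries (MG) style heavy-hitter summaries are commonly merged: add the counters item by item, then subtract the (k+1)-th largest counter from every counter and keep only the positive ones. The function names (`mg_build`, `mg_merge`) and the choice of k = ⌈1/ε⌉ counters are assumptions made for illustration, not names or parameters taken from the article.

```python
from collections import Counter
from math import ceil


def mg_build(stream, k):
    """Build a Misra-Gries summary holding at most k counters."""
    summary = {}
    for item in stream:
        if item in summary:
            summary[item] += 1
        elif len(summary) < k:
            summary[item] = 1
        else:
            # Summary is full: decrement every counter and drop zeros.
            # The incoming item is discarded in the same step.
            for key in list(summary):
                summary[key] -= 1
                if summary[key] == 0:
                    del summary[key]
    return summary


def mg_merge(a, b, k):
    """Merge two k-counter summaries into one with at most k counters."""
    merged = Counter(a)
    merged.update(b)  # Counter.update adds counts item by item
    if len(merged) <= k:
        return dict(merged)
    # Subtract the (k+1)-th largest counter; at most k stay positive.
    cut = sorted(merged.values(), reverse=True)[k]
    return {key: c - cut for key, c in merged.items() if c > cut}


if __name__ == "__main__":
    eps = 0.25
    k = ceil(1 / eps)  # assumed sizing: O(1/eps) counters
    left = [1] * 40 + [2] * 25 + list(range(100, 135))
    right = [1] * 30 + [3] * 20 + list(range(200, 250))
    summary = mg_merge(mg_build(left, k), mg_build(right, k), k)
    print(summary)
```

Under these assumptions each stored count underestimates the true frequency by at most n/(k+1), where n is the combined number of items, and the merged summary is no larger than either input. That is the sense in which the error and size guarantees survive the merge.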

    • Published in

      ACM Transactions on Database Systems  Volume 38, Issue 4
      Invited papers issue
      November 2013
      294 pages
      ISSN:0362-5915
      EISSN:1557-4644
      DOI:10.1145/2539032

      Copyright © 2013 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 4 December 2013
      • Accepted: 1 June 2013
      • Revised: 1 April 2013
      • Received: 1 October 2012

      Qualifiers

      • research-article
      • Research
      • Refereed
