Heavy hitters via cluster-preserving clustering

Authors:
Kasper Green Larsen, Aarhus University, Aarhus, Denmark
Jelani Nelson, Harvard University, Cambridge, MA
Huy L. Nguyễn, Northeastern University, Boston, MA
Mikkel Thorup, University of Copenhagen, Denmark

Communications of the ACM, Volume 62, Issue 8, August 2019, pp. 95–100. https://doi.org/10.1145/3339185

Published: 24 July 2019

Abstract

We develop a new algorithm for the heavy hitters problem in general turnstile streams, the EXPANDERSKETCH, which finds the approximate top-k items in a universe of size n using the same asymptotic O(k log n) words of memory and O(log n) update time as the COUNTMIN and COUNTSKETCH, but requires only O(k poly(log n)) time to answer queries instead of the O(n log n) time of the other two. The notion of "approximation" is the same ℓ₂ sense as the COUNTSKETCH, which, given known lower bounds, is the strongest guarantee one can achieve in sublinear memory.
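
To make the query-time gap concrete, the following is a minimal sketch of a COUNTSKETCH-style baseline, not of the EXPANDERSKETCH itself. The class name, the parameters `rows` and `width`, the simple linear hash functions, and the helper `brute_force_top_k` are illustrative assumptions rather than anything specified in the abstract. Updates touch one counter per row, but finding the top-k items means estimating every item in the universe, which is the slow query step the abstract refers to.

```python
import random
import statistics

P = (1 << 61) - 1  # Mersenne prime used as the modulus of the toy hash functions


class CountSketch:
    """Toy COUNTSKETCH: `rows` rows of `width` signed counters (integer items only)."""

    def __init__(self, rows, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(rows)]
        # Each row gets four random coefficients: (a, b) for the bucket hash
        # and (c, d) for the sign hash. These linear hashes are an assumption
        # made for illustration.
        self.params = [tuple(rng.randrange(1, P) for _ in range(4))
                       for _ in range(rows)]

    def _bucket_and_sign(self, row, item):
        a, b, c, d = self.params[row]
        bucket = ((a * item + b) % P) % self.width
        sign = 1 if ((c * item + d) % P) & 1 else -1
        return bucket, sign

    def update(self, item, delta):
        # Turnstile update: delta may be positive or negative.
        for row, counters in enumerate(self.table):
            bucket, sign = self._bucket_and_sign(row, item)
            counters[bucket] += sign * delta

    def estimate(self, item):
        # Median of the signed counter readings across rows.
        readings = []
        for row, counters in enumerate(self.table):
            bucket, sign = self._bucket_and_sign(row, item)
            readings.append(sign * counters[bucket])
        return statistics.median(readings)


def brute_force_top_k(sketch, n, k):
    # Queries all n items of the universe {0, ..., n-1}: this full scan is
    # what makes the baseline query slow.
    return sorted(range(n), key=lambda i: abs(sketch.estimate(i)),
                  reverse=True)[:k]
```

For example, after `sk.update(7, 500)` and `sk.update(42, 300)` on top of light background traffic, `brute_force_top_k(sk, n, 2)` is expected to return items 7 and 42 with high probability, but only after scanning all n candidates.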

Our main innovation is an efficient reduction from the heavy hitters problem to a clustering problem in which each heavy hitter is encoded as some form of noisy spectral cluster in a graph, and the goal is to identify every cluster. Since every heavy hitter must be found, correctness requires that every cluster be found. We thus need a "cluster-preserving clustering" algorithm that partitions the graph into pieces while finding every cluster. To do this we first apply standard spectral graph partitioning, and then we use some novel local search techniques to modify the cuts obtained so as to make sure that the original clusters are sufficiently preserved. Our clustering algorithm may be of broader interest beyond heavy hitters and streaming algorithms.
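
As a rough illustration of the "standard spectral graph partitioning" step only, the sketch below orders vertices by an approximate second eigenvector and takes the best sweep cut by conductance. It does not implement the reduction from heavy hitters or the local search that makes the clustering cluster-preserving. The function names, the adjacency-set graph representation, and the use of plain power iteration are all assumptions made for this example.

```python
import math
import random


def fiedler_order(adj, iters=500, seed=0):
    """Order vertices by an approximate second eigenvector of the lazy walk.

    adj maps each vertex to the set of its neighbours (undirected graph,
    no self-loops, no isolated vertices).
    """
    rng = random.Random(seed)
    verts = sorted(adj)
    deg = {v: len(adj[v]) for v in verts}
    total = sum(deg.values())
    # M = (I + D^{-1/2} A D^{-1/2}) / 2 is symmetric with eigenvalues in [0, 1];
    # its top eigenvector is proportional to sqrt(deg), so we project it out.
    top = {v: math.sqrt(deg[v] / total) for v in verts}
    x = {v: rng.uniform(-1.0, 1.0) for v in verts}
    for _ in range(iters):
        dot = sum(x[v] * top[v] for v in verts)
        x = {v: x[v] - dot * top[v] for v in verts}
        y = {v: 0.5 * x[v]
                + 0.5 * sum(x[u] / math.sqrt(deg[u] * deg[v]) for u in adj[v])
             for v in verts}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: y[v] / norm for v in verts}
    # Rescale by 1/sqrt(deg) to get the random-walk eigenvector, then sort.
    return sorted(verts, key=lambda v: x[v] / math.sqrt(deg[v]))


def sweep_cut(adj, order):
    """Return the prefix of `order` with the smallest conductance."""
    deg = {v: len(adj[v]) for v in adj}
    total = sum(deg.values())
    inside, vol, cut = set(), 0, 0
    best, best_phi = None, float("inf")
    for v in order[:-1]:
        # Edges from v into the prefix stop being cut edges; the rest of
        # v's edges become cut edges.
        internal = sum(1 for u in adj[v] if u in inside)
        cut += deg[v] - 2 * internal
        vol += deg[v]
        inside.add(v)
        phi = cut / min(vol, total - vol)
        if phi < best_phi:
            best, best_phi = set(inside), phi
    return best, best_phi
```

On a graph made of two dense groups joined by a single edge, `sweep_cut(adj, fiedler_order(adj))` should return one of the two groups. A plain cut like this can still slice through a cluster in noisier graphs, which is the failure mode the abstract says the local search is designed to repair.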


Index Terms

Heavy hitters via cluster-preserving clustering
1. Theory of computation
  1. Design and analysis of algorithms
    1. Algorithm design techniques
    2. Approximation algorithms analysis
      1. Facility location and clustering


Published in
Communications of the ACM Volume 62, Issue 8
August 2019
88 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/3351434
Editor:
Andrew A. Chien
Association for Computing Machinery, New York, NY
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
- Research
- Refereed