Abstract
We develop a new algorithm for the heavy hitters problem in general turnstile streams, the EXPANDERSKETCH, which finds the approximate top-k items in a universe of size n using the same asymptotic O(k log n) words of memory and O(log n) update time as the COUNTMIN and COUNTSKETCH, but requiring only O(k poly(log n)) time to answer queries instead of the O(n log n) time of the other two. The notion of "approximation" is the same ℓ2 sense as the COUNTSKETCH, which, given known lower bounds, is the strongest guarantee one can achieve in sublinear memory.
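To make the query-time comparison concrete, here is a minimal toy sketch of a COUNTSKETCH-style structure with the naive query that estimates every item in the universe. It is not the EXPANDERSKETCH and not the authors' code. The class name, the linear hash family modulo a Mersenne prime, and the parameters `rows`, `width`, and `seed` are all assumptions chosen for illustration.

```python
import heapq
import random
import statistics

P = (1 << 61) - 1  # Mersenne prime used as the modulus for the hash families (assumed choice)


class CountSketch:
    """Toy CountSketch: `rows` independent hash rows of `width` signed counters."""

    def __init__(self, rows, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(rows)]
        # One (a, b) pair per row for the bucket hash and one for the sign hash.
        self.bucket_params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(rows)]
        self.sign_params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(rows)]

    def _bucket(self, r, item):
        a, b = self.bucket_params[r]
        return ((a * item + b) % P) % self.width

    def _sign(self, r, item):
        a, b = self.sign_params[r]
        return 1 if ((a * item + b) % P) & 1 else -1

    def update(self, item, delta):
        """Turnstile update: delta may be negative (a deletion)."""
        for r in range(len(self.table)):
            self.table[r][self._bucket(r, item)] += self._sign(r, item) * delta

    def estimate(self, item):
        """Median over rows of the signed counter that `item` hashes to."""
        return statistics.median(
            self._sign(r, item) * self.table[r][self._bucket(r, item)]
            for r in range(len(self.table))
        )

    def top_k(self, n, k):
        """Naive query: estimate all n items, so the cost grows linearly with n."""
        return heapq.nlargest(k, range(n), key=self.estimate)


# Usage: a stream over a universe of 1000 items with two heavy items and one
# item that is inserted heavily and then deleted again.
n = 1000
sketch = CountSketch(rows=5, width=512)
for i in range(n):
    sketch.update(i, 1)
sketch.update(7, 500)
sketch.update(42, 300)
sketch.update(99, 400)
sketch.update(99, -400)
top = sketch.top_k(n, 2)
```

Each update touches one counter per row, while `top_k` calls `estimate` once for every item in the universe. That full scan is the query cost the abstract contrasts with the O(k poly(log n)) query time of the EXPANDERSKETCH.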
Our main innovation is an efficient reduction from the heavy hitters problem to a clustering problem in which each heavy hitter is encoded as some form of noisy spectral cluster in a graph, and the goal is to identify every cluster. Since every heavy hitter must be found, correctness requires that every cluster be found. We thus need a "cluster-preserving clustering" algorithm that partitions the graph into pieces while finding every cluster. To do this we first apply standard spectral graph partitioning, and then we use some novel local search techniques to modify the cuts obtained so as to make sure that the original clusters are sufficiently preserved. Our clustering algorithm may be of broader interest beyond heavy hitters and streaming algorithms.
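As a rough illustration of the "standard spectral graph partitioning" step only, the following sketch runs a textbook sweep cut over an approximate second eigenvector of the lazy random walk. It does not include the local search that makes the paper's clustering cluster-preserving. The function name `sweep_cut`, the power-iteration approach, and the example graph are assumptions made for this sketch.

```python
def sweep_cut(adj, iterations=200):
    """Return (S, conductance) from a sweep over an approximate second eigenvector.

    adj maps each vertex 0..n-1 to the list of its neighbours
    (undirected, connected, no self-loops).
    """
    n = len(adj)
    deg = [len(adj[v]) for v in range(n)]
    vol = sum(deg)

    # Power iteration on the lazy random walk (I + D^-1 A) / 2, projecting out
    # the trivial all-ones eigenvector in the degree-weighted inner product.
    x = [float(v) for v in range(n)]
    for _ in range(iterations):
        mean = sum(deg[v] * x[v] for v in range(n)) / vol
        x = [xv - mean for xv in x]
        x = [0.5 * x[v] + 0.5 * sum(x[u] for u in adj[v]) / deg[v] for v in range(n)]
        norm = max(abs(xv) for xv in x)
        x = [xv / norm for xv in x]

    # Sweep: try every proper prefix of the vertices sorted by eigenvector value.
    order = sorted(range(n), key=lambda v: x[v])
    in_s = [False] * n
    cut = vol_s = 0
    best = (None, float("inf"))
    for i, v in enumerate(order[:-1]):
        in_s[v] = True
        vol_s += deg[v]
        # Edges to vertices already in S stop crossing the cut; the rest start crossing.
        cut += sum(-1 if in_s[u] else 1 for u in adj[v])
        phi = cut / min(vol_s, vol - vol_s)
        if phi < best[1]:
            best = (set(order[: i + 1]), phi)
    return best


# Usage: two 4-cliques joined by a single edge between vertices 3 and 4.
clique_a, clique_b = range(0, 4), range(4, 8)
adj = {v: [u for u in clique_a if u != v] for v in clique_a}
adj.update({v: [u for u in clique_b if u != v] for v in clique_b})
adj[3].append(4)
adj[4].append(3)
side, phi = sweep_cut(adj)
```

On this toy graph the sweep separates the two cliques along the single bridging edge. In the setting the abstract describes, the clusters are noisy, so a plain cut of this kind may split a true cluster, which is why the cuts are subsequently modified.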
- Alon, N., Chung, F.R.K. Explicit construction of linear sized tolerant networks. Discrete Math. 72 (1988), 15--19.
- Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci. 68, 4 (2004), 702--732.
- Braverman, V., Chestnut, S.R., Ivkin, N., Nelson, J., Wang, Z., Woodruff, D.P. BPTree: An ℓ2 heavy hitters algorithm using constant memory. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS) (2017), ACM, Chicago, IL, 361--376.
- Braverman, V., Chestnut, S.R., Ivkin, N., Woodruff, D.P. Beating CountSketch for heavy hitters in insertion streams. In Proceedings of the 48th STOC (2016), ACM, Cambridge, MA.
- Charikar, M., Chen, K., Farach-Colton, M. Finding frequent items in data streams. Theor. Comput. Sci. 312, 1 (2004), 3--15.
- Cormode, G., Hadjieleftheriou, M. Finding frequent items in data streams. PVLDB 1, 2 (2008), 1530--1541.
- Cormode, G., Muthukrishnan, S. An improved data stream summary: The count-min sketch and its applications. J. Algorithms 55, 1 (2005), 58--75.
- Gilbert, A.C., Li, Y., Porat, E., Strauss, M.J. For-all sparse recovery in near-optimal time. In Proceedings of the 41st ICALP (2014), Springer, Copenhagen, Denmark, 538--550.
- Jowhari, H., Saglam, M., Tardos, G. Tight bounds for Lp samplers, finding duplicates in streams, and related problems. In Proceedings of the 30th PODS (2011), ACM, Athens, Greece, 49--58.
- Kannan, R., Vempala, S., Vetta, A. On clusterings: Good, bad and spectral. J. ACM 51, 3 (2004), 497--515.
- Larsen, K.G., Nelson, J., Nguyễn, H.L., Thorup, M. Heavy hitters via cluster-preserving clustering. CoRR, abs/1511.01111 (2016).
- Metwally, A., Agrawal, D., El Abbadi, A. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the 10th ICDT (2005), Springer, Edinburgh, UK, 398--412.
- Misra, J., Gries, D. Finding repeated elements. Sci. Comput. Program 2, 2 (1982), 143--152.
- Orecchia, L., Sachdeva, S., Vishnoi, N.K. Approximating the exponential, the Lanczos method and an Õ(m)-time spectral algorithm for balanced separator. In Proceedings of the 44th STOC (2012), 1141--1160.
- Orecchia, L., Vishnoi, N.K. Towards an SDP-based approach to spectral methods: A nearly-linear-time algorithm for graph partitioning and decomposition. In Proceedings of the 22nd SODA (2011), SIAM, San Francisco, CA, 532--545.
- Spielman, D.A. Linear-time encodable and decodable error-correcting codes. IEEE Trans. Information Theory 42, 6 (1996), 1723--1731.