DOI: 10.1145/3219819.3219978
research-article

HeavyGuardian: Separate and Guard Hot Items in Data Streams

Published: 19 July 2018

ABSTRACT

Data stream processing is a fundamental issue in many fields, such as data mining, databases, and network traffic measurement. There are five typical tasks in data stream processing: frequency estimation, heavy hitter detection, heavy change detection, frequency distribution estimation, and entropy estimation. Different algorithms have been proposed for different tasks, but they seldom achieve high accuracy and high speed at the same time. To address this issue, we propose a novel data structure named HeavyGuardian. The key idea is to intelligently separate and guard the information of hot items while approximately recording the frequencies of cold items. We deploy HeavyGuardian on the above five typical tasks. Extensive experimental results show that HeavyGuardian achieves both much higher accuracy and higher speed than the state-of-the-art solutions for each of the five typical tasks. The source code of HeavyGuardian and other related algorithms is available on GitHub (https://github.com/Gavindeed/HeavyGuardian).
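
As a rough illustration of the key idea, the following Python sketch models a single bucket that keeps exact counts for a few guarded hot items and pushes everything else into a small array of shared counters. It is a simplified sketch under stated assumptions, not the authors' implementation: the class name `HotColdBucket`, the cell and counter sizes, the CRC32 hash, and the probabilistic decay rule with base 1.08 are all illustrative choices that the abstract does not specify.

```python
import random
from zlib import crc32


class HotColdBucket:
    """Hypothetical single bucket: exact cells for hot items, shared counters for cold items."""

    def __init__(self, hot_cells=4, cold_counters=8, decay_base=1.08, rng=None):
        self.hot = {}                      # key -> exact count, at most hot_cells entries
        self.hot_cells = hot_cells         # assumed capacity of the guarded part
        self.cold = [0] * cold_counters    # approximate counters shared by cold items
        self.decay_base = decay_base       # assumed decay base, chosen for illustration
        self.rng = rng or random.Random(0)

    def _cold_index(self, key):
        # Map a string key to one of the shared cold counters.
        return crc32(key.encode()) % len(self.cold)

    def insert(self, key):
        if key in self.hot:                # already guarded: count exactly
            self.hot[key] += 1
            return
        if len(self.hot) < self.hot_cells:  # free cell: start guarding this key
            self.hot[key] = 1
            return
        # Guarded part is full: weaken the weakest guarded item with a
        # probability that shrinks as its count grows.
        weakest = min(self.hot, key=self.hot.get)
        if self.rng.random() < self.decay_base ** (-self.hot[weakest]):
            self.hot[weakest] -= 1
            if self.hot[weakest] == 0:     # weakest item fully decayed: replace it
                del self.hot[weakest]
                self.hot[key] = 1
                return
        # Otherwise record the item approximately among the cold counters.
        self.cold[self._cold_index(key)] += 1

    def estimate(self, key):
        if key in self.hot:
            return self.hot[key]
        return self.cold[self._cold_index(key)]


bucket = HotColdBucket(hot_cells=2)
for item in ["a"] * 100 + ["b"] * 50 + ["c"]:
    bucket.insert(item)
print(bucket.estimate("a"), bucket.estimate("c"))
```

In this sketch a frequent item is hard to evict because the decay probability falls exponentially with its count, while an infrequent item only costs a shared counter increment. A full structure would hash each item to one of many such buckets.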

Supplemental Material

gong_items_in_data_streams.mp4 (MP4, 464.1 MB)

Published in

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN: 9781450355520
DOI: 10.1145/3219819

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

KDD '18 paper acceptance rate: 107 of 983 submissions, 11%
Overall acceptance rate: 1,133 of 8,635 submissions, 13%
