ABSTRACT
Data stream processing is a fundamental problem in many fields, such as data mining, databases, and network traffic measurement. There are five typical tasks in data stream processing: frequency estimation, heavy hitter detection, heavy change detection, frequency distribution estimation, and entropy estimation. Different algorithms have been proposed for different tasks, but they seldom achieve high accuracy and high speed at the same time. To address this issue, we propose a novel data structure named HeavyGuardian. The key idea is to intelligently separate and guard the information of hot items while approximately recording the frequencies of cold items. We deploy HeavyGuardian on the above five typical tasks. Extensive experimental results show that HeavyGuardian achieves both much higher accuracy and higher speed than the state-of-the-art solutions for each of the five typical tasks. The source code of HeavyGuardian and other related algorithms is available on GitHub.
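The key idea stated in the abstract (keep exact counts for hot items, approximate counts for cold items) can be illustrated with a small sketch. This is not the authors' implementation: the class name `HotColdSketch`, the probabilistic decay rule, and all parameters (`hot_cells`, `cold_counters`, `decay_base`) are assumptions chosen for illustration, since the abstract does not specify how the separation is performed.

```python
import random
import zlib


class HotColdSketch:
    """Illustrative hot/cold separation; hypothetical, not the paper's code."""

    def __init__(self, num_buckets=64, hot_cells=4, cold_counters=16,
                 decay_base=1.08, seed=0):
        self.num_buckets = num_buckets
        self.hot_cells = hot_cells
        self.cold_counters = cold_counters
        self.decay_base = decay_base  # assumed decay parameter
        self.rng = random.Random(seed)
        # Hot part: each bucket guards up to `hot_cells` items with exact counts.
        self.hot = [dict() for _ in range(num_buckets)]
        # Cold part: a few shared counters per bucket; collisions are allowed.
        self.cold = [[0] * cold_counters for _ in range(num_buckets)]

    def _hash(self, item, salt):
        return zlib.crc32(f"{salt}:{item}".encode())

    def insert(self, item):
        cells = self.hot[self._hash(item, "bucket") % self.num_buckets]
        if item in cells:                  # already guarded: exact increment
            cells[item] += 1
            return
        if len(cells) < self.hot_cells:    # free cell: start guarding the item
            cells[item] = 1
            return
        # Bucket full: decay the weakest guarded item with a probability that
        # shrinks as its count grows, so large counts are rarely disturbed.
        weakest = min(cells, key=cells.get)
        if self.rng.random() < self.decay_base ** (-cells[weakest]):
            cells[weakest] -= 1
            if cells[weakest] == 0:        # weakest item evicted, newcomer guarded
                del cells[weakest]
                cells[item] = 1
                return
        # Otherwise record the item approximately in the cold part.
        b = self._hash(item, "bucket") % self.num_buckets
        self.cold[b][self._hash(item, "cold") % self.cold_counters] += 1

    def query(self, item):
        b = self._hash(item, "bucket") % self.num_buckets
        if item in self.hot[b]:
            return self.hot[b][item]       # count kept in the hot part
        # Approximate: counters are shared, and evicted counts are lost.
        return self.cold[b][self._hash(item, "cold") % self.cold_counters]


sketch = HotColdSketch(num_buckets=1, hot_cells=2)
for _ in range(1000):
    sketch.insert("hot")
for i in range(500):
    sketch.insert(f"cold-{i}")
print(sketch.query("hot"))  # prints 1000
```

In this toy run the hot item is never the weakest entry in its bucket, so its count stays exact while the 500 one-off items compete for the remaining cell or fall into the shared cold counters.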
Supplemental Material
- The source code of HeavyGuardian and other related algorithms: https://github.com/Gavindeed/HeavyGuardian
Index Terms
- HeavyGuardian: Separate and Guard Hot Items in Data Streams