HeavyGuardian: Separate and Guard Hot Items in Data Streams

Authors:
Tong Yang, Peking University, Beijing, China
Junzhi Gong, Peking University, Beijing, China
Haowei Zhang, Peking University, Beijing, China
Lei Zou, Peking University, Beijing, China
Lei Shi, SKLCS, Institute of Software, Chinese Academy of Sciences, Beijing, China
Xiaoming Li, Peking University, Beijing, China

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, July 2018, Pages 2584–2593, https://doi.org/10.1145/3219819.3219978

Published: 19 July 2018

ABSTRACT

Data stream processing is a fundamental issue in many fields, such as data mining, databases, and network traffic measurement. There are five typical tasks in data stream processing: frequency estimation, heavy hitter detection, heavy change detection, frequency distribution estimation, and entropy estimation. Different algorithms have been proposed for different tasks, but they seldom achieve high accuracy and high speed at the same time. To address this issue, we propose a novel data structure named HeavyGuardian. The key idea is to intelligently separate and guard the information of hot items while approximately recording the frequencies of cold items. We deploy HeavyGuardian on the above five typical tasks. Extensive experimental results show that HeavyGuardian achieves both much higher accuracy and higher speed than the state-of-the-art solutions for each of the five typical tasks. The source code of HeavyGuardian and other related algorithms is available at GitHub.
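
The abstract gives only the key idea, so the following is a minimal sketch of "separate and guard" under stated assumptions, not the authors' implementation. It assumes a small exact table for hot items and a shared array of hashed counters for cold items. The class name `HotColdCounter`, the parameters `hot_slots`, `cold_counters`, and `decay_base`, and the eviction rule (probabilistically decaying the weakest guarded item) are all illustrative choices that the abstract does not specify.

```python
import random
import zlib


class HotColdCounter:
    """Illustrative hot/cold frequency counter (hypothetical design)."""

    def __init__(self, hot_slots=4, cold_counters=64, decay_base=1.08, seed=0):
        self.hot = {}                      # guarded items: key -> exact count
        self.hot_slots = hot_slots         # assumed capacity of the hot table
        self.cold = [0] * cold_counters    # shared approximate counters
        self.decay_base = decay_base       # assumed decay parameter
        self.rng = random.Random(seed)     # seeded for reproducibility

    def _cold_index(self, key):
        # Deterministic hash of the key into the cold counter array.
        return zlib.crc32(key.encode("utf-8")) % len(self.cold)

    def insert(self, key):
        if key in self.hot:
            self.hot[key] += 1             # hot item: exact update
            return
        if len(self.hot) < self.hot_slots:
            self.hot[key] = 1              # free slot: start guarding the item
            return
        # Hot table is full: decay the weakest guarded item with a
        # probability that shrinks as its count grows, so that items
        # with large counts are rarely disturbed.
        weakest = min(self.hot, key=self.hot.get)
        if self.rng.random() < self.decay_base ** (-self.hot[weakest]):
            self.hot[weakest] -= 1
            if self.hot[weakest] == 0:
                del self.hot[weakest]
                self.hot[key] = 1          # newcomer takes the freed slot
                return
        # Otherwise record the item approximately in the cold part.
        self.cold[self._cold_index(key)] += 1

    def estimate(self, key):
        if key in self.hot:
            return self.hot[key]
        return self.cold[self._cold_index(key)]


if __name__ == "__main__":
    counter = HotColdCounter(hot_slots=2)
    for item in ["a"] * 1000 + ["b"] * 500 + ["c"] * 3:
        counter.insert(item)
    print(counter.estimate("a"), counter.estimate("b"), counter.estimate("c"))
```

In this sketch, estimates for guarded items are exact for as long as they stay in the hot table, while cold items share counters and can be overestimated when their hashes collide.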

References

The source codes of HeavyGuardian and other related algorithms. https://github.com/Gavindeed/HeavyGuardian.
Shoba Venkataraman, Dawn Song, Phillip B Gibbons, and Avrim Blum. New streaming algorithms for fast detection of superspreaders. Department of Electrical and Computing Engineering, page 6, 2005.
Elisa Bertino. Introduction to data security and privacy. Data Science and Engineering, 1(3):125–126, 2016.
Minlan Yu, Lavanya Jose, and Rui Miao. Software defined traffic measurement with opensketch. In NSDI, volume 13, pages 29–42, 2013.
Ben Chen, Zhijin Lv, Xiaohui Yu, and Yang Liu. Sliding window top-k monitoring over distributed data streams. Data Science and Engineering, 2(4):289–300, 2017.
Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, pages 398–412. Springer, 2005.
Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In Proc. Springer ICDT, 2005.
Graham Cormode, Flip Korn, S Muthukrishnan, and Divesh Srivastava. Finding hierarchical heavy hitters in data streams. In Proceedings of the 29th international conference on Very large data bases-Volume 29, pages 464–475, 2003.
Ran Ben Basat, Gil Einziger, Roy Friedman, Marcelo Caggiani Luizelli, and Erez Waisbard. Constant time updates in hierarchical heavy hitters. arXiv preprint arXiv:1707.06778, 2017.
Nan Tang, Qing Chen, and Prasenjit Mitra. Graph stream summarization: From big bang to big crunch. In Proceedings of the 2016 International Conference on Management of Data, pages 1481–1496. ACM, 2016.
Graham Cormode. Sketch techniques for approximate query processing. Foundations and Trends in Databases. NOW publishers, 2011.
Pratanu Roy, Arijit Khan, and Gustavo Alonso. Augmented sketch: Faster and more accurate stream processing. In Proc. SIGMOD, 2016.
Tong Yang, Yang Zhou, Hao Jin, Shigang Chen, and Xiaoming Li. Pyramid sketch: a sketch framework for frequency estimation of data streams. Proceedings of the VLDB Endowment, 10(11):1442–1453, 2017.
Yuliang Li, Rui Miao, Changhoon Kim, and Minlan Yu. Flowradar: A better netflow for data centers. In NSDI, pages 311–324, 2016.
Cristian Estan and George Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Transactions on Computer Systems (TOCS), 21(3):270–313, 2003.
Yin Zhang, Matthew Roughan, Walter Willinger, and Lili Qiu. Spatio-temporal compressive sensing and internet traffic matrices. In ACM SIGCOMM Computer Communication Review, volume 39, pages 267–278. ACM, 2009.
Theophilus Benson, Aditya Akella, and David A Maltz. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM conference on Internet measurement, pages 267–280. ACM, 2010.
Graham Cormode, Balachander Krishnamurthy, and Walter Willinger. A manifesto for modeling and measurement in social media. First Monday, 15(9), 2010.
Dave Maltz. Unraveling the complexity of network management. 2009.
Ilker Nadi Bozkurt, Yilun Zhou, Theophilus Benson, Bilal Anwer, Dave Levin, Nick Feamster, Aditya Akella, Balakrishnan Chandrasekaran, Cheng Huang, Bruce Maggs, et al. Dynamic prioritization of traffic in home networks. 2015.
Jiecao Chen and Qin Zhang. Bias-aware sketches. Proceedings of the VLDB Endowment, 10(9):961–972, 2017.
Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Automata, languages and programming, pages 784–784, 2002.
Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 2005.
Yang Zhou, Tong Yang, Jie Jiang, Bin Cui, Minlan Yu, Xiaoming Li, and Steve Uhlig. Cold filter: A meta-framework for faster and more accurate stream processing.
Katsiaryna Mirylenka, Graham Cormode, Themis Palpanas, and Divesh Srivastava. Conditional heavy hitters: detecting interesting correlations in data streams. The VLDB Journal, 24(3):395–414, 2015.
Gobinda G Chowdhury. Introduction to modern information retrieval. Facet publishing, 2010.
Vibhaalakshmi Sivaraman, Srinivas Narayana, Ori Rottenstreich, S Muthukrishnan, and Jennifer Rexford. Heavy-hitter detection entirely in the data plane. In Proceedings of the Symposium on SDN Research, pages 164–176. ACM, 2017.
Mohamed A Soliman, Ihab F Ilyas, and Kevin Chen-Chuan Chang. Top-k query processing in uncertain databases. In IEEE 23rd International Conference on Data Engineering, pages 896–905. IEEE, 2007.
Erik Demaine, Alejandro López-Ortiz, and J Munro. Frequency estimation of internet packet streams with limited space. Algorithms-ESA 2002, 2002.
Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proc. VLDB 2002, pages 346–357.
Monika Rauch Henzinger. Algorithmic challenges in web search engines. Internet Mathematics, 1(1):115–123, 2004.
Balachander Krishnamurthy, Subhabrata Sen, and Yin Zhang. Sketch-based change detection: Methods, evaluation, and applications. In ACM SIGCOMM Internet Measurement Conference. Citeseer, 2003.
Robert Schweller, Ashish Gupta, Elliot Parsons, and Yan Chen. Reversible sketches for efficient and accurate change detection over network data streams. In Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, 2004.
Chung Chen and Lon-Mu Liu. Forecasting time series with outliers. Journal of Forecasting, 12(1):13–35, 1993.
Viswanath Poosala and Yannis E Ioannidis. Estimation of query-result distribution and its application in parallel-join load balancing. In VLDB, pages 448–459, 1996.
Shanshan Ying, Flip Korn, Barna Saha, and Divesh Srivastava. Treescope: finding structural anomalies in semi-structured data. VLDB, 2015.
Abhishek Kumar, Minho Sung, Jun Jim Xu, and Jia Wang. Data streaming algorithms for efficient and accurate estimation of flow size distribution. In Proc. ACM SIGMETRICS, pages 177–188, 2004.
Ge Luo, Lu Wang, Ke Yi, and Graham Cormode. Quantiles over data streams: experimental comparisons, new analyses, and further improvements. The VLDB Journal, 25(4):449–472, 2016.
Chun-Hung Cheng, Ada Waichee Fu, and Yi Zhang. Entropy-based subspace clustering for mining numerical data. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, 1999.
Zhetao Li, Baoming Chang, Shiguo Wang, Anfeng Liu, Fanzi Zeng, and Guangming Luo. Dynamic compressive wide-band spectrum sensing based on channel energy reconstruction in cognitive internet of things. IEEE Transactions on Industrial Informatics, 2018.
Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, and Divesh Srivastava. Truth finding on the deep web: Is the problem solved? In Proceedings of the VLDB Endowment, volume 6, pages 97–108, 2012.
Zhetao Li, Fu Xiao, Shiguo Wang, Tingrui Pei, and Jie Li. Achievable rate maximization for cognitive hybrid satellite-terrestrial networks with af-relays. IEEE Journal on Selected Areas in Communications, 36(2):304–313, 2018.
Ashwin Lall, Vyas Sekar, Mitsunori Ogihara, Jun Xu, and Hui Zhang. Data streaming algorithms for estimating entropy of network traffic. In Proc. ACM SIGMETRICS, pages 145–156, 2006.
The CAIDA anonymized internet traces 2016. http://www.caida.org/data/overview/.
Frequent itemset mining dataset repository. http://fimi.ua.ac.be/data/.
Christian Henke, Carsten Schmoll, and Tanja Zseby. Empirical evaluation of hash functions for multipoint measurements. SIGCOMM CCR., 2008.


Index Terms

HeavyGuardian: Separate and Guard Hot Items in Data Streams
1. Information systems
  1. Data management systems
    1. Data structures
    2. Database design and models
      1. Data model extensions
        Data streams
  2. Information systems applications
    1. Data mining

Published in
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN: 9781450355520
DOI: 10.1145/3219819
General Chairs: Yike Guo (Imperial College London), Faisal Farooq (IBM)
Copyright © 2018 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data stream processing
data structure
probabilistic and approximate data
Qualifiers
- research-article
Acceptance Rates
KDD '18 Paper Acceptance Rate: 107 of 983 submissions, 11%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%