research-article

Maximally informative k-itemset mining from massively distributed data streams

Authors:
Mehdi Zitouni (INRIA and University of Montpellier, France and Université de Tunis ElManar, Tunis, Tunisia)
Reza Akbarinia (INRIA and University of Montpellier, France)
Sadok Ben Yahia (Université de Tunis ElManar, Tunis, Tunisia)
Florent Masseglia (INRIA and University of Montpellier, France)
SAC '18: Proceedings of the 33rd Annual ACM Symposium on Applied Computing, April 2018, Pages 502–509
https://doi.org/10.1145/3167132.3167187

Published: 09 April 2018

ABSTRACT

We address the problem of mining maximally informative k-itemsets (miki) in data streams, where informativeness is measured by joint entropy. We propose PentroS, a highly scalable parallel miki mining algorithm that makes mining large volumes of incoming data very efficient. It is designed to take into account the continuous nature of data streams, in particular by reducing the computation needed to update the miki results when transactions arrive in or depart from the sliding window. PentroS has been extensively evaluated using massive real-world data streams. Our experimental results confirm the effectiveness of our proposal, which achieves excellent throughput even with high itemset lengths.
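To make the notion of a miki concrete, the following is a minimal sketch of the underlying measure, not of PentroS itself. It assumes the usual definition in which each transaction is projected onto an itemset as a presence/absence pattern and the joint entropy is the Shannon entropy of those patterns. The function names `joint_entropy` and `brute_force_miki` are hypothetical, and the sketch naively recomputes everything from scratch over the current window, which is exactly the cost the abstract says PentroS is designed to reduce.

```python
from collections import Counter, deque
from itertools import combinations
from math import log2


def joint_entropy(transactions, itemset):
    """Joint entropy (in bits) of the presence/absence patterns of `itemset`.

    Each transaction is projected onto the itemset as a tuple of booleans;
    the entropy is taken over the empirical distribution of those tuples.
    """
    items = sorted(itemset)
    n = len(transactions)
    if n == 0:
        return 0.0
    patterns = Counter(tuple(i in t for i in items) for t in transactions)
    return -sum((c / n) * log2(c / n) for c in patterns.values())


def brute_force_miki(transactions, k):
    """Return a k-itemset with maximal joint entropy by exhaustive search.

    Naive baseline for illustration only: it enumerates every k-combination
    of the items seen in `transactions`. Ties go to the first candidate in
    lexicographic order; an empty tuple is returned if no candidate exists.
    """
    universe = sorted(set().union(*transactions))
    return max(
        combinations(universe, k),
        key=lambda s: joint_entropy(transactions, s),
        default=(),
    )


if __name__ == "__main__":
    # Hypothetical toy stream; the window keeps only the 4 latest transactions.
    window = deque(maxlen=4)
    stream = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"c"}, {"a", "b"}]
    for transaction in stream:
        window.append(frozenset(transaction))
        best = brute_force_miki(window, k=2)
        print(sorted(transaction), "->", best, joint_entropy(window, best))
```

The exhaustive search grows combinatorially with the number of items and with k, so it only serves to define the target result on small inputs; a streaming algorithm would need to update pattern counts incrementally as transactions enter and leave the window.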

Index Terms

1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. Massively parallel algorithms
    2. Parallel programming languages

Published in
SAC '18: Proceedings of the 33rd Annual ACM Symposium on Applied Computing
April 2018
2327 pages
ISBN: 9781450351911
DOI: 10.1145/3167132
Conference Chairs:
Hisham M. Haddad (Kennesaw State University)
Roger L. Wainwright (University of Tulsa)
Richard Chbeir (University of Pau & Pays Adour, France)
Copyright © 2018 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
miki mining
distributed data streams
spark streaming
Acceptance Rates
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%