Abstract
Self-sufficient itemsets have been proposed as an effective approach to summarizing the key associations in data. However, their computation appears highly demanding, as assessing whether an itemset is self-sufficient requires consideration of all partitions of the itemset into pairs of subsets, as well as consideration of all of its supersets. This article presents the first published algorithm for efficiently discovering self-sufficient itemsets. This branch-and-bound algorithm deploys two powerful pruning mechanisms, based on upper bounds on itemset value and on statistical significance level. We demonstrate that finding the top-k productive and nonredundant itemsets, with postprocessing to identify those that are not independently productive, can efficiently identify small sets of key associations. We present an extensive evaluation of the strengths and limitations of the technique, including comparisons with alternative approaches to finding the most interesting associations.
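To make the idea of top-k branch-and-bound search with an upper bound on itemset value concrete, the following is a minimal sketch and not the article's algorithm. It assumes that the value of an itemset is its smallest leverage over all splits into two nonempty parts, and it uses a deliberately loose bound (no superset can score above the support of the current itemset). The function names `support`, `min_partition_leverage`, and `top_k_itemsets` are hypothetical, and the statistical significance pruning, the redundancy test, and the independent-productivity postprocessing are all omitted.

```python
import heapq
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def min_partition_leverage(itemset, transactions):
    """Smallest sup(I) - sup(A) * sup(B) over all splits of I into two nonempty parts.

    Assumed value measure for this sketch; call it only for itemsets of size >= 2.
    """
    items = sorted(itemset)
    sup_i = support(itemset, transactions)
    first, rest = items[0], items[1:]
    best = float("inf")
    # Keep the first item in part A so that each unordered split is visited once.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            a = frozenset((first,) + extra)
            b = itemset - a
            best = min(best, sup_i - support(a, transactions) * support(b, transactions))
    return best


def top_k_itemsets(transactions, k):
    """Return up to k itemsets with the highest positive value, best first."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    top = []  # min-heap of (value, sorted item tuple)

    def threshold():
        # Until k itemsets are held, accept any strictly positive value.
        return top[0][0] if len(top) == k else 0.0

    def expand(itemset, start):
        for i in range(start, len(items)):
            candidate = itemset | {items[i]}
            sup = support(candidate, transactions)
            # Loose bound (an assumption of this sketch): the value of the
            # candidate and of every superset is at most sup, so the whole
            # branch is skipped when sup cannot beat the current threshold.
            if sup <= threshold():
                continue
            if len(candidate) >= 2:
                value = min_partition_leverage(candidate, transactions)
                if value > threshold():
                    entry = (value, tuple(sorted(candidate)))
                    if len(top) < k:
                        heapq.heappush(top, entry)
                    else:
                        heapq.heapreplace(top, entry)
            expand(candidate, i + 1)

    expand(frozenset(), 0)
    return sorted(top, reverse=True)
```

For example, `top_k_itemsets([{"a", "b"}, {"a", "b"}, {"c"}, {"c"}], k=1)` explores the branch under `{"a"}`, records `{"a", "b"}`, and then skips every candidate whose support cannot exceed the value already found. A tighter bound would prune more of the search space; the loose one is used here only because its correctness is easy to see.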
Index Terms
- Efficient Discovery of the Most Interesting Associations