Abstract
The problem of assessing the significance of data mining results on high-dimensional 0--1 datasets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done by standard statistical tests such as chi-square, or other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are difficult to apply to sets of patterns or other complex results of data mining algorithms. In this article, we consider a simple randomization technique that deals with this shortcoming. The approach consists of producing random datasets that have the same row and column margins as the given dataset, computing the results of interest on the randomized instances and comparing them to the results on the actual data. This randomization technique can be used to assess the results of many different types of data mining algorithms, such as frequent sets, clustering, and spectral analysis. To generate random datasets with given margins, we use variations of a Markov chain approach which is based on a simple swap operation. We give theoretical results on the efficiency of different randomization methods, and apply the swap randomization method to several well-known datasets. Our results indicate that for some datasets the structure discovered by the data mining algorithms is expected, given the row and column margins of the datasets, while for other datasets the discovered structure conveys information that is not captured by the margin counts.
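The procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`swap_randomize`, `empirical_p_value`), the number of swap attempts, and the `(count + 1) / (samples + 1)` p-value estimate are assumptions made for illustration. The sketch also restarts every chain from the original data, which is a simplification and only one of the possible variations of the Markov chain approach.

```python
import random


def swap_randomize(matrix, num_attempts, rng):
    """Return a randomized copy of a 0-1 matrix with unchanged row and column sums.

    Hypothetical helper: each attempt picks two 1-entries (i, j) and (k, l);
    if (i, l) and (k, j) are both 0, the four entries are flipped.
    """
    data = [row[:] for row in matrix]  # work on a copy, leave the input intact
    ones = [(i, j) for i, row in enumerate(data)
            for j, v in enumerate(row) if v == 1]
    if len(ones) < 2:
        return data
    for _ in range(num_attempts):
        a = rng.randrange(len(ones))
        b = rng.randrange(len(ones))
        (i, j), (k, l) = ones[a], ones[b]
        # Invalid attempts (same row or column, or a target cell already 1)
        # leave the matrix unchanged.
        if i == k or j == l or data[i][l] == 1 or data[k][j] == 1:
            continue
        data[i][j] = data[k][l] = 0
        data[i][l] = data[k][j] = 1
        ones[a], ones[b] = (i, l), (k, j)
    return data


def empirical_p_value(matrix, statistic, num_samples=100,
                      num_attempts=1000, seed=0):
    """Fraction of randomized datasets whose statistic is at least the observed one.

    Assumption: larger statistic values are the "interesting" direction, and
    the +1 terms count the original dataset as one of the samples.
    """
    rng = random.Random(seed)
    observed = statistic(matrix)
    at_least = 0
    for _ in range(num_samples):
        sample = swap_randomize(matrix, num_attempts, rng)
        if statistic(sample) >= observed:
            at_least += 1
    return (at_least + 1) / (num_samples + 1)


if __name__ == "__main__":
    toy = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 1, 1, 0],
    ]

    # Example statistic: support of the itemset {column 0, column 1}.
    def pair_support(d):
        return sum(1 for row in d if row[0] and row[1])

    print(empirical_p_value(toy, pair_support))
```

A small empirical p-value would suggest that the statistic is not explained by the row and column margins alone, while a large one would suggest that the margins already account for it.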