article

Assessing data mining results via swap randomization

Authors:
Aristides Gionis

Yahoo! Research, Barcelona, Spain

Yahoo! Research, Barcelona, Spain
View Profile

,
Heikki Mannila

University of Helsinki and Helsinki University of Technology, Helsinki, Finland

University of Helsinki and Helsinki University of Technology, Helsinki, Finland
View Profile

,
Taneli Mielikäinen

Nokia Research Center, Palo Alto, CA

Nokia Research Center, Palo Alto, CA
View Profile

,
Panayiotis Tsaparas

Search Labs, Microsoft Research, Mountain View, CA

Search Labs, Microsoft Research, Mountain View, CA
View Profile

Authors Info & Claims

ACM Transactions on Knowledge Discovery from Data Volume 1 Issue 3pp 14–eshttps://doi.org/10.1145/1297332.1297338

Published:01 December 2007Publication History

ACM Transactions on Knowledge Discovery from Data

Abstract

The problem of assessing the significance of data mining results on high-dimensional 0--1 datasets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done by standard statistical tests such as chi-square, or other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are difficult to apply to sets of patterns or other complex results of data mining algorithms. In this article, we consider a simple randomization technique that deals with this shortcoming. The approach consists of producing random datasets that have the same row and column margins as the given dataset, computing the results of interest on the randomized instances and comparing them to the results on the actual data. This randomization technique can be used to assess the results of many different types of data mining algorithms, such as frequent sets, clustering, and spectral analysis. To generate random datasets with given margins, we use variations of a Markov chain approach which is based on a simple swap operation. We give theoretical results on the efficiency of different randomization methods, and apply the swap randomization method to several well-known datasets. Our results indicate that for some datasets the structure discovered by the data mining algorithms is expected, given the row and column margins of the datasets, while for other datasets the discovered structure conveys information that is not captured by the margin counts.

References

Besag, J. 2004. Markov chain Monte Carlo methods for statistical inference. http://www.ims.nus.edu.sg/Programs/mcmc/files/besag_tl.pdf.Google Scholar
Besag, J. and Clifford, P. 1989. Generalized Monte Carlo significance tests. Biometrika 76, 4, 633--642.Google ScholarCross Ref
Besag, J. and Clifford, P. 1991. Sequential Monte Carlo p-values. Biometrika 78, 2, 301--304.Google ScholarCross Ref
Bezáková, I., Bhatnagar, N., and Vigoda, E. 2006. Sampling binary contingency tables with a greedy start. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), SIAM, 414--423. Google ScholarDigital Library
Bezáková, I., Sinclair, A., Stefankovic, D., and Vigoda, E. 2006. Negative examples for sequential importance sampling of binary contingency tables. http://arxiv.org/abs/math.ST/0606650.Google Scholar
Brijs, T., Swinnen, G., Vanhoof, K., and Wets, G. 1999. Using association rules for product assortment decisions: A case study. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, 254--260. Google ScholarDigital Library
Brin, S., Motwani, R., and Silverstein, C. 1997. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, AZ, 265--276. Google ScholarDigital Library
Calders, T. 2004. Computational complexity of itemset frequency satisfiability. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 143--154. Google ScholarDigital Library
Chen, Y., Diaconis, P., Holmes, S. P., and Liu, J. S. 2005. Sequential Monte Carlo methods for statistical analysis of tables. J. Amer. Statis. Assoc. 100, 469, 109--120.Google ScholarCross Ref
Cobb, G. W. and Chen, Y.-P. 2003. An application of Markov chain Monte Carlo to community ecology. Amer. Math. Month. 110, 264--288.Google ScholarCross Ref
Diaconis, P. and Gangolli, A. 1995. Rectangular arrays with fixed margins. In Discrete Probability and Algorithms, 15--41.Google Scholar
Diaconis, P. and Saloff-Coste, L. 1995. Random walk on contingency tables with fixed row and column sums. Tech. Rep., Department of Mathematics, Harvard University.Google Scholar
DuMouchel, W. and Pregibon, D. 2001. Empirical Bayes screening for multi-item associations. In Knowledge Discovery and Data Mining, 67--76. Google ScholarDigital Library
Dyer, M. 2003. Approximate counting by dynamic programming. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, San Diego, CA, 693--699. Google ScholarDigital Library
Fortelius, M. 2006. Neogene of the old world database of fossil mammals (NOW). http://www.helsinki.fi/science/now/.Google Scholar
Good, P. 2000. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer.Google ScholarCross Ref
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1, 97--109.Google ScholarCross Ref
Jaroszewicz, S. and Simovici, D. A. 2001. A general measure of rule interestingness. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), 253--265. Google ScholarDigital Library
Kashtan, N., Itzkovitz, S., Milo, R., and Alon, U. 2004. Efficient sampling algorithm for estimating dubgraph concentrations and detecting network motifs. Bioinf. 20, 11, 1746--1758. Google ScholarDigital Library
Liu, B., Hsu, W., and Ma, Y. 1999. Pruning and summarizing the discovered associations. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, 125--134. Google ScholarDigital Library
Liu, B., Hsu, W., and Ma, Y. 2001. Identifying non-actionable association rules. In Knowledge Discovery and Data Mining, 329--334. Google ScholarDigital Library
Megiddo, N. and Srikant, R. 1998. Discovering predictive association rules. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD), New York, 274--278.Google Scholar
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087--1092.Google ScholarCross Ref
Mielikäinen, T. 2003. On inverse frequent set mining. In Proceedings of the 2nd Workshop on Privacy Preserving Data Mining (PPDM), IEEE Computer Society, 18--23.Google Scholar
Milo, R., Shen-Orr, S., Itzkovirz, S., Kashtan, N., Chklovskii, D., and Alon, U. 2002. Network motifs: Simple building blocks of complex networks. Sci. 298, 824--827.Google ScholarCross Ref
Newman, M. 2003. The structure and function of complex networks. SIAM Rev. 45, 2, 167--256.Google ScholarDigital Library
Ryser, H. J. 1957. Combinatorial properties of matrices of zeros and ones. Canadian J. Math. 9, 371--377.Google ScholarCross Ref
Sanderson, J. 2000. Testing ecological patterns. Amer. Sci. 88, 332--339.Google ScholarCross Ref
Snijders, F. 1991. Enumeration and simulation methods for 0--1 matrices with given marginals. Psychometrika 56, 397--417.Google ScholarCross Ref
Tan, P.-N., Kumar, V., and Srivastava, J. 2002. Selecting the right interestingness measure for association patterns. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, 32--41. Google ScholarDigital Library
Tomkins, A. 2006. Private communication.Google Scholar
Wang, B. Y. and Zhang, F. 1998. Precise number of (0, 1)-matrices in u(r, s). Discrete Math. 187, 211--220. Google ScholarDigital Library
Webb, G. 2006. Discovering significant rules. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 434--443. Google ScholarDigital Library
Webb, G. 2007. Discovering significant patterns. Mach. Learn., to appear. Google ScholarCross Ref
Xiong, H., Shekhar, S., Tan, P.-N., and Kumar, V. 2004. Exploiting a support-based upper bound of pearson's correlation coefficient for efficiently identifying strongly correlated pairs. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, 334--343. Google ScholarDigital Library

Index Terms

Assessing data mining results via swap randomization
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Assessing data mining results via swap randomization
KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

The problem of assessing the significance of data mining results on high-dimensional 0-1 data sets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done by, ...
Read More
Randomization methods for assessing data analysis results on real-valued matrices

Randomization is an important technique for assessing the significance of data analysis results. Given an input dataset, a randomization method samples at random from some class of datasets that share certain characteristics with the original data. The ...
Read More
Assessing the Significance of Data Mining Results on Graphs with Feature Vectors
ICDM '12: Proceedings of the 2012 IEEE 12th International Conference on Data Mining

Assessing the significance of data mining results is an important step in the knowledge discovery process. While results might appear interesting at a first glance, they can often be explained by already known characteristics of the data. Randomization ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Knowledge Discovery from Data Volume 1, Issue 3
December 2007
145 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/1297332
Issue’s Table of Contents

Copyright © 2007 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 December 2007
Published in tkdd Volume 1, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
0--1 data
Significance testing
randomization tests
swaps
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 151
  Total Citations
  View Citations
- 1,737
  Total Downloads
- Downloads (Last 12 months)24
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Assessing data mining results via swap randomization

ACM Transactions on Knowledge Discovery from Data

Abstract

References

Cited By

Index Terms

Recommendations

Assessing data mining results via swap randomization

Randomization methods for assessing data analysis results on real-valued matrices

Assessing the Significance of Data Mining Results on Graphs with Feature Vectors

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Assessing data mining results via swap randomization

ACM Transactions on Knowledge Discovery from Data

Abstract

References

Cited By

Index Terms

Recommendations

Assessing data mining results via swap randomization

Randomization methods for assessing data analysis results on real-valued matrices

Assessing the Significance of Data Mining Results on Graphs with Feature Vectors

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media