skip to main content
article

Assessing data mining results via swap randomization

Published:01 December 2007Publication History
Skip Abstract Section

Abstract

The problem of assessing the significance of data mining results on high-dimensional 0--1 datasets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done by standard statistical tests such as chi-square, or other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are difficult to apply to sets of patterns or other complex results of data mining algorithms. In this article, we consider a simple randomization technique that deals with this shortcoming. The approach consists of producing random datasets that have the same row and column margins as the given dataset, computing the results of interest on the randomized instances and comparing them to the results on the actual data. This randomization technique can be used to assess the results of many different types of data mining algorithms, such as frequent sets, clustering, and spectral analysis. To generate random datasets with given margins, we use variations of a Markov chain approach which is based on a simple swap operation. We give theoretical results on the efficiency of different randomization methods, and apply the swap randomization method to several well-known datasets. Our results indicate that for some datasets the structure discovered by the data mining algorithms is expected, given the row and column margins of the datasets, while for other datasets the discovered structure conveys information that is not captured by the margin counts.

References

  1. Besag, J. 2004. Markov chain Monte Carlo methods for statistical inference. http://www.ims.nus.edu.sg/Programs/mcmc/files/besag_tl.pdf.Google ScholarGoogle Scholar
  2. Besag, J. and Clifford, P. 1989. Generalized Monte Carlo significance tests. Biometrika 76, 4, 633--642.Google ScholarGoogle ScholarCross RefCross Ref
  3. Besag, J. and Clifford, P. 1991. Sequential Monte Carlo p-values. Biometrika 78, 2, 301--304.Google ScholarGoogle ScholarCross RefCross Ref
  4. Bezáková, I., Bhatnagar, N., and Vigoda, E. 2006. Sampling binary contingency tables with a greedy start. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), SIAM, 414--423. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Bezáková, I., Sinclair, A., Stefankovic, D., and Vigoda, E. 2006. Negative examples for sequential importance sampling of binary contingency tables. http://arxiv.org/abs/math.ST/0606650.Google ScholarGoogle Scholar
  6. Brijs, T., Swinnen, G., Vanhoof, K., and Wets, G. 1999. Using association rules for product assortment decisions: A case study. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, 254--260. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Brin, S., Motwani, R., and Silverstein, C. 1997. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, AZ, 265--276. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Calders, T. 2004. Computational complexity of itemset frequency satisfiability. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 143--154. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Chen, Y., Diaconis, P., Holmes, S. P., and Liu, J. S. 2005. Sequential Monte Carlo methods for statistical analysis of tables. J. Amer. Statis. Assoc. 100, 469, 109--120.Google ScholarGoogle ScholarCross RefCross Ref
  10. Cobb, G. W. and Chen, Y.-P. 2003. An application of Markov chain Monte Carlo to community ecology. Amer. Math. Month. 110, 264--288.Google ScholarGoogle ScholarCross RefCross Ref
  11. Diaconis, P. and Gangolli, A. 1995. Rectangular arrays with fixed margins. In Discrete Probability and Algorithms, 15--41.Google ScholarGoogle Scholar
  12. Diaconis, P. and Saloff-Coste, L. 1995. Random walk on contingency tables with fixed row and column sums. Tech. Rep., Department of Mathematics, Harvard University.Google ScholarGoogle Scholar
  13. DuMouchel, W. and Pregibon, D. 2001. Empirical Bayes screening for multi-item associations. In Knowledge Discovery and Data Mining, 67--76. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Dyer, M. 2003. Approximate counting by dynamic programming. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, San Diego, CA, 693--699. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Fortelius, M. 2006. Neogene of the old world database of fossil mammals (NOW). http://www.helsinki.fi/science/now/.Google ScholarGoogle Scholar
  16. Good, P. 2000. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer.Google ScholarGoogle ScholarCross RefCross Ref
  17. Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1, 97--109.Google ScholarGoogle ScholarCross RefCross Ref
  18. Jaroszewicz, S. and Simovici, D. A. 2001. A general measure of rule interestingness. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), 253--265. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Kashtan, N., Itzkovitz, S., Milo, R., and Alon, U. 2004. Efficient sampling algorithm for estimating dubgraph concentrations and detecting network motifs. Bioinf. 20, 11, 1746--1758. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Liu, B., Hsu, W., and Ma, Y. 1999. Pruning and summarizing the discovered associations. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, 125--134. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Liu, B., Hsu, W., and Ma, Y. 2001. Identifying non-actionable association rules. In Knowledge Discovery and Data Mining, 329--334. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Megiddo, N. and Srikant, R. 1998. Discovering predictive association rules. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD), New York, 274--278.Google ScholarGoogle Scholar
  23. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087--1092.Google ScholarGoogle ScholarCross RefCross Ref
  24. Mielikäinen, T. 2003. On inverse frequent set mining. In Proceedings of the 2nd Workshop on Privacy Preserving Data Mining (PPDM), IEEE Computer Society, 18--23.Google ScholarGoogle Scholar
  25. Milo, R., Shen-Orr, S., Itzkovirz, S., Kashtan, N., Chklovskii, D., and Alon, U. 2002. Network motifs: Simple building blocks of complex networks. Sci. 298, 824--827.Google ScholarGoogle ScholarCross RefCross Ref
  26. Newman, M. 2003. The structure and function of complex networks. SIAM Rev. 45, 2, 167--256.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Ryser, H. J. 1957. Combinatorial properties of matrices of zeros and ones. Canadian J. Math. 9, 371--377.Google ScholarGoogle ScholarCross RefCross Ref
  28. Sanderson, J. 2000. Testing ecological patterns. Amer. Sci. 88, 332--339.Google ScholarGoogle ScholarCross RefCross Ref
  29. Snijders, F. 1991. Enumeration and simulation methods for 0--1 matrices with given marginals. Psychometrika 56, 397--417.Google ScholarGoogle ScholarCross RefCross Ref
  30. Tan, P.-N., Kumar, V., and Srivastava, J. 2002. Selecting the right interestingness measure for association patterns. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, 32--41. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Tomkins, A. 2006. Private communication.Google ScholarGoogle Scholar
  32. Wang, B. Y. and Zhang, F. 1998. Precise number of (0, 1)-matrices in u(r, s). Discrete Math. 187, 211--220. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Webb, G. 2006. Discovering significant rules. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 434--443. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Webb, G. 2007. Discovering significant patterns. Mach. Learn., to appear. Google ScholarGoogle ScholarCross RefCross Ref
  35. Xiong, H., Shekhar, S., Tan, P.-N., and Kumar, V. 2004. Exploiting a support-based upper bound of pearson's correlation coefficient for efficiently identifying strongly correlated pairs. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, 334--343. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Assessing data mining results via swap randomization

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Knowledge Discovery from Data
      ACM Transactions on Knowledge Discovery from Data  Volume 1, Issue 3
      December 2007
      145 pages
      ISSN:1556-4681
      EISSN:1556-472X
      DOI:10.1145/1297332
      Issue’s Table of Contents

      Copyright © 2007 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 December 2007
      Published in tkdd Volume 1, Issue 3

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • article

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader