research-article
DOI: 10.1145/1401890.1401956

Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering

Published: 24 August 2008

ABSTRACT

Projected and subspace clustering algorithms search for clusters of points in subsets of attributes. Projected clustering computes several disjoint clusters, plus outliers, so that each cluster exists in its own subset of attributes. Subspace clustering enumerates clusters of points in all subsets of attributes, typically producing many overlapping clusters. One problem of existing approaches is that their objectives are stated in a way that is not independent of the particular algorithm proposed to detect such clusters. A second problem is the definition of cluster density based on user-defined parameters, which makes it hard to assess whether the reported clusters are an artifact of the algorithm or whether they actually stand out in the data in a statistical sense.

We propose a novel problem formulation that aims at extracting axis-parallel regions that stand out in the data in a statistical sense. The set of axis-parallel, statistically significant regions that exist in a given data set is typically highly redundant. Therefore, we formulate the problem of representing this set through a reduced, non-redundant set of axis-parallel, statistically significant regions as an optimization problem. Because exhaustive search is computationally infeasible, we propose the approximation algorithm STATPC. Our comprehensive experimental evaluation shows that STATPC significantly outperforms existing projected and subspace clustering algorithms in terms of accuracy.
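The idea of a region that "stands out in a statistical sense" can be illustrated with a minimal sketch: under a null model in which points are distributed uniformly over the data space, the number of points falling inside an axis-parallel region follows a binomial distribution, and a region is flagged when its upper-tail p-value is small. The function name, parameters, and the uniform-null binomial test below are illustrative assumptions for this page, not necessarily the exact statistic used by STATPC.

```python
from math import comb

def region_p_value(n_inside, n_total, volume_fraction):
    """Upper-tail binomial p-value: the probability that at least
    n_inside of n_total uniformly distributed points land in a region
    occupying volume_fraction of the data space.

    Illustrative sketch of a significance test for an axis-parallel
    region; not the exact statistic from the paper.
    """
    p = volume_fraction
    return sum(comb(n_total, k) * p**k * (1 - p)**(n_total - k)
               for k in range(n_inside, n_total + 1))

# A region covering 1% of the space that captures 10 of 100 points is
# far denser than uniformity predicts, so its p-value is tiny.
pv = region_p_value(10, 100, 0.01)
```

In this reading, the redundancy problem arises because any sub- or super-region of a significant region tends to be significant as well, which is why the paper asks for a reduced, non-redundant representative set.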


Published in

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008, 1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

Copyright © 2008 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

KDD '08 paper acceptance rate: 118 of 593 submissions, 20%. Overall acceptance rate: 1,133 of 8,635 submissions, 13%.
