ABSTRACT
Projected and subspace clustering algorithms search for clusters of points in subsets of attributes. Projected clustering computes several disjoint clusters, plus outliers, so that each cluster exists in its own subset of attributes. Subspace clustering enumerates clusters of points in all subsets of attributes, typically producing many overlapping clusters. A first problem with existing approaches is that their objectives are not stated independently of the particular algorithm proposed to detect such clusters. A second problem is that cluster density is defined through user-specified parameters, which makes it hard to assess whether the reported clusters are an artifact of the algorithm or whether they actually stand out in the data in a statistical sense.
We propose a novel problem formulation that aims at extracting axis-parallel regions that stand out in the data in a statistical sense. The set of axis-parallel, statistically significant regions that exist in a given data set is typically highly redundant. Therefore, we formulate the problem of representing this set through a reduced, non-redundant set of axis-parallel, statistically significant regions as an optimization problem. Because exhaustive search is computationally infeasible, we propose the approximation algorithm STATPC. Our comprehensive experimental evaluation shows that STATPC significantly outperforms existing projected and subspace clustering algorithms in terms of accuracy.
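STATPC's actual significance test, null model, and multiple-testing correction are specified in the paper; the following is only a minimal sketch of the underlying idea, namely judging whether an axis-parallel region contains significantly more points than would be expected under a simple null model. The function name `region_p_value` and the choice of a uniform null over the data's bounding box are assumptions made for illustration, not the paper's exact procedure.

```python
import math

def region_p_value(points, lower, upper, bounds):
    """Binomial-tail p-value for an axis-parallel region.

    Under a null model where n points are distributed uniformly over the
    bounding box `bounds` = [(min_j, max_j), ...], the count of points
    falling in the region [lower, upper] is Binomial(n, vol_frac), where
    vol_frac is the region's share of the total volume.  Returns the
    observed count k and P(X >= k), the probability of seeing at least
    that many points by chance.
    """
    d = len(lower)
    # Fraction of the bounding-box volume covered by the region.
    vol_frac = 1.0
    for j in range(d):
        vol_frac *= (upper[j] - lower[j]) / (bounds[j][1] - bounds[j][0])
    n = len(points)
    # Observed number of points inside the region.
    k = sum(all(lower[j] <= p[j] <= upper[j] for j in range(d))
            for p in points)
    # Exact binomial upper tail: P(X >= k) for X ~ Binomial(n, vol_frac).
    p_val = sum(math.comb(n, i) * vol_frac ** i * (1 - vol_frac) ** (n - i)
                for i in range(k, n + 1))
    return k, p_val
```

A region packed with points from a planted cluster yields a p-value many orders of magnitude smaller than an arbitrary region of comparable size, which is the sense in which a region "stands out" statistically rather than merely satisfying a user-defined density threshold.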
Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering