ABSTRACT
Projected and subspace clustering algorithms search for clusters of points in subsets of attributes. Projected clustering computes several disjoint clusters, plus outliers, so that each cluster exists in its own subset of attributes. Subspace clustering enumerates clusters of points in all subsets of attributes, typically producing many overlapping clusters. A first problem with existing approaches is that their objectives are not stated independently of the particular algorithm proposed to detect such clusters. A second problem is that cluster density is defined through user-specified parameters, which makes it hard to assess whether the reported clusters are an artifact of the algorithm or whether they actually stand out in the data in a statistical sense.
We propose a novel problem formulation that aims at extracting axis-parallel regions that stand out in the data in a statistical sense. The set of axis-parallel, statistically significant regions that exist in a given data set is typically highly redundant. Therefore, we formulate the problem of representing this set through a reduced, non-redundant set of axis-parallel, statistically significant regions as an optimization problem. Because exhaustive search is computationally infeasible, we propose the approximation algorithm STATPC. Our comprehensive experimental evaluation shows that STATPC significantly outperforms existing projected and subspace clustering algorithms in terms of accuracy.
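STATPC's actual significance test, null model, and multiple-testing correction are specified in the paper; the following is only a minimal sketch of the underlying idea, namely judging whether an axis-parallel region contains significantly more points than would be expected under a simple null model. The function name `region_p_value` and the choice of a uniform null over the data's bounding box are assumptions made for illustration, not the paper's exact procedure.

```python
import math

def region_p_value(points, lower, upper, bounds):
    """Binomial-tail p-value for an axis-parallel region.

    Under a null model where n points are distributed uniformly over the
    bounding box `bounds` = [(min_j, max_j), ...], the count of points
    falling in the region [lower, upper] is Binomial(n, vol_frac), where
    vol_frac is the region's share of the total volume.  Returns the
    observed count k and P(X >= k), the probability of seeing at least
    that many points by chance.
    """
    d = len(lower)
    # Fraction of the bounding-box volume covered by the region.
    vol_frac = 1.0
    for j in range(d):
        vol_frac *= (upper[j] - lower[j]) / (bounds[j][1] - bounds[j][0])
    n = len(points)
    # Observed number of points inside the region.
    k = sum(all(lower[j] <= p[j] <= upper[j] for j in range(d))
            for p in points)
    # Exact binomial upper tail: P(X >= k) for X ~ Binomial(n, vol_frac).
    p_val = sum(math.comb(n, i) * vol_frac ** i * (1 - vol_frac) ** (n - i)
                for i in range(k, n + 1))
    return k, p_val
```

A region packed with points from a planted cluster yields a p-value many orders of magnitude smaller than an arbitrary region of comparable size, which is the sense in which a region "stands out" statistically rather than merely satisfying a user-defined density threshold.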
Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering