Abstract
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs.This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively.We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparisons of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.
- CKS88 Peter C, heeselnan, James Kelly, Matthew Self, et al., Auto CIass : A Bayesian Classification System, Proc. of the 5th Int'l Conf. on Machine Learning, Morgan Kaufman, 3un. 1988.Google Scholar
- DH73 Richard Duds, and Peter E. Hart, Pattern Classification and Scene Analysis, Wiley, 1973.Google Scholar
- DJ80 R. Dubes, and A.K. Jain, Clustering Methodologies in Exploratory Data Analysis Advances in C, omputers, Edited by M.C. Yovits, Vol. 19, Academic Press, New York, 1980.Google Scholar
- EKX95a Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu, A Database Interface .for Clustering in Large Spatial Databases, Proc. of 1st {nt'l Conf. on Knowledge Discovery and Data Mining, 1995.Google Scholar
- EKX95b Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu, Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification, Proc. of 4th Int'l Symposium on Large Spatial Databases, Portland, Maine, U.S.A., 1995. Google ScholarDigital Library
- Fis87 Douglas H. Fisher, Knowledge Acquisition via Incremental Conceptual Clustering, Machine Learning, 2(2), 1987 Google ScholarDigital Library
- Fis95 Douglas H. Fisher, Iterative Optimization and Simplification of Hierarchical CIuaterings, Technical Report CS-95-01, Dept. of Computer Science, Vanderbilt l lniversity, Nashville, TN 37235.Google Scholar
- GG92 A. Gersho and P~. Gray, Vector quantization and signal compression, Boston, Ms.: Kluwer Academic Publishers, 1992. Google ScholarDigital Library
- KR90 Leonard Kaufman, and Peter J. Rousseeuw, Finding Groups in Data - An Introduction to Cluster Analysis, Wiley Series in Probability and Mathematical Statistics, 1990.Google Scholar
- Leb87 Michael Lebowitz, Experiments with Incremental Concept Formation : UNIMEM, Machine Learning, 1987. Google ScholarDigital Library
- Lee81 R.C.T.Lee, Clustering analysis and its applications, Advances in Information Systems Science, Edited by J .T.Toum, Vol. 8, pp. 169-292, Plenum Press, New York, 1981.Google Scholar
- Mur83 F. Murtagh, A Survey of _Recent Advances in Hier'archical Clustering Algorithms, The Computer Journal, 1933.Google Scholar
- NH94 Raymond T. Ng and Jiawei Hart, Efficient and Effective Clustering Methods for ,Spatial Data Mining, Proc. of VLDB, 1994. Google ScholarDigital Library
- Ols93 (:lark F. Olson, Parallel Algorithms for Hierarchical Clustering, Technical Report, Computer Science Division, l.lniv, of California at Berkeley, Dec.,1993. Google ScholarDigital Library
- ZRL95 Tian Zhang, Raghu Ramakrishnan, and Miron Livl~y, BIRCH: An Efficient Data Clustering Method .for Very Large Databases, Technical Report, Computer Sciences Dept., Univ. of Wisconsin-Madison, 1995.Google Scholar
Index Terms
- BIRCH: an efficient data clustering method for very large databases
Recommendations
BIRCH: an efficient data clustering method for very large databases
SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of dataFinding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work ...
MR-BIRCH: A scalable MapReduce-based BIRCH clustering algorithm
Many classical clustering algorithms have been fitted into MapReduce, which provides a novel solution for clustering big data. However, several iterations are required to reach an acceptable result in most of the algorithms. For each iteration, a new ...
BIRCH: A New Data Clustering Algorithm and Its Applications
Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing ...
Comments