BIRCH: an efficient data clustering method for very large databases

Authors:
Tian Zhang

Computer Sciences Dept., Univ. of Wisconsin-Madison

Computer Sciences Dept., Univ. of Wisconsin-Madison
View Profile

,
Raghu Ramakrishnan

Computer Sciences Dept., Univ. of Wisconsin-Madison

Computer Sciences Dept., Univ. of Wisconsin-Madison
View Profile

,
Miron Livny

Computer Sciences Dept., Univ. of Wisconsin-Madison

Computer Sciences Dept., Univ. of Wisconsin-Madison
View Profile

Authors Info & Claims

ACM SIGMOD Record Volume 25 Issue 2June 1996pp 103–114https://doi.org/10.1145/235968.233324

Published:01 June 1996Publication History

ACM SIGMOD Record

Abstract

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs.This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively.We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparisons of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.

References

CKS88 Peter C, heeselnan, James Kelly, Matthew Self, et al., Auto CIass : A Bayesian Classification System, Proc. of the 5th Int'l Conf. on Machine Learning, Morgan Kaufman, 3un. 1988.Google Scholar
DH73 Richard Duds, and Peter E. Hart, Pattern Classification and Scene Analysis, Wiley, 1973.Google Scholar
DJ80 R. Dubes, and A.K. Jain, Clustering Methodologies in Exploratory Data Analysis Advances in C, omputers, Edited by M.C. Yovits, Vol. 19, Academic Press, New York, 1980.Google Scholar
EKX95a Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu, A Database Interface .for Clustering in Large Spatial Databases, Proc. of 1st {nt'l Conf. on Knowledge Discovery and Data Mining, 1995.Google Scholar
EKX95b Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu, Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification, Proc. of 4th Int'l Symposium on Large Spatial Databases, Portland, Maine, U.S.A., 1995. Google ScholarDigital Library
Fis87 Douglas H. Fisher, Knowledge Acquisition via Incremental Conceptual Clustering, Machine Learning, 2(2), 1987 Google ScholarDigital Library
Fis95 Douglas H. Fisher, Iterative Optimization and Simplification of Hierarchical CIuaterings, Technical Report CS-95-01, Dept. of Computer Science, Vanderbilt l lniversity, Nashville, TN 37235.Google Scholar
GG92 A. Gersho and P~. Gray, Vector quantization and signal compression, Boston, Ms.: Kluwer Academic Publishers, 1992. Google ScholarDigital Library
KR90 Leonard Kaufman, and Peter J. Rousseeuw, Finding Groups in Data - An Introduction to Cluster Analysis, Wiley Series in Probability and Mathematical Statistics, 1990.Google Scholar
Leb87 Michael Lebowitz, Experiments with Incremental Concept Formation : UNIMEM, Machine Learning, 1987. Google ScholarDigital Library
Lee81 R.C.T.Lee, Clustering analysis and its applications, Advances in Information Systems Science, Edited by J .T.Toum, Vol. 8, pp. 169-292, Plenum Press, New York, 1981.Google Scholar
Mur83 F. Murtagh, A Survey of _Recent Advances in Hier'archical Clustering Algorithms, The Computer Journal, 1933.Google Scholar
NH94 Raymond T. Ng and Jiawei Hart, Efficient and Effective Clustering Methods for ,Spatial Data Mining, Proc. of VLDB, 1994. Google ScholarDigital Library
Ols93 (:lark F. Olson, Parallel Algorithms for Hierarchical Clustering, Technical Report, Computer Science Division, l.lniv, of California at Berkeley, Dec.,1993. Google ScholarDigital Library
ZRL95 Tian Zhang, Raghu Ramakrishnan, and Miron Livl~y, BIRCH: An Efficient Data Clustering Method .for Very Large Databases, Technical Report, Computer Sciences Dept., Univ. of Wisconsin-Madison, 1995.Google Scholar

Index Terms

BIRCH: an efficient data clustering method for very large databases
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Recommendations

BIRCH: an efficient data clustering method for very large databases
SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work ...
Read More
MR-BIRCH: A scalable MapReduce-based BIRCH clustering algorithm

Many classical clustering algorithms have been fitted into MapReduce, which provides a novel solution for clustering big data. However, several iterations are required to reach an acceptable result in most of the algorithms. For each iteration, a new ...
Read More
BIRCH: A New Data Clustering Algorithm and Its Applications

Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM SIGMOD Record Volume 25, Issue 2
June 1996
557 pages
ISSN:0163-5808
DOI:10.1145/235968
Chairman:
T. H. Merrett
McGill Univ.
,
Editors:
H. V. Jagadish,
Inderpal Singh Mumick
Issue’s Table of Contents
SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data
June 1996
560 pages
ISBN:0897917944
DOI:10.1145/233269
Editor:
Jennifer Widom
Stanford Univ., Stanford, CT
Copyright © 1996 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 June 1996
Check for updates
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 3,120
  Total Citations
  View Citations
- 17,015
  Total Downloads
- Downloads (Last 12 months)2,279
- Downloads (Last 6 weeks)385
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

BIRCH: an efficient data clustering method for very large databases

ACM SIGMOD Record

Abstract

References

Cited By

Index Terms

Recommendations

BIRCH: an efficient data clustering method for very large databases

MR-BIRCH: A scalable MapReduce-based BIRCH clustering algorithm

BIRCH: A New Data Clustering Algorithm and Its Applications