
Why so many clustering algorithms: a position paper

Published: 01 June 2002

Abstract

We argue that there are many clustering algorithms because the notion of "cluster" cannot be precisely defined. Clustering is in the eye of the beholder, and as such, researchers have proposed many induction principles and models whose corresponding optimization problems can only be approximately solved by an even larger number of algorithms. Therefore, comparing clustering algorithms must be grounded in a careful understanding of the induction principles involved.
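To make the argument concrete, the sketch below (not taken from the paper; the data, parameter values, and the function name lloyd_kmeans are illustrative assumptions) shows one well-known induction principle, the k-means objective of minimizing total within-cluster squared distance, together with one of the many heuristics that only approximately optimizes it, Lloyd's iteration. The answer returned depends on the initialization, which is one reason many different algorithms coexist for the same underlying model.

    # Illustrative sketch (not from the paper): k-means as one "induction principle".
    # The model: partition points into k groups minimizing total within-cluster
    # squared distance to the group mean. Finding the exact optimum is NP-hard in
    # general, so Lloyd's iteration below only reaches a local optimum that
    # depends on the (here, seeded) random initialization.
    import random

    def lloyd_kmeans(points, k, iterations=100, seed=0):
        """Approximately minimize the k-means objective on a list of 2-D points."""
        rng = random.Random(seed)
        centres = rng.sample(points, k)  # arbitrary starting guess
        for _ in range(iterations):
            # Assignment step: attach each point to its nearest centre.
            clusters = [[] for _ in range(k)]
            for x, y in points:
                j = min(range(k),
                        key=lambda i: (x - centres[i][0]) ** 2 + (y - centres[i][1]) ** 2)
                clusters[j].append((x, y))
            # Update step: move each centre to the mean of its cluster.
            for i, members in enumerate(clusters):
                if members:
                    centres[i] = (sum(p[0] for p in members) / len(members),
                                  sum(p[1] for p in members) / len(members))
        return centres

    # Two obvious groups; on harder inputs a different seed (initialization)
    # can yield a different, merely locally optimal answer.
    data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
    print(lloyd_kmeans(data, k=2))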

