
Why so many clustering algorithms: a position paper

Published: 01 June 2002

Abstract

We argue that there are many clustering algorithms because the notion of "cluster" cannot be precisely defined. Clustering is in the eye of the beholder, and as such, researchers have proposed many induction principles and models whose corresponding optimization problems can only be approximately solved by an even larger number of algorithms. Therefore, comparing clustering algorithms must be grounded in a careful understanding of the induction principles involved.
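To make the argument concrete, the sketch below (not taken from the paper; the data, parameter values, and the function name lloyd_kmeans are illustrative assumptions) shows one well-known induction principle, the k-means objective of minimizing total within-cluster squared distance, together with one of the many heuristics that only approximately optimizes it, Lloyd's iteration. The answer returned depends on the initialization, which is one reason many different algorithms coexist for the same underlying model.

    # Illustrative sketch (not from the paper): k-means as one "induction principle".
    # The model: partition points into k groups minimizing total within-cluster
    # squared distance to the group mean. Finding the exact optimum is NP-hard in
    # general, so Lloyd's iteration below only reaches a local optimum that
    # depends on the (here, seeded) random initialization.
    import random

    def lloyd_kmeans(points, k, iterations=100, seed=0):
        """Approximately minimize the k-means objective on a list of 2-D points."""
        rng = random.Random(seed)
        centres = rng.sample(points, k)  # arbitrary starting guess
        for _ in range(iterations):
            # Assignment step: attach each point to its nearest centre.
            clusters = [[] for _ in range(k)]
            for x, y in points:
                j = min(range(k),
                        key=lambda i: (x - centres[i][0]) ** 2 + (y - centres[i][1]) ** 2)
                clusters[j].append((x, y))
            # Update step: move each centre to the mean of its cluster.
            for i, members in enumerate(clusters):
                if members:
                    centres[i] = (sum(p[0] for p in members) / len(members),
                                  sum(p[1] for p in members) / len(members))
        return centres

    # Two obvious groups; on harder inputs a different seed (initialization)
    # can yield a different, merely locally optimal answer.
    data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
    print(lloyd_kmeans(data, k=2))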

