ABSTRACT
Document clustering is useful in many information retrieval tasks: document browsing, organization and viewing of retrieval results, generation of Yahoo-like hierarchies of documents, etc. The general goal of clustering is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that is shown to produce higher quality clusters in document clustering tasks as compared to several well known clustering algorithms. It initially discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is but a subset of all elements. The algorithm proceeds by assigning elements to their most similar committee. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology that is based on the editing distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.
- Buckley, C. and Lewit, A. F. 1985. Optimization of inverted vector searches. In Proceedings of SIGIR-85. pp. 97--110. Google ScholarDigital Library
- Church, K. and Hanks, P. 1989. Word association norms, mutual information, and lexicography. In Proceedings of ACL-89. pp. 76--83. Vancouver, Canada. Google ScholarDigital Library
- Cutting, D. R.; Karger, D.; Pedersen, J.; and Tukey, J. W. 1992. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of SIGIR-92. pp. 318--329. Copenhagen, Denmark. Google ScholarDigital Library
- Guha, S.; Rastogi, R.; and Kyuseok, S. 1999. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of ICDE'99. pp. 512--521. Sydney, Australia. Google ScholarDigital Library
- Han, J. and Kamber, M. 2001. Data Mining - Concepts and Techniques. Morgan Kaufmann. Google ScholarDigital Library
- Hearst, M. A. and Pedersen, J. O. 1996. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of SIGIR-96. pp. 76-84. Zurich, Switzerland. Google ScholarDigital Library
- Jain, A. K.; Murty, M. N.; and Flynn, P. J. 1999. Data Clustering: A Review. ACM Computing Surveys 31(3):264--323. Google ScholarDigital Library
- Jardine, N. and van Rijsbergen, C. J. 1971. The use of hierarchical clustering in information retrieval. Information Storage and Retreival, 7:217--240.Google ScholarCross Ref
- Karypis, G.; Han, E.-H.; and Kumar, V. 1999. Chameleon: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer: Special Issue on Data Analysis and Mining 32(8):68--75. Google ScholarDigital Library
- Kaufmann, L. and Rousseeuw, P. J. 1987. Clustering by means of medoids. In Dodge, Y. (Ed.) Statistical Data Analysis based on the L1 Norm. pp. 405-416. Elsevier/North Holland, Amsterdam.Google Scholar
- King, B. 1967. Step-wise clustering procedures. Journal of the American Statistical Association, 69:86--101.Google ScholarCross Ref
- Koller, D. and Sahami, M. 1997. Hierarchically classifying documents using very few words. In Proceedings of ICML-97. pp. 170--176. Nashville, TN. Google ScholarDigital Library
- McQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematics, Statistics and Probability, 1:281--298.Google Scholar
- Porter, M. F. 1980. An algorithm for suffix stripping. In Proceedings of SIGIR-80. pp. 318--327.Google ScholarCross Ref
- Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw Hill. Google ScholarDigital Library
- Sneath, P. H. A. and Sokal, R. R. 1973. Numerical Taxonomy: The Principles and Practice of Numerical Classification. Freeman. London, UK.Google Scholar
- Steinbach, M.; Karypis, G.; and Kumar, V. 2000. A comparison of document clustering techniques. Technical Report #00--034. Department of Computer Science and Engineering, University of Minnesota.Google Scholar
- van Rijsbergen, C. J. 1979. Information Retrieval, second edition. London: Buttersworth. Available at: http://www.dcs.gla.ac.uk/Keith/Preface.html Google ScholarDigital Library
- Wagstaff, K. and Cardie, C. 2000. Clustering with instance-level constraints. In Proceedings of ICML-2000. pp. 1103--1110. Palo Alto, CA. Google ScholarDigital Library
Index Terms
- Document clustering with committees
Recommendations
Hybrid Bisect K-Means Clustering Algorithm
BCGIN '11: Proceedings of the 2011 International Conference on Business Computing and Global InformatizationIn this paper, we present a hybrid clustering algorithm that combines divisive and agglomerative hierarchical clustering algorithm. Our method uses bisect K-means for divisive clustering algorithm and Unweighted Pair Group Method with Arithmetic Mean (...
Text document clustering based on neighbors
Clustering is a very powerful data mining technique for topic discovery from text documents. The partitional clustering algorithms, such as the family of k-means, are reported performing well on document clustering. They treat the clustering problem as ...
Efficient stochastic algorithms for document clustering
Clustering has become an increasingly important and highly complicated research area for targeting useful and relevant information in modern application domains such as the World Wide Web. Recent studies have shown that the most commonly used ...
Comments