skip to main content
10.1145/564376.564412acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
Article

Document clustering with committees

Published:11 August 2002Publication History

ABSTRACT

Document clustering is useful in many information retrieval tasks: document browsing, organization and viewing of retrieval results, generation of Yahoo-like hierarchies of documents, etc. The general goal of clustering is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that is shown to produce higher quality clusters in document clustering tasks as compared to several well known clustering algorithms. It initially discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is but a subset of all elements. The algorithm proceeds by assigning elements to their most similar committee. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology that is based on the editing distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.

References

  1. Buckley, C. and Lewit, A. F. 1985. Optimization of inverted vector searches. In Proceedings of SIGIR-85. pp. 97--110. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Church, K. and Hanks, P. 1989. Word association norms, mutual information, and lexicography. In Proceedings of ACL-89. pp. 76--83. Vancouver, Canada. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Cutting, D. R.; Karger, D.; Pedersen, J.; and Tukey, J. W. 1992. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of SIGIR-92. pp. 318--329. Copenhagen, Denmark. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Guha, S.; Rastogi, R.; and Kyuseok, S. 1999. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of ICDE'99. pp. 512--521. Sydney, Australia. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Han, J. and Kamber, M. 2001. Data Mining - Concepts and Techniques. Morgan Kaufmann. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Hearst, M. A. and Pedersen, J. O. 1996. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of SIGIR-96. pp. 76-84. Zurich, Switzerland. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Jain, A. K.; Murty, M. N.; and Flynn, P. J. 1999. Data Clustering: A Review. ACM Computing Surveys 31(3):264--323. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Jardine, N. and van Rijsbergen, C. J. 1971. The use of hierarchical clustering in information retrieval. Information Storage and Retreival, 7:217--240.Google ScholarGoogle ScholarCross RefCross Ref
  9. Karypis, G.; Han, E.-H.; and Kumar, V. 1999. Chameleon: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer: Special Issue on Data Analysis and Mining 32(8):68--75. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Kaufmann, L. and Rousseeuw, P. J. 1987. Clustering by means of medoids. In Dodge, Y. (Ed.) Statistical Data Analysis based on the L1 Norm. pp. 405-416. Elsevier/North Holland, Amsterdam.Google ScholarGoogle Scholar
  11. King, B. 1967. Step-wise clustering procedures. Journal of the American Statistical Association, 69:86--101.Google ScholarGoogle ScholarCross RefCross Ref
  12. Koller, D. and Sahami, M. 1997. Hierarchically classifying documents using very few words. In Proceedings of ICML-97. pp. 170--176. Nashville, TN. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. McQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematics, Statistics and Probability, 1:281--298.Google ScholarGoogle Scholar
  14. Porter, M. F. 1980. An algorithm for suffix stripping. In Proceedings of SIGIR-80. pp. 318--327.Google ScholarGoogle ScholarCross RefCross Ref
  15. Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw Hill. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Sneath, P. H. A. and Sokal, R. R. 1973. Numerical Taxonomy: The Principles and Practice of Numerical Classification. Freeman. London, UK.Google ScholarGoogle Scholar
  17. Steinbach, M.; Karypis, G.; and Kumar, V. 2000. A comparison of document clustering techniques. Technical Report #00--034. Department of Computer Science and Engineering, University of Minnesota.Google ScholarGoogle Scholar
  18. van Rijsbergen, C. J. 1979. Information Retrieval, second edition. London: Buttersworth. Available at: http://www.dcs.gla.ac.uk/Keith/Preface.html Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Wagstaff, K. and Cardie, C. 2000. Clustering with instance-level constraints. In Proceedings of ICML-2000. pp. 1103--1110. Palo Alto, CA. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Document clustering with committees

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
        August 2002
        478 pages
        ISBN:1581135610
        DOI:10.1145/564376

        Copyright © 2002 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 11 August 2002

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • Article

        Acceptance Rates

        SIGIR '02 Paper Acceptance Rate44of219submissions,20%Overall Acceptance Rate792of3,983submissions,20%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader