Document clustering with committees

Authors:
Patrick Pantel, University of Alberta, Edmonton, Alberta, Canada
Dekang Lin, University of Alberta, Edmonton, Alberta, Canada

SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, August 2002, Pages 199–206
https://doi.org/10.1145/564376.564412

Published: 11 August 2002

ABSTRACT

Document clustering is useful in many information retrieval tasks: document browsing, organization and viewing of retrieval results, generation of Yahoo-like hierarchies of documents, etc. The general goal of clustering is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that is shown to produce higher-quality clusters in document clustering tasks than several well-known clustering algorithms. It first discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is only a subset of all elements. The algorithm then assigns each element to its most similar committee. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology based on the editing distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.
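
The assignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how similarity is computed, so cosine similarity between a document and the centroid of a committee is an assumption here, as are the hypothetical names `cosine`, `centroid`, and `assign_to_committees` and the toy data. Committee discovery is not shown; the committees are taken as given.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts (assumed measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Average a non-empty list of sparse vectors term by term."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def assign_to_committees(docs, committees):
    """Map each document id to the id of its most similar committee.

    docs: {doc_id: sparse vector}
    committees: {committee_id: [doc_id, ...]}, assumed to be already discovered.
    """
    centroids = {
        cid: centroid([docs[d] for d in members])
        for cid, members in committees.items()
    }
    return {
        doc_id: max(centroids, key=lambda cid: cosine(vec, centroids[cid]))
        for doc_id, vec in docs.items()
    }


# Toy data (hypothetical): two committees cover only four of the six documents.
docs = {
    "d1": {"hockey": 2.0, "goal": 1.0},
    "d2": {"hockey": 1.0, "puck": 1.0},
    "d3": {"stock": 2.0, "market": 1.0},
    "d4": {"market": 1.0, "trade": 1.0},
    "d5": {"goal": 1.0, "puck": 1.0},
    "d6": {"trade": 2.0, "stock": 1.0},
}
committees = {"sports": ["d1", "d2"], "finance": ["d3", "d4"]}
assignment = assign_to_committees(docs, committees)
```

In this toy run the committees contain only `d1` to `d4`, mirroring the point that the union of the committees is a subset of all elements; `d5` and `d6` are placed afterwards by similarity to the committee centroids.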

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
August 2002
478 pages
ISBN: 1581135610
DOI: 10.1145/564376
General Chair: Kalervo Järvelin, University of Tampere, Finland
Program Chairs: Micheline Beaulieu, University of Sheffield, UK; Ricardo Baeza-Yates, University of Chile, Chile; Sung Hyon Myaeng, Chungnam National University, Korea
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
document clustering
document representation
evaluation methodology
machine learning
Qualifiers
- Article
Conference

Acceptance Rates
SIGIR '02 Paper Acceptance Rate: 44 of 219 submissions, 20%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
Article Metrics
- Total Citations: 89
- Total Downloads: 1,453
- Downloads (Last 12 months): 13
- Downloads (Last 6 weeks): 0