Abstract
The high dimensionality of text can be a deterrent to applying complex learners such as Support Vector Machines to the task of text classification. Feature clustering is a powerful alternative to feature selection for reducing the dimensionality of text data. In this paper we propose a new information-theoretic divisive algorithm for feature/word clustering and apply it to text classification. Existing techniques for such "distributional clustering" of words are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. To explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". Compared to the previously proposed agglomerative strategies, our divisive algorithm is much faster and achieves comparable or higher classification accuracies. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from the Open Directory Project (www.dmoz.org).
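The divisive algorithm outlined in the abstract can be sketched as a k-means-style loop over word distributions: each word w is represented by its class-conditional distribution p(C|w), assigned to the cluster whose prior-weighted mean distribution is nearest in KL divergence, and the cluster means are then recomputed. The following is a minimal illustrative sketch under assumed inputs (a row-stochastic matrix `P` of p(C|w) rows and a vector of word priors), not the authors' reference implementation; the function name `divisive_it_clustering` and the re-seeding of empty clusters are hypothetical choices.

```python
import numpy as np

def kl(p, q):
    # KL divergence D(p || q), with the convention 0 * log 0 = 0.
    # q is assumed strictly positive (enforced by clipping below).
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def divisive_it_clustering(P, priors, k, n_iter=50, seed=0):
    """K-means-style divisive clustering of word distributions.

    P      : (n_words, n_classes) array; row i is p(C | w_i)
    priors : (n_words,) array of word priors pi(w_i)
    Returns an array of cluster labels, one per word.
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    labels = rng.integers(0, k, size=n)
    for _ in range(n_iter):
        # Update step: each cluster mean is the prior-weighted
        # average of its member distributions.
        means = np.zeros((k, P.shape[1]))
        for j in range(k):
            idx = labels == j
            if idx.any():
                w = priors[idx] / priors[idx].sum()
                means[j] = w @ P[idx]
            else:
                # Hypothetical fix-up: re-seed an empty cluster.
                means[j] = P[rng.integers(n)]
        means = np.clip(means, 1e-12, None)  # avoid log(0) in KL
        # Assignment step: move each word to the cluster whose
        # mean is closest in KL divergence.
        new = np.array([np.argmin([kl(P[i], means[j]) for j in range(k)])
                        for i in range(n)])
        if np.array_equal(new, labels):
            break  # converged; the objective decreases monotonically
        labels = new
    return labels
```

On a toy example with two well-separated groups of distributions, e.g. rows `[0.9, 0.1]`, `[0.85, 0.15]` versus `[0.1, 0.9]`, `[0.15, 0.85]` with uniform priors and `k=2`, the loop separates the two groups; reassigning by KL distance to the prior-weighted mean is what drives the prior-weighted within-cluster Jensen-Shannon objective down at every step.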