A divisive information theoretic feature clustering algorithm for text classification

Published: 1 March 2003, in The Journal of Machine Learning Research, Volume 3 (JMLR.org; ISSN 1532-4435, EISSN 1533-7928).

Abstract

High dimensionality of text can be a deterrent in applying complex learners such as Support Vector Machines to the task of text classification. Feature clustering is a powerful alternative to feature selection for reducing the dimensionality of text data. In this paper we propose a new information-theoretic divisive algorithm for feature/word clustering and apply it to text classification. Existing techniques for such "distributional clustering" of words are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". In comparison to the previously proposed agglomerative strategies, our divisive algorithm is much faster and achieves comparable or higher classification accuracies. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20Newsgroups data set and a 3-level hierarchy of HTML documents collected from the Open Directory project (www.dmoz.org).
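As a reading aid for the terminology above: the generalized Jensen-Shannon divergence among probability distributions p_1, ..., p_n with weights π_i ≥ 0, Σ_i π_i = 1, has a standard definition, and the "within-cluster" quantity the abstract refers to plausibly takes the weighted-KL form sketched below. The notation here (Q, W_j, π_t) is our reconstruction from the abstract's description, not a quotation of the paper's equations.

```latex
% Generalized Jensen-Shannon divergence (standard definition) among
% distributions p_1, ..., p_n with weights \pi_1, ..., \pi_n:
\mathrm{JS}_{\pi}(p_1,\dots,p_n)
  = H\Big(\sum_{i=1}^{n} \pi_i p_i\Big) - \sum_{i=1}^{n} \pi_i H(p_i)
  = \sum_{i=1}^{n} \pi_i \,\mathrm{KL}\Big(p_i \,\Big\|\, \sum_{j=1}^{n} \pi_j p_j\Big)

% A within-cluster objective of the kind the abstract describes, for word
% clusters W_1, ..., W_k, with each word w_t represented by its class
% distribution p(C | w_t) and prior \pi_t = p(w_t) (our notation):
Q(W_1,\dots,W_k)
  = \sum_{j=1}^{k} \sum_{w_t \in W_j} \pi_t \,
    \mathrm{KL}\big(p(C \mid w_t) \,\big\|\, p(C \mid W_j)\big),
\quad
p(C \mid W_j) = \sum_{w_t \in W_j} \frac{\pi_t}{\sum_{w_s \in W_j} \pi_s}\, p(C \mid w_t)
```

Here H denotes Shannon entropy and KL the Kullback-Leibler divergence; the second equality says the Jensen-Shannon divergence is the weighted average KL divergence of the distributions to their weighted mean.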
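The divisive algorithm itself appears in the paper body, not the abstract. Purely as an illustration of the k-means-style alternation that such a divisive scheme implies, here is a minimal, self-contained sketch: words are represented by their class-conditional distributions with prior weights, and the loop alternates prior-weighted mean computation with KL-nearest reassignment. Every name and implementation choice (kl_divergence, divisive_word_clustering, random initialization, the convergence test) is an assumption of ours, not the paper's pseudocode.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, with clipping to avoid log(0)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return np.sum(p * np.log(p / q))

def divisive_word_clustering(pcw, priors, k, n_iters=50, seed=0):
    """Cluster words by their class-conditional distributions (hypothetical sketch).

    pcw    : (n_words, n_classes) array; row t is p(C | w_t).
    priors : (n_words,) array of word priors p(w_t).
    k      : number of word clusters.
    Returns an (n_words,) array of cluster labels.
    """
    rng = np.random.default_rng(seed)
    n_words = pcw.shape[0]
    labels = rng.integers(0, k, size=n_words)      # random initial partition

    for _ in range(n_iters):
        # Step 1: each cluster mean is the prior-weighted average of its rows.
        means = np.empty((k, pcw.shape[1]))
        for j in range(k):
            mask = labels == j
            if not mask.any():                     # re-seed an empty cluster
                mask = np.zeros(n_words, dtype=bool)
                mask[rng.integers(n_words)] = True
            w = priors[mask] / priors[mask].sum()
            means[j] = w @ pcw[mask]

        # Step 2: reassign each word to the nearest mean in KL divergence.
        new_labels = np.array([
            np.argmin([kl_divergence(pcw[t], means[j]) for j in range(k)])
            for t in range(n_words)
        ])
        if np.array_equal(new_labels, labels):     # converged: partition is stable
            break
        labels = new_labels
    return labels
```

Both steps can only decrease the prior-weighted KL objective sketched earlier (the weighted mean is the minimizer of the weighted KL to a common distribution), which is the monotone-decrease property the abstract claims for the divisive algorithm; an agglomerative method, by contrast, must evaluate pairwise merges, which is what drives up its cost.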

