DOI: 10.1145/1150402.1150455

Extracting key-substring-group features for text classification

Published: 20 August 2006

ABSTRACT

In many text classification applications, it is appealing to treat every document as a string of characters rather than as a bag of words. Previous studies in this area have mostly focused on variants of generative Markov chain models. Although discriminative machine learning methods such as the Support Vector Machine (SVM) have been quite successful in text classification with word features, applying them straightforwardly with all substrings in the corpus as features is neither effective nor efficient. In this paper, we propose to partition all substrings into statistical equivalence groups and then select the statistically important groups as features (named key-substring-group features) for text classification. In particular, we propose a suffix-tree-based algorithm that extracts such features in linear time with respect to the total number of characters in the corpus. Our experiments on English, Chinese and Greek datasets show that SVM with key-substring-group features achieves outstanding performance on various text classification tasks.
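To make the grouping idea concrete, below is a minimal sketch in Python. It is not the paper's linear-time suffix-tree algorithm: substrings are enumerated by brute force (capped at max_len characters for tractability), substrings that occur at exactly the same (document, position) slots are grouped together (roughly what grouping substrings by their suffix-tree locus amounts to), and a group is kept when its document frequency and total frequency pass the thresholds min_df and min_freq. The function names, thresholds and the length cap are illustrative assumptions, not the paper's selection criteria.

from collections import defaultdict

def substring_groups(docs, max_len=8):
    """Map each substring (capped at max_len characters) to its occurrence set."""
    occurrences = defaultdict(set)  # substring -> {(doc_id, start_position), ...}
    for d, text in enumerate(docs):
        for i in range(len(text)):
            for j in range(i + 1, min(i + max_len, len(text)) + 1):
                occurrences[text[i:j]].add((d, i))
    # Substrings with identical occurrence sets carry identical statistics,
    # so one representative per group suffices; keep the longest member.
    groups = defaultdict(list)  # occurrence set -> member substrings
    for sub, occ in occurrences.items():
        groups[frozenset(occ)].append(sub)
    return {max(subs, key=len): occ for occ, subs in groups.items()}

def key_substring_groups(docs, min_df=2, min_freq=3, max_len=8):
    """Keep group representatives whose occurrence statistics pass the thresholds."""
    selected = []
    for rep, occ in substring_groups(docs, max_len).items():
        doc_freq = len({d for d, _ in occ})  # number of documents containing the group
        if doc_freq >= min_df and len(occ) >= min_freq:
            selected.append(rep)
    return selected

if __name__ == "__main__":
    corpus = ["the cat sat on the mat",
              "the cat ate the rat",
              "a dog sat on a log"]
    for feature in sorted(key_substring_groups(corpus), key=len, reverse=True):
        print(repr(feature))

The selected representatives would then serve as features (for example, with term-frequency or tf-idf weights) in a linear SVM, in place of the word features a bag-of-words model would use.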


Published in

KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2006, 986 pages
ISBN: 1595933395
DOI: 10.1145/1150402
Copyright © 2006 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
