ABSTRACT
In many text classification applications, it is appealing to take every document as a string of characters rather than a bag of words. Previous research studies in this area mostly focused on different variants of generative Markov chain models. Although discriminative machine learning methods like Support Vector Machine (SVM) have been quite successful in text classification with word features, it is neither effective nor efficient to apply them straightforwardly taking all substrings in the corpus as features. In this paper, we propose to partition all substrings into statistical equivalence groups, and then pick those groups which are important (in the statistical sense) as features (named key-substring-group features) for text classification. In particular, we propose a suffix tree based algorithm that can extract such features in linear time (with respect to the total number of characters in the corpus). Our experiments on English, Chinese and Greek datasets show that SVM with key-substring-group features can achieve outstanding performance for various text classification tasks.
- A. N. Aizawa. Linguistic techniques to improve the performance of automatic text categorization. In Proceedings of the 6th Natural Language Processing Pacific Rim Symposim (NLPRS), pages 307--314, Tokyo, Japan, 2001.]]Google Scholar
- R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.]] Google ScholarDigital Library
- T. C. Bell, J. G. Cleary, and I. H. Witten. Text Compression. Prentice-Hall, 1990.]] Google ScholarDigital Library
- M. A. Bender and M. Farach-Colton. The LCA problem revisited. In Proceedings of the 4th Latin American Symposium on Theoretical Informatics (LATIN), pages 88--94, 2000.]] Google ScholarDigital Library
- A. Bratko and B. Filipic. Spam filtering using compression models. Technical Report IJS-DP-9227, Department of Intelligent Systems, Jozef Stefan Institute, Ljubljana, Slovenia, 2005.]]Google Scholar
- N. Cancedda, E. Gaussier, C. Goutte, and J.-M. Renders. Word-sequence kernels. Journal of Machine Learning Research, 3(Feb):1059--1082, 2003.]] Google ScholarDigital Library
- C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.]]Google Scholar
- C.-H. Chang and S.-C. Lui. IEPAD: Information extraction based on pattern discovery. In Proceedings of the 10th International Conference on World Wide Web (WWW), pages 681--688, Hong Kong, 2001.]] Google ScholarDigital Library
- W. I. Chang and E. L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327--344, 1994.]]Google ScholarCross Ref
- S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics (ACL), pages 310--318, Morristown, NJ, USA, 1996. Association for Computational Linguistics.]] Google ScholarDigital Library
- L.-F. Chien. PAT-tree-based keyword extraction for Chinese information retrieval. In Proceedings of the 20th Annual ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 50--58, Philadelphia, PA, 1997.]] Google ScholarDigital Library
- K. W. Church and P. Hanks. Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL), pages 76--83, Vancouver, BC, Canada, 1989.]] Google ScholarDigital Library
- J. G. Cleary and W. J. Teahan. Unbounded length contexts for PPM. The Comput Journal, 40(2-3):67--75, 1997.]]Google ScholarCross Ref
- J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communication, 32(4):396--402, 1984.]]Google ScholarCross Ref
- N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.]] Google ScholarDigital Library
- D. R. Cutting, J. O. Pedersen, D. R. Karger, and J. W. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 318--329, Copenhagen, Denmark, 1992.]] Google ScholarDigital Library
- S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th ACM International Conference on Information and Knowledge Management (CIKM), pages 148--155, Bethesda, MD, 1998.]] Google ScholarDigital Library
- E. Frank, C. Chui, and I. H. Witten. Text categorization using compression models. In Proceedings of the Data Compression Conference (DCC), page 555, Snowbird, UT, 2000.]] Google ScholarDigital Library
- E. Gabrilovich and S. Markovitch. Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In Proceedings of the 21st International Conference on Machine Learning (ICML), pages 321--328, Banff, Alberta, Canada, 2004.]] Google ScholarDigital Library
- J. Goodman. A bit of progress in language modeling, extended version. Technical report, Microsoft Research, 2001.]]Google Scholar
- D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.]] Google ScholarDigital Library
- J. He, A.-H. Tan, and C.-L. Tan. A comparative study on Chinese text categorization methods. In PRICAI'2000 International Workshop on Text and Web Mining, pages 24--35, Melbourne, Australia, 2000.]]Google Scholar
- J. He, A.-H. Tan, and C. L. Tan. On machine learning methods for Chinese document categorization. Applied Intelligence, 18(3):311--322, 2003.]] Google ScholarDigital Library
- R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Cambridge, MA, USA, 2001.]] Google ScholarDigital Library
- D. Holmes and R. Forsyth. The federalist revisited: New directions in authorship attribution. Literary and Linguistic Computing, 10(2):111--127, 1995.]]Google ScholarCross Ref
- C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415--425, 2002.]]Google ScholarDigital Library
- P. Jackson and I. Moulinier. Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization. John Benjamins Publishing Co, 2002.]]Google ScholarCross Ref
- T. Jebara, R. Kondor, and A. Howard. Probability product kernels. Journal of Machine Learning Research, 5:819--844, 2004.]] Google ScholarDigital Library
- T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (ECML), pages 137--142, Chemnitz, Germany, 1998.]] Google ScholarDigital Library
- T. Joachims. A statistical learning model of text classification for support vector machines. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 128--136, New Orleans, LA, 2001.]] Google ScholarDigital Library
- T. Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002.]] Google ScholarDigital Library
- D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2000.]] Google ScholarDigital Library
- B. Kessler, G. Nunberg, and H. Schutze. Automatic detection of text genre. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL) and 8th Conference of the European Chapter of the Association for Computational Linguistics (ECACL), pages 32--38, Madrid, Spain, 1997.]] Google ScholarDigital Library
- D. Knuth. The Art of Computer Programming. Addison-Wesley, 3rd edition, 1997.]] Google ScholarDigital Library
- J. D. Lafferty and G. Lebanon. Diffusion kernels on statistical manifolds. Journal of Machine Learning Research, 6(Jan):129--163, 2005.]] Google ScholarDigital Library
- J. D. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 111--119, Orleans, LA, 2001.]] Google ScholarDigital Library
- M.-J. Lee and L.-F. Chien. Automatic acquisition of phrasal knowledge for English-Chinese bilingual information retrieval. In Proceedings of the 21st Annual ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 351--352, Melbourne, Australia, 1998.]] Google ScholarDigital Library
- Y.-B. Lee and S. H. Myaeng. Text genre classification with genre-revealing and subject-revealing features. In Proceedings of The 25th Annual International ACM SIGIR Conference on Research and Development in Information (SIGIR), pages 145--150, Tampere, Finland, 2002.]] Google ScholarDigital Library
- C. S. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the 7th Pacific Symposium on Biocomputing (PSB), pages 566--575, Lihue, HI, 2002.]]Google Scholar
- D. D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proceedings of the 10th European Conference on Machine Learning (ECML), pages 4--15, Chemnitz, Germany, 1998.]] Google ScholarDigital Library
- H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research (JMLR), 2(Feb):419--444, 2001.]] Google ScholarDigital Library
- C. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.]] Google ScholarDigital Library
- Y. Marton, N. Wu, and L. Hellerstein. On compression-based text classification. In Proceedings of the 27th European Conference on IR Research (ECIR), pages 300--314, Santiago de Compostela, Spain, 2005.]] Google ScholarDigital Library
- T. Mitchell. Machine Learning. McGraw Hill, international edition, 1997.]] Google ScholarDigital Library
- R. M. Pampapathi, B. Mirkin, and M. Levene. A suffix tree approach to email filtering, 2005.]]Google Scholar
- A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw Hill, New York, 2nd edition, 1984.]]Google Scholar
- F. Peng, D. Schuurmans, and S. Wang. Augmenting Naive Bayes text classifier with statistical language models. Information Retrieval, 7(3-4):317--345, 2004.]] Google ScholarDigital Library
- J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 275--281, Melbourne, Australia, 1998.]] Google ScholarDigital Library
- D. Ron, Y. Singer, and N. Tishby. The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25(2-3):117--149, 1996.]] Google ScholarDigital Library
- R. Rosenfeld. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270--1278, 2000.]]Google ScholarCross Ref
- R. E. Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, Berkeley, CA, 2002.]]Google Scholar
- R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135--168, 2000.]] Google ScholarDigital Library
- B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.]]Google Scholar
- F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1--47, 2002.]] Google ScholarDigital Library
- J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.]] Google ScholarDigital Library
- N. Slonim, G. Bejerano, S. Fine, and N. Tishby. Discriminative feature selection via multiclass variable memory Markov model. In Proceedings of the 19th International Conference on Machine Learning (ICML), pages 578--585, Sydney, Australia, 2002.]] Google ScholarDigital Library
- E. Stamatatos, N. Fakotakis, and G. Kokkinakis. Automatic text categorisation in terms of genre and author. Computational Linguistics, 26(4):471--495, 2000.]] Google ScholarDigital Library
- W. J. Teahan and D. J. Harper. Using compression based language models for text categorization. In J. Callan, B. Croft, and J. Lafferty, editors, Workshop on Language Modeling and Information Retrieval, pages 83--88, Carnegie Mellon University, 2001.]]Google Scholar
- E. Ukkonen. On-line construction of suffix-trees. Algorithmica, 14:249--260, 1995.]]Google ScholarDigital Library
- C. van Rijsbergen. Information Retrieval. Butterworths, London, UK, 2nd edition, 1979.]] Google ScholarDigital Library
- V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 2nd edition, 2000.]] Google ScholarDigital Library
- S. Vishwanathan and A. Smola. Fast kernels for string and tree matching. In K. Tsuda, B. Scholkopf, and J. Vert, editors, Kernels and Bioinformatics. MIT Press, Cambridge, MA, 2004.]]Google Scholar
- I. H. Witten. Applications of lossless compression in adaptive text mining. In Proceedings of the 34th Annual Conference on Information Sciences and Systems (CISS), Princeton University, New Jersey, 2000.]]Google Scholar
- I. H. Witten, Z. Bray, M. Mahoui, and W. J. Teahan. Text mining: A new frontier for lossless compression. In Proceedings of the 1999 Data Compression Conference (DCC), pages 198--207, Snowbird, Utah, 1999.]] Google ScholarDigital Library
- M. Yamamoto and K. W. Church. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1--30, 2001.]] Google ScholarDigital Library
- J. Yang and W. Wang. CLUSEQ: Efficient and effective sequence clustering. In Proceedings of the 19th International Conference on Data Engineering (ICDE), pages 101--112, Bangalore, India, 2003.]]Google ScholarCross Ref
- Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 42--49, Berkeley, CA, 1999.]] Google ScholarDigital Library
- Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML), pages 412--420, Nashville, TN, 1997.]] Google ScholarDigital Library
- O. Zamir and O. Etzioni. Grouper: A dynamic clustering interface to web search results. Computer Networks, 31(11-16):1361--1374, 1999.]] Google ScholarDigital Library
- C. Zhai. Risk Minimization and Language Modeling in Information Retrieval. PhD thesis, Carnegie Mellon University, 2002.]]Google Scholar
- C. Zhai and J. D. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS), 22(2):179--214, 2004.]] Google ScholarDigital Library
- D. Zhang, X. Chen, and W. S. Lee Text classification with kernels on the multinomial manifold. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 266--273, Salvador, Brazil, 2005.]] Google ScholarDigital Library
- D. Zhang and Y. Dong. Semantic, hierarchical, online clustering of web search results. In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China, 2004.]]Google ScholarCross Ref
- G. K. Zipf. Human Behaviour and the Principle of Least Effort. Addison-Wesley, Cambridge, MA, 1949.]]Google Scholar
Index Terms
- Extracting key-substring-group features for text classification
Recommendations
Chinese text classification by the Naïve Bayes Classifier and the associative classifier with multiple confidence threshold values
Each type of classifier has its own advantages as well as certain shortcomings. In this paper, we take the advantages of the associative classifier and the Naive Bayes Classifier to make up the shortcomings of each other, thus improving the accuracy of ...
Urdu text classification
FIT '09: Proceedings of the 7th International Conference on Frontiers of Information TechnologyThis paper compares statistical techniques for text classification using Naïve Bayes and Support Vector Machines, in context of Urdu language. A large corpus is used for training and testing purpose of the classifiers. However, those classifiers cannot ...
Boosting to correct inductive bias in text classification
CIKM '02: Proceedings of the eleventh international conference on Information and knowledge managementThis paper studies the effects of boosting in the context of different classification methods for text categorization, including Decision Trees, Naive Bayes, Support Vector Machines (SVMs) and a Rocchio-style classifier. We identify the inductive biases ...
Comments