Machine learning in automated text categorization

Published: 01 March 2002

Abstract

The automated categorization (or classification) of texts into predefined categories has attracted booming interest over the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (which consists of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in expert labor, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
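The three problems named in the abstract can be illustrated with a minimal sketch. It is not the survey's own method: it assumes a bag-of-words representation, picks multinomial naive Bayes with add-one smoothing as one possible learner among the many the survey covers, and evaluates with precision and recall. The function names (`tokenize`, `train`, `classify`, `precision_recall`) and the toy documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Document representation: a lowercase bag of whitespace-separated words."""
    return text.lower().split()


def train(labeled_docs):
    """Classifier construction: collect the counts a multinomial naive Bayes
    learner needs from a set of preclassified (text, label) pairs."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for term in tokenize(text):
            term_counts[label][term] += 1
            vocabulary.add(term)
    return class_doc_counts, term_counts, vocabulary


def classify(model, text):
    """Return the category with the highest log posterior score."""
    class_doc_counts, term_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_doc_counts):
        score = math.log(class_doc_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(term_counts[label].values()) + len(vocabulary)
        for term in tokenize(text):
            if term in vocabulary:  # terms unseen in training are ignored
                score += math.log((term_counts[label][term] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision_recall(model, test_docs, target):
    """Classifier evaluation: precision and recall for one target category."""
    tp = fp = fn = 0
    for text, label in test_docs:
        predicted = classify(model, text)
        if predicted == target and label == target:
            tp += 1
        elif predicted == target:
            fp += 1
        elif label == target:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: preclassified training documents and a held-out test set.
TRAIN = [
    ("wheat corn harvest prices", "grain"),
    ("corn exports rise", "grain"),
    ("bank raises interest rates", "finance"),
    ("interest rates and bond prices", "finance"),
]
TEST = [("wheat harvest", "grain"), ("bond interest", "finance")]

model = train(TRAIN)
```

Calling `classify(model, "wheat harvest")` assigns a category to a new document, and `precision_recall(model, TEST, "grain")` scores the classifier on held-out data. Keeping the test documents separate from the training documents reflects the usual practice of evaluating a learned classifier on documents it has not seen.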

