Machine learning in automated text categorisation
December 1999
1999 Technical Report
Publisher:
  • Centre National de la Recherche Scientifique
  • 31 Chemin Joseph Aiguier 13274 Marseille Cedex Z Paris
  • France
Published: 06 December 1999
Abstract

The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to the early '60s. Until the late '80s, the most effective approach to the problem seemed to be that of manually building automatic classifiers by means of knowledge-engineering techniques, i.e. manually defining a set of rules encoding expert knowledge on how to classify documents under a given set of categories. In the '90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed an increased and renewed interest, prompted by which the machine learning paradigm to automatic classifier construction has emerged and definitely superseded the knowledge-engineering approach. Within the machine learning paradigm, a general inductive process (called the learner) automatically builds a classifier (also called the rule, or the hypothesis) by "learning", from a set of previously classified documents, the characteristics of one or more categories. The advantages of this approach are very good effectiveness, considerable savings in terms of expert manpower, and domain independence. In this survey we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues pertaining to document indexing, classifier construction, and classifier evaluation will be discussed in detail. A final section will be devoted to the techniques that have specifically been devised for an emerging application such as the automatic classification of Web pages into "Yahoo!-like" hierarchically structured sets of categories.
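
The learner/classifier distinction described in the abstract can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes learner with add-one smoothing and naive whitespace indexing; the abstract does not name any particular method, so this choice, the function names `train` and `classify`, and the toy training set are all hypothetical and serve only as an illustration.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Learner: induce a classifier from (text, category) pairs."""
    doc_counts = Counter()               # documents seen per category
    term_counts = defaultdict(Counter)   # term frequencies per category
    vocabulary = set()
    for text, category in labelled_docs:
        terms = text.lower().split()     # naive whitespace indexing (assumption)
        doc_counts[category] += 1
        term_counts[category].update(terms)
        vocabulary.update(terms)
    total_docs = sum(doc_counts.values())

    def classify(text):
        """Classifier (the induced hypothesis): return the most probable category."""
        scores = {}
        for category, n_docs in doc_counts.items():
            score = math.log(n_docs / total_docs)          # log prior
            denom = sum(term_counts[category].values()) + len(vocabulary)
            for term in text.lower().split():
                if term in vocabulary:   # ignore terms never seen in training
                    # add-one smoothed log likelihood of the term
                    score += math.log((term_counts[category][term] + 1) / denom)
            scores[category] = score
        return max(scores, key=scores.get)

    return classify


# Hypothetical, previously classified documents (illustrative only).
training_set = [
    ("stocks fall as markets react to interest rates", "finance"),
    ("central bank raises interest rates again", "finance"),
    ("team wins the cup final after extra time", "sport"),
    ("striker scores twice in the league match", "sport"),
]
classifier = train(training_set)
print(classifier("bank cuts interest rates"))
print(classifier("late goal wins the match"))
```

Here `train` plays the role of the learner and the function it returns plays the role of the classifier; no category-specific rules are written by hand, which is the contrast with the knowledge-engineering approach.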

Cited By

  1. Xiaofei Z, Li G, Jianlong T and Wenhan J Theme word subspace method for text document categorization Proceedings of the Data Mining and Intelligent Knowledge Management Workshop, (1-7)
  2. Si L, Yu D, Kihara D and Fang Y (2008). Combining gene sequence similarity and textual information for gene function annotation in the literature, Information Retrieval, 11:5, (389-404), Online publication date: 1-Oct-2008.
  3. Wang W, Do D and Lin X Term graph model for text classification Proceedings of the First international conference on Advanced Data Mining and Applications, (19-30)
  4. Téllez-Valero A, Montes-y-Gómez M and Villaseñor-Pineda L A machine learning approach to information extraction Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing, (539-547)
  5. Jeong O and Cho D A rule filtering component based on recommendation agent system for classifying email document Proceedings of the 5th international conference on Parallel and Distributed Computing: applications and Technologies, (729-735)
  6. Wang G and Lochovsky F Feature selection with conditional mutual information maximin in text categorization Proceedings of the thirteenth ACM international conference on Information and knowledge management, (342-349)
  7. Kolari P and Joshi A (2004). Web Mining, Computing in Science and Engineering, 6:4, (49-53), Online publication date: 1-Jul-2004.
  8. Moissinac J, Yvon F and Hazez S Automating indexing of classes and conferences Coupling approaches, coupling media and coupling languages for information retrieval, (885-894)
  9. Villar J, Benavides C, Garcia I, Alonso A and Rodriguez F (2018). A web-based multi-agent system approach to document engineering, International Journal of Web Engineering and Technology, 1:4, (437-453), Online publication date: 1-Feb-2004.
  10. Kim S and Chung C Ranking web documents with dynamic evaluation by expert groups Proceedings of the 15th international conference on Advanced information systems engineering, (437-448)
  11. Gee K Using latent semantic indexing to filter spam Proceedings of the 2003 ACM symposium on Applied computing, (460-464)
  12. Sinka M and Corne D Evolving better stoplists for document clustering and web intelligence Design and application of hybrid intelligent systems, (1015-1023)
  13. Tan C, Wang Y and Lee C (2018). The use of bigrams to enhance text categorization, Information Processing and Management: an International Journal, 38:4, (529-546), Online publication date: 12-Jul-2002.
  14. Tong S and Koller D (2002). Support vector machine active learning with applications to text classification, The Journal of Machine Learning Research, 2, (45-66), Online publication date: 1-Mar-2002.
  15. Zaïane O and Antonie M (2002). Classifying text documents by associating terms with text categories, Australian Computer Science Communications, 24:2, (215-222), Online publication date: 1-Jan-2002.
  16. Zelikovitz S and Hirsh H Using LSI for text classification in the presence of background text Proceedings of the tenth international conference on Information and knowledge management, (113-118)
  17. Sebastiani F, Sperduti A and Valdambrini N An improved boosting algorithm and its application to text categorization Proceedings of the ninth international conference on Information and knowledge management, (78-85)
Contributors
  • Italian National Research Council
