Machine learning in automated text categorisation

December 1999

1999 Technical Report

Author:
Fabrizio Sebastiani

Publisher:

Centre National de la Recherche Scientifique
31 Chemin Joseph Aiguier 13274 Marseille Cedex Z Paris
France

Published: 06 December 1999

Abstract

The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to the early '60s. Until the late '80s, the most effective approach to the problem seemed to be that of manually building automatic classifiers by means of knowledge-engineering techniques, i.e. manually defining a set of rules encoding expert knowledge on how to classify documents under a given set of categories. In the '90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed an increased and renewed interest, prompted by which the machine learning paradigm of automatic classifier construction has emerged and definitively superseded the knowledge-engineering approach. Within the machine learning paradigm, a general inductive process (called the learner) automatically builds a classifier (also called the rule, or the hypothesis) by "learning", from a set of previously classified documents, the characteristics of one or more categories. The advantages of this approach are very good effectiveness, considerable savings in terms of expert manpower, and domain independence. In this survey we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues pertaining to document indexing, classifier construction, and classifier evaluation will be discussed in detail. A final section will be devoted to the techniques that have specifically been devised for an emerging application, namely the automatic classification of Web pages into "Yahoo!-like" hierarchically structured sets of categories.
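
As a rough illustration of the learner/classifier distinction described in the abstract, the following Python sketch walks through the three stages it names: document indexing, classifier construction, and classifier evaluation. The choice of a multinomial Naive Bayes learner with add-one smoothing is an assumption made here for brevity, not a method singled out by the report; the function names (index, learn, accuracy) and the toy training and test documents are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def index(text):
    """Document indexing: map raw text to a bag of lowercase word counts."""
    return Counter(text.lower().split())


def learn(training_set):
    """The learner: build a classifier from (text, category) pairs.

    Assumption of this sketch: multinomial Naive Bayes with add-one smoothing.
    """
    doc_counts = Counter()              # number of training documents per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocabulary = set()
    for text, category in training_set:
        bag = index(text)
        doc_counts[category] += 1
        word_counts[category].update(bag)
        vocabulary.update(bag)
    total_docs = sum(doc_counts.values())

    def classify(text):
        """The learned classifier (hypothesis): return the most probable category."""
        bag = index(text)
        best_category, best_score = None, -math.inf
        for category in sorted(doc_counts):
            total_words = sum(word_counts[category].values())
            # Log prior of the category.
            score = math.log(doc_counts[category] / total_docs)
            for word, count in bag.items():
                if word not in vocabulary:
                    continue  # ignore words never seen during training
                # Smoothed log likelihood of the word given the category.
                p = (word_counts[category][word] + 1) / (total_words + len(vocabulary))
                score += count * math.log(p)
            if score > best_score:
                best_category, best_score = category, score
        return best_category

    return classify


def accuracy(classifier, test_set):
    """Classifier evaluation: fraction of test documents labelled correctly."""
    correct = sum(classifier(text) == category for text, category in test_set)
    return correct / len(test_set)


# Hypothetical, hand-made data: previously classified documents and a held-out test set.
TRAIN = [
    ("the team won the football match", "sport"),
    ("the striker scored a late goal", "sport"),
    ("the bank raised interest rates", "finance"),
    ("shares fell as the market closed", "finance"),
]
TEST = [
    ("a goal decided the match", "sport"),
    ("interest rates and the market", "finance"),
]

classifier = learn(TRAIN)
print(accuracy(classifier, TEST))
```

Note that nothing in learn is specific to the two toy categories: the same code yields a different classifier when given differently labelled documents, which is the domain independence the abstract refers to.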

Contributors

Fabrizio Sebastiani
Italian National Research Council
