Abstract
In text classification (TC) and other tasks involving supervised learning, labelled data may be scarce or expensive to obtain. Semi-supervised learning and active learning are two strategies that aim to maximize the effectiveness of the resulting classifiers for a given amount of training effort, and both have been actively investigated for TC in recent years. Much less research has been devoted to a third such strategy, training label cleaning (TLC), which consists in devising ranking functions that sort the original training examples by how likely it is that the human annotator has mislabelled them. Such a ranking provides a convenient means for the annotator to revise the training set and thereby improve its quality. Working in the context of boosting-based learning methods for multilabel classification, we present three different techniques for performing TLC and evaluate them, on three widely used TC benchmarks, by their ability to spot training documents that, for experimental reasons only, we have purposefully mislabelled. We also evaluate the degradation in classification effectiveness that these mislabelled texts bring about, and the extent to which TLC can prevent this degradation.
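To make the TLC idea concrete, the sketch below shows one simple way such a ranking function can be built for a single binary label (the per-label view commonly used in multilabel TC): train a boosting classifier on the noisy training set and rank documents by how strongly the classifier disagrees with their assigned labels. This is a minimal illustration of the general idea only, not one of the paper's three techniques; the model choice, the scoring rule, and the scikit-learn components are our assumptions.

```python
# Illustrative TLC sketch: rank training documents by the margin a
# boosting classifier assigns to their OWN (possibly noisy) label.
# Assumptions, not the paper's method: AdaBoost over tf-idf features,
# margin-based suspicion score, binary 0/1 labels for one category.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

def tlc_ranking(texts, labels):
    """Return training-example indices, most likely mislabelled first.

    texts  : list of str, the training documents
    labels : array-like of 0/1, the human-assigned labels for one category
    """
    labels = np.asarray(labels)
    X = TfidfVectorizer().fit_transform(texts)
    clf = AdaBoostClassifier(n_estimators=200).fit(X, labels)
    # decision_function gives a signed confidence for the positive
    # class; flip its sign for negatively labelled documents so every
    # document gets a margin with respect to its own label. A low
    # margin means the classifier disagrees with the annotator, i.e.
    # the document is a candidate labelling error.
    scores = clf.decision_function(X)
    margins = np.where(labels == 1, scores, -scores)
    return np.argsort(margins)  # ascending: inspect the top first
```

An annotator would then re-examine only the top of the ranking (say, the 50 most suspicious documents) and correct any genuine errors. In practice one would score each document with a classifier trained on the remaining cross-validation folds, since a model fitted on the full noisy set can memorize the very errors it is meant to expose.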