
Improving Text Classification Accuracy by Training Label Cleaning

Published: 01 November 2013

Abstract

In text classification (TC) and other tasks involving supervised learning, labelled data may be scarce or expensive to obtain. Semi-supervised learning and active learning are two strategies that aim to maximize the effectiveness of the resulting classifiers for a given amount of training effort, and both have been actively investigated for TC in recent years. Much less research has been devoted to a third such strategy, training label cleaning (TLC), which consists in devising ranking functions that sort the original training examples by how likely it is that the human annotator has mislabelled them. Such a ranking provides a convenient means for the human annotator to revise the training set so as to improve its quality. Working in the context of boosting-based learning methods for multilabel classification, we present three different techniques for performing TLC and evaluate them, on three widely used TC benchmarks, by their capability of spotting training documents that, for experimental reasons only, we have purposefully mislabelled. We also evaluate the degradation in classification effectiveness that these mislabelled texts bring about, and the extent to which training label cleaning can prevent this degradation.
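As a rough illustration of the TLC idea sketched above, and not of the paper's actual boosting-based techniques, a ranking function can score each training document by the probability that a classifier, trained on the (possibly noisy) training set itself, assigns to the document's given label: the lower that probability, the more suspicious the label. The minimal sketch below assumes scikit-learn and single-label data; all function names are illustrative.

```python
# Hypothetical confidence-based TLC ranking (illustrative only; this is
# NOT the paper's MP-Boost-based method). Assumes scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rank_suspects(texts, labels):
    """Return training-document indices, most suspicious label first."""
    X = TfidfVectorizer().fit_transform(texts)            # bag-of-words features
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    proba = clf.predict_proba(X)                          # P(class | document)
    col = {c: j for j, c in enumerate(clf.classes_)}
    # A low probability assigned to a document's *given* label suggests
    # the annotator may have mislabelled it.
    score = [proba[i, col[y]] for i, y in enumerate(labels)]
    return sorted(range(len(labels)), key=score.__getitem__)
```

The annotator would then inspect documents from the top of the returned ranking and correct any labels that turn out to be wrong, which is exactly the revision workflow the abstract describes.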

References

1. Abney, S., Schapire, R. E., and Singer, Y. 1999. Boosting applied to tagging and PP attachment. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC'99). 38--45.
2. Agarwal, S., Godbole, S., Punjani, D., and Roy, S. 2007. How much noise is too much: A study in automatic text classification. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM'07). 3--12.
3. Argamon-Engelson, S. and Dagan, I. 1999. Committee-based sample selection for probabilistic classifiers. J. Artif. Intell. Res. 11, 335--360.
4. Breiman, L. 1996. Bagging predictors. Machine Learn. 24, 2, 123--140.
5. Brodley, C. E. and Friedl, M. A. 1996. Identifying and eliminating mislabeled training instances. In Proceedings of the 13th Conference of the American Association for Artificial Intelligence (AAAI'96). 799--805.
6. Chapelle, O., Schölkopf, B., and Zien, A., Eds. 2006. Semi-Supervised Learning. MIT Press, Cambridge, MA.
7. Cohn, D., Atlas, L., and Ladner, R. 1994. Improving generalization with active learning. Machine Learn. 15, 2, 201--221.
8. Dickinson, M. and Meurers, W. D. 2003. Detecting errors in part-of-speech annotation. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL'03). 107--114.
9. Dietterich, T. G. 2000. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learn. 40, 2, 139--157.
10. Eskin, E. 2000. Detecting errors within a corpus using anomaly detection. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics (NAACL'00). 148--153.
11. Esuli, A. and Sebastiani, F. 2009. Training data cleaning for text classification. In Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR'09). 29--41.
12. Esuli, A. and Sebastiani, F. 2010. Machines that learn how to code open-ended survey data. Int. J. Market Res. 52, 6, 775--800.
13. Esuli, A., Fagni, T., and Sebastiani, F. 2006. MP-Boost: A multiple-pivot boosting algorithm and its application to text categorization. In Proceedings of the 13th International Symposium on String Processing and Information Retrieval (SPIRE'06). 1--12.
14. Freund, Y., Seung, H. S., Shamir, E., and Tishby, N. 1992. Information, prediction, and query by committee. In Advances in Neural Information Processing Systems, Vol. 5, MIT Press, Cambridge, MA, 483--490.
15. Friedman, J., Hastie, T., and Tibshirani, R. J. 2000. Additive logistic regression: A statistical view of boosting. Ann. Statist. 28, 2, 337--374.
16. Fukumoto, F. and Suzuki, Y. 2004. Correcting category errors in text classification. In Proceedings of the 20th International Conference on Computational Linguistics (COLING'04). 868--874.
17. Galavotti, L., Sebastiani, F., and Simi, M. 2000. Experiments on the use of feature selection and negative evidence in automated text categorization. In Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries (ECDL'00). 59--68.
18. Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Comput. 4, 1, 1--58.
19. Grady, C. and Lease, M. 2010. Crowdsourcing document relevance assessment with Mechanical Turk. In Proceedings of the NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. 172--179.
20. Hersh, W., Buckley, C., Leone, T., and Hickam, D. 1994. OHSUMED: An interactive retrieval evaluation and new large text collection for research. In Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR'94). 192--201.
21. Järvelin, K. and Kekäläinen, J. 2000. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd ACM International Conference on Research and Development in Information Retrieval (SIGIR'00). 41--48.
22. John, G. H. 1995. Robust decision trees: Removing outliers from databases. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD'95). 174--179.
23. Lewis, D. D. 2004. Reuters-21578 text categorization test collection Distribution 1.0 README file (v 1.3). http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt.
24. Lewis, D. D., Schapire, R. E., Callan, J. P., and Papka, R. 1996. Training algorithms for linear text classifiers. In Proceedings of the 19th ACM International Conference on Research and Development in Information Retrieval (SIGIR'96). 298--306.
25. Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. 2004. RCV1: A new benchmark collection for text categorization research. J. Machine Learn. Res. 5, 361--397.
26. Maclin, R. and Opitz, D. W. 1997. An empirical evaluation of bagging and boosting. In Proceedings of the 14th Conference of the American Association for Artificial Intelligence (AAAI'97). 546--551.
27. Malik, H. H. and Bhardwaj, V. S. 2011. Automatic training data cleaning for text classification. In Proceedings of the ICDM Workshop on Domain-Driven Data Mining. 442--449.
28. Murata, M., Utiyama, M., Uchimoto, K., Isahara, H., and Ma, Q. 2005. Correction of errors in a verb modality corpus for machine translation with a machine-learning method. ACM Trans. Asian Lang. Inform. Process. 4, 1, 18--37.
29. Nakagawa, T. and Matsumoto, Y. 2002. Detecting errors in corpora using support vector machines. In Proceedings of the 19th International Conference on Computational Linguistics (COLING'02). 1--7.
30. Resta, G. 2012. On the expected average precision of the random ranker. Tech. rep. IIT TR-04/2012, Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Pisa, IT. http://www.iit.cnr.it/sites/default/files/TR-04-2012.pdf.
31. Schapire, R. E. and Singer, Y. 1999. Improved boosting using confidence-rated predictions. Machine Learn. 37, 3, 297--336.
32. Schapire, R. E. and Singer, Y. 2000. BoosTexter: A boosting-based system for text categorization. Machine Learn. 39, 2/3, 135--168.
33. Schapire, R. E. and Freund, Y. 2012. Boosting: Foundations and Algorithms. MIT Press, Cambridge, MA.
34. Shinnou, H. 2001. Detection of errors in training data by using a decision list and AdaBoost. In Proceedings of the IJCAI Workshop on Text Learning Beyond Supervision.
35. Sindhwani, V. and Keerthi, S. S. 2006. Large scale semi-supervised linear SVMs. In Proceedings of the 29th ACM International Conference on Research and Development in Information Retrieval (SIGIR'06). 477--484.
36. Snow, R., O'Connor, B., Jurafsky, D., and Ng, A. Y. 2008. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'08). 254--263.
37. Vinciarelli, A. 2005. Noisy text categorization. IEEE Trans. Pattern Anal. Mach. Intell. 27, 12, 1882--1895.
38. Yang, Y. 1994. Expert network: Effective and efficient learning from human decisions in text categorisation and retrieval. In Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR'94). 13--22.
39. Yang, Y. 1999. An evaluation of statistical approaches to text categorization. Inf. Retriev. 1, 1/2, 69--90.
40. Yih, W.-T., McCann, R., and Kolcz, A. 2007. Improving spam filtering by detecting gray mail. In Proceedings of the 4th Conference on Email and Anti-Spam (CEAS'07).
41. Yokoyama, M., Matsui, T., and Ohwada, H. 2005. Detecting and revising misclassifications using ILP. In Proceedings of the 8th International Conference on Discovery Science (DS'05). 75--80.
42. Yu, K., Zhu, S., Xu, W., and Gong, Y. 2008. Non-greedy active learning for text categorization using convex transductive experimental design. In Proceedings of the 31st ACM International Conference on Research and Development in Information Retrieval (SIGIR'08). 635--642.
43. Zeng, X. and Martinez, T. R. 2001. An algorithm for correcting mislabeled data. Intell. Data Anal. 5, 6, 491--502.
44. Zhu, X. and Goldberg, A. B. 2009. Introduction to Semi-Supervised Learning. Morgan and Claypool, San Rafael, CA.


Reviews

Jun Ping Ng

A large-scale study on the use of training label cleaning (TLC) to improve text classification is described in this paper. The purpose of TLC is to identify potentially mislabeled instances in a training dataset and to flag them for closer inspection by human annotators. The underlying premise is that incorrect annotations can have a significant, adverse impact on the performance of classifiers. TLC differs slightly from active learning, where potentially useful, unlabeled instances are flagged for human annotation.

The paper makes use of several well-known datasets and examines the impact that incorrect annotations can have on classifier performance. The authors also detail three main techniques for TLC and evaluate how well these help identify incorrectly annotated instances, resulting in improvements to text classification performance.

This well-written paper was a joy to read. The experiments are extensive and sound. The authors share many useful insights into the importance of annotation integrity, and also present an illuminating discussion of the results they obtained. Readers who want to find out more about TLC may be slightly disappointed, as the paper does not go into much depth on the actual techniques used; however, TLC is already well covered in the existing literature [1,2], so this is not a big problem. Some parts of the methodology and experiments could have been better structured for a more fluent read (for example, the section on using support vector machines (SVMs) to refute doubts about the use of MP-Boost seems a lot like an afterthought), but the paper is worth reading nonetheless for the many observations and insights it contains.

Online Computing Reviews Service
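The evaluation protocol summarized above can be made concrete with a short sketch: flip a small fraction of the training labels on purpose, produce a suspicion ranking with a TLC method, and measure how well the flipped documents cluster near the top of the ranking, for instance with average precision. This is an illustrative reconstruction under assumed names and a hypothetical 10% noise rate, not the authors' actual experimental code.

```python
# Hypothetical reconstruction of the mislabelling-detection evaluation:
# inject artificial label noise, then score a suspicion ranking by how
# early it surfaces the flipped documents. Names and noise rate assumed.
import random

def inject_noise(labels, classes, fraction=0.10, seed=0):
    """Flip a given fraction of labels to a different random class."""
    rng = random.Random(seed)
    flipped = set(rng.sample(range(len(labels)), int(fraction * len(labels))))
    noisy = [rng.choice([c for c in classes if c != y]) if i in flipped else y
             for i, y in enumerate(labels)]
    return noisy, flipped

def average_precision(ranking, flipped):
    """Average precision of a suspicion ranking at spotting flipped labels."""
    hits, ap = 0, 0.0
    for k, idx in enumerate(ranking, start=1):  # k = rank position (1-based)
        if idx in flipped:
            hits += 1
            ap += hits / k                       # precision@k at each hit
    return ap / max(len(flipped), 1)
```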


              • Published in

ACM Transactions on Information Systems, Volume 31, Issue 4
                November 2013
                192 pages
                ISSN: 1046-8188
                EISSN: 1558-2868
                DOI: 10.1145/2536736

                Copyright © 2013 ACM

                Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

                Publisher

                Association for Computing Machinery

                New York, NY, United States

                Publication History

                • Published: 1 November 2013
                • Accepted: 1 June 2013
                • Revised: 1 April 2013
                • Received: 1 June 2012
Published in TOIS Volume 31, Issue 4


                Qualifiers

                • research-article
                • Research
                • Refereed
