Abstract
In text classification (TC) and other tasks involving supervised learning, labelled data may be scarce or expensive to obtain. Semi-supervised learning and active learning are two strategies that aim to maximize the effectiveness of the resulting classifiers for a given amount of training effort, and both have been actively investigated for TC in recent years. Much less research has been devoted to a third such strategy, training label cleaning (TLC), which consists in devising ranking functions that sort the original training examples by how likely it is that the human annotator has mislabelled them. Such a ranking provides a convenient means for the annotator to revise the training set and thereby improve its quality. Working in the context of boosting-based learning methods for multilabel classification, we present three different techniques for performing TLC and evaluate them, on three widely used TC benchmarks, by their ability to spot training documents that, for experimental reasons only, we have purposefully mislabelled. We also evaluate the degradation in classification effectiveness that these mislabelled texts bring about, and the extent to which TLC can prevent this degradation.
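To make the TLC idea concrete, the sketch below shows one simple way such a ranking function can be built for a single binary label (the per-label view commonly used in multilabel TC): train a boosting classifier on the noisy training set and rank documents by how strongly the classifier disagrees with their assigned labels. This is a minimal illustration of the general idea only, not one of the paper's three techniques; the model choice, the scoring rule, and the scikit-learn components are our assumptions.

```python
# Illustrative TLC sketch: rank training documents by the margin a
# boosting classifier assigns to their OWN (possibly noisy) label.
# Assumptions, not the paper's method: AdaBoost over tf-idf features,
# margin-based suspicion score, binary 0/1 labels for one category.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

def tlc_ranking(texts, labels):
    """Return training-example indices, most likely mislabelled first.

    texts  : list of str, the training documents
    labels : array-like of 0/1, the human-assigned labels for one category
    """
    labels = np.asarray(labels)
    X = TfidfVectorizer().fit_transform(texts)
    clf = AdaBoostClassifier(n_estimators=200).fit(X, labels)
    # decision_function gives a signed confidence for the positive
    # class; flip its sign for negatively labelled documents so every
    # document gets a margin with respect to its own label. A low
    # margin means the classifier disagrees with the annotator, i.e.
    # the document is a candidate labelling error.
    scores = clf.decision_function(X)
    margins = np.where(labels == 1, scores, -scores)
    return np.argsort(margins)  # ascending: inspect the top first
```

An annotator would then re-examine only the top of the ranking (say, the 50 most suspicious documents) and correct any genuine errors. In practice one would score each document with a classifier trained on the remaining cross-validation folds, since a model fitted on the full noisy set can memorize the very errors it is meant to expose.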