ABSTRACT
Two central problems in text classification are the scarcity of labeled data and the cost of labeling unlabeled data. We address these problems by exploring co-training, an algorithm that uses unlabeled data together with a few labeled examples to boost the performance of a classifier. We experiment with co-training in the email domain. Our results show that the performance of co-training depends on the learning algorithm it uses; in particular, Support Vector Machines significantly outperform Naive Bayes on email classification.
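For illustration, the co-training loop the abstract alludes to can be sketched as follows. This is a minimal sketch in the spirit of Blum and Mitchell's algorithm, not the authors' implementation: it assumes two feature views of each email (for example, subject-word counts versus body-word counts), uses scikit-learn's MultinomialNB and LinearSVC as the two base learners, and runs on synthetic count data. All names and parameters here are illustrative.

```python
# Minimal co-training sketch (Blum & Mitchell style), on two assumed
# "views" of each email, e.g. subject-word vs. body-word counts.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def co_train(view1, view2, y, labeled_idx, rounds=10, grow_per_clf=2):
    """Grow the labeled set by letting each view's classifier
    pseudo-label the unlabeled examples it is most confident about."""
    labeled = set(labeled_idx)
    unlabeled = set(range(len(y))) - labeled
    # Working labels: only entries in `labeled` are ever read before
    # being overwritten with a pseudo-label.
    y_work = np.array(y, dtype=int)
    clf1, clf2 = MultinomialNB(), LinearSVC()
    for _ in range(rounds):
        idx = sorted(labeled)
        clf1.fit(view1[idx], y_work[idx])
        clf2.fit(view2[idx], y_work[idx])
        for clf, view in ((clf1, view1), (clf2, view2)):
            pool = sorted(unlabeled)
            if not pool:
                return clf1, clf2
            # Confidence: class probability for NB, margin for the SVM.
            if hasattr(clf, "predict_proba"):
                conf = clf.predict_proba(view[pool]).max(axis=1)
            else:
                conf = np.abs(clf.decision_function(view[pool]))
            # Simplification: take the top-k overall, rather than a
            # fixed number of positives and negatives per round.
            for j in np.argsort(conf)[::-1][:grow_per_clf]:
                ex = pool[j]
                y_work[ex] = clf.predict(view[ex:ex + 1])[0]
                labeled.add(ex)
                unlabeled.discard(ex)
    return clf1, clf2


# Toy usage on synthetic word-count data (two classes, 10 seed labels).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
view1 = rng.poisson(y[:, None] * 2 + 1, (200, 30))  # "subject" view
view2 = rng.poisson(y[:, None] * 2 + 1, (200, 50))  # "body" view
seed = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
clf1, clf2 = co_train(view1, view2, y, labeled_idx=seed)
print("NB accuracy :", (clf1.predict(view1) == y).mean())
print("SVM accuracy:", (clf2.predict(view2) == y).mean())
```

At each round, each classifier pseudo-labels the unlabeled examples it is most confident about and adds them to the shared labeled pool, so the two views effectively teach each other; the choice of base learner (here Naive Bayes versus a linear SVM) governs how reliable those pseudo-labels are, which is the comparison the abstract reports.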