Is unlabeled data suitable for multiclass SVM-based web page classification?

Authors:
Arkaitz Zubiaga, Víctor Fresno, Raquel Martínez
NLP & IR Group at UNED
SemiSupLearn '09: Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, June 2009, Pages 28–36

Published: 4 June 2009

ABSTRACT

Support Vector Machines (SVMs) are an interesting and effective approach to automated classification tasks. Although SVMs by nature handle only binary, supervised problems, several works have extended them to multiclass and semi-supervised settings. A previous study on supervised and semi-supervised SVM classification over binary taxonomies showed that the latter clearly outperforms the former, demonstrating the suitability of unlabeled data for the learning phase in this kind of task. However, the suitability of unlabeled data for multiclass tasks using SVMs has never been tested before. In this work, we study whether unlabeled data can improve results for multiclass web page classification using Support Vector Machines. We conclude by encouraging reliance on labeled data only, both to improve (or at least equal) performance and to reduce computational cost.
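As an illustration of how binary classifiers can be combined into a multiclass one, the sketch below shows the one-vs-rest decision rule, a common decomposition that is not necessarily the scheme evaluated in the paper. The class labels, feature vectors, weights, and the names `decision_value` and `predict_one_vs_rest` are all hypothetical; in practice each (weights, bias) pair would come from training a binary SVM on "class k versus the rest".

```python
from typing import Dict, List, Tuple

# A binary linear scorer: (weight vector, bias). Hypothetical stand-in
# for a trained binary SVM separating one class from all others.
BinaryScorer = Tuple[List[float], float]


def decision_value(scorer: BinaryScorer, x: List[float]) -> float:
    """Signed score w.x + b of one binary linear classifier."""
    weights, bias = scorer
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict_one_vs_rest(scorers: Dict[str, BinaryScorer], x: List[float]) -> str:
    """Return the class whose binary scorer gives the largest decision value."""
    return max(scorers, key=lambda label: decision_value(scorers[label], x))


# Hand-picked weights for three made-up web page categories (assumption:
# features are three normalized term scores, one per topic).
scorers = {
    "sports": ([1.0, 0.0, 0.0], -0.5),
    "finance": ([0.0, 1.0, 0.0], -0.5),
    "health": ([0.0, 0.0, 1.0], -0.5),
}

if __name__ == "__main__":
    for page in ([0.9, 0.1, 0.0], [0.1, 0.7, 0.2], [0.2, 0.3, 0.8]):
        print(page, "->", predict_one_vs_rest(scorers, page))
```

With K classes this rule needs K binary classifiers, whereas a one-vs-one scheme would need K(K-1)/2 and a vote among them; either way, the binary learner itself is left unchanged.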

References

B. E. Boser, I. Guyon and V. Vapnik. 1992. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the 5th Annual Workshop on Computational Learning Theory.
C. Campbell. 2000. Algorithmic Approaches to Training Support Vector Machines: A Survey. Proceedings of ESANN'2000, European Symposium on Artificial Neural Networks.
O. Chapelle, M. Chi and A. Zien. 2006. A Continuation Method for Semi-supervised SVMs. Proceedings of ICML'06, the 23rd International Conference on Machine Learning.
O. Chapelle, V. Sindhwani and S. Keerthi. 2008. Optimization Techniques for Semi-Supervised Support Vector Machines. J. Mach. Learn. Res.
C. Cortes and V. Vapnik. 1995. Support Vector Network. Machine Learning.
C.-W. Hsu and C.-J. Lin. 2002. A Comparison of Methods for Multiclass Support Vector Machines. IEEE Transactions on Neural Networks.
T. Joachims. 1998. Text Categorization with Support Vector Machines: Learning with many Relevant Features. Proceedings of ECML98, 10th European Conference on Machine Learning.
T. Joachims. 1999. Transductive Inference for Text Classification Using Support Vector Machines. Proceedings of ICML99, 16th International Conference on Machine Learning.
J. Kivinen, E. J. Smola and R. C. Williamson. 2002. Learning with Kernels.
T. Mitchell. 1997. Machine Learning. McGraw Hill.
H.-N. Qi, J.-G. Yang, Y.-W. Zhong and C. Deng. 2004. Multi-class SVM Based Remote Sensing Image Classification and its Semi-supervised Improvement Scheme. Proceedings of the 3rd ICMLC.
X. Qi and B. D. Davison. 2007. Web Page Classification: Features and Algorithms. Technical Report LU-CSE-07-010.
B. Schölkopf and A. Smola. 1999. Advances in Kernel Methods: Support Vector Learning. MIT Press.
F. Sebastiani. 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys, pp. 1-47.
M. P. Sinka and D. W. Corne. 2002. A New Benchmark Dataset for Web Document Clustering. Soft Computing Systems.
C. M. Tan, Y. F. Wang and C. D. Lee. 2002. The Use of Bigrams to Enhance Text Categorization. Information Processing and Management.
J. Weston and C. Watkins. 1999. Multi-class Support Vector Machines. Proceedings of ESANN, the European Symposium on Artificial Neural Networks.
L. Xu and D. Schuurmans. 2005. Unsupervised and Semi-supervised Multiclass Support Vector Machines. Proceedings of AAAI'05, the 20th National Conference on Artificial Intelligence.
Z. Xu, R. Jin, J. Zhu, I. King and M. R. Lyu. 2007. Efficient Convex Optimization for Transductive Support Vector Machine. Advances in Neural Information Processing Systems.
Y. Yajima and T.-F. Kuo. 2006. Optimization Approaches for Semi-Supervised Multiclass Classification. Proceedings of ICDM '06 Workshops, the 6th International Conference on Data Mining.

Index Terms

Is unlabeled data suitable for multiclass SVM-based web page classification?
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
  2. Modeling and simulation
    1. Model development and analysis
      1. Model verification and validation
2. Information systems
  1. Information retrieval
  2. Information storage systems

Published in
SemiSupLearn '09: Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing
June 2009, 96 pages
ISBN: 9781932432381
Program Chairs: Qin Iris Wang (AT&T), Kevin Duh (University of Washington), Dekang Lin (Google Research)
Publisher: Association for Computational Linguistics, United States
Qualifiers: research-article

Acceptance Rates
SemiSupLearn '09 paper acceptance rate: 10 of 17 submissions, 59%
Overall acceptance rate: 10 of 17 submissions, 59%
Article Metrics
- Total Citations: 1
- Total Downloads: 274
- Downloads (last 12 months): 25
- Downloads (last 6 weeks): 2