ABSTRACT
Incorporating background knowledge into data mining algorithms is an important but challenging problem. Current approaches in semi-supervised learning require explicit knowledge provided by domain experts, knowledge that is specific to the particular data set. In this study, we propose an ensemble model that couples two sources of information: statistical information derived from the data set, and sense information retrieved from WordNet that is used to build a semantic binary model. We evaluated the efficacy of our combined ensemble model on the Reuters-21578 and 20newsgroups data sets.
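The abstract describes an ensemble that couples a clustering built from data-set statistics with one built from WordNet sense information, but it does not spell out the combination mechanism. A minimal sketch of one standard way to combine two labelings, the co-association (cluster-ensemble) approach: count how often each pair of documents lands in the same cluster across the two views, then keep together only the pairs on which both views agree. The function names, the toy labels, and the full-agreement threshold are illustrative assumptions, not necessarily the paper's method.

```python
def co_association(labelings):
    """Fraction of labelings in which each pair of items shares a cluster."""
    n = len(labelings[0])
    m = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1.0 / len(labelings)
    return m

def consensus_clusters(labelings, threshold=1.0):
    """Connected components over pairs with co-association >= threshold.

    With threshold=1.0, two documents end up together only when every
    base clustering (statistical and semantic) agrees on them.
    """
    m = co_association(labelings)
    n = len(m)
    cluster = [-1] * n
    next_id = 0
    for i in range(n):
        if cluster[i] == -1:
            cluster[i] = next_id
            stack = [i]                      # DFS over the agreement graph
            while stack:
                u = stack.pop()
                for v in range(n):
                    if cluster[v] == -1 and m[u][v] >= threshold:
                        cluster[v] = next_id
                        stack.append(v)
            next_id += 1
    return cluster

# Two views of five documents: one labeling from statistical features
# (e.g., TF-IDF), one from WordNet-derived semantic features.
stat_labels = [0, 0, 1, 1, 2]
sem_labels  = [0, 0, 0, 1, 1]
print(consensus_clusters([stat_labels, sem_labels]))  # → [0, 0, 1, 2, 3]
```

Only documents 0 and 1 are grouped, because they are the only pair both views place in the same cluster; documents on which the views disagree become singletons under the strict threshold.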
- Bradley P., Bennett K., Demiriz A. Constrained k-means clustering. Microsoft Research Technical Report MSR-TR-2000-65, 2000.
- Hotho A., Staab S., Stumme G. WordNet improves text document clustering. In Proc. of the SIGIR 2003 Semantic Web Workshop, 2003, 541--544.
- Gao J., Tan P. N., Cheng H. Semi-supervised clustering with partial background information. In Proc. of SIAM Int'l Conf. on Data Mining, Bethesda, MD, 2006.
- Mann H. B., Whitney D. R. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18, 1947, 50--60.
- Miller G. A. WordNet: a lexical database for English. Communications of the ACM, 38(11), 1995, 39--41.
- Sedding J., Kazakov D. WordNet-based text document clustering. In Proc. of the 3rd Workshop on Robust Methods in Analysis of Natural Language Data, 2004, 104--113.
- Steinbach M., Karypis G., Kumar V. A comparison of document clustering techniques. In Proc. of KDD Workshop on Text Mining, 2000.
- Termier A., Rousset M.-C., Sebag M. Combining statistics and semantics for word and document clustering. In Proc. of IJCAI, 2001, 49--54.
- Topchy A., Jain A. K., Punch W. A mixture model for clustering ensembles. In Proc. of SIAM Conference on Data Mining, 2004, 379--390.
- Wu Z., Palmer M. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994, 133--138.