ABSTRACT
Incorporating background knowledge into data mining algorithms is an important but challenging problem. Current approaches in semi-supervised learning require explicit knowledge provided by domain experts, knowledge that is specific to the particular data set. In this study, we propose an ensemble model that couples two sources of information: statistical information derived from the data set, and sense information retrieved from WordNet that is used to build a semantic binary model. We evaluated the efficacy of our combined ensemble model on the Reuters-21578 and 20newsgroups data sets.
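The abstract describes an ensemble that couples a clustering built from data-set statistics with one built from WordNet sense information, but it does not spell out the combination mechanism. A minimal sketch of one standard way to combine two labelings, the co-association (cluster-ensemble) approach: count how often each pair of documents lands in the same cluster across the two views, then keep together only the pairs on which both views agree. The function names, the toy labels, and the full-agreement threshold are illustrative assumptions, not necessarily the paper's method.

```python
def co_association(labelings):
    """Fraction of labelings in which each pair of items shares a cluster."""
    n = len(labelings[0])
    m = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1.0 / len(labelings)
    return m

def consensus_clusters(labelings, threshold=1.0):
    """Connected components over pairs with co-association >= threshold.

    With threshold=1.0, two documents end up together only when every
    base clustering (statistical and semantic) agrees on them.
    """
    m = co_association(labelings)
    n = len(m)
    cluster = [-1] * n
    next_id = 0
    for i in range(n):
        if cluster[i] == -1:
            cluster[i] = next_id
            stack = [i]                      # DFS over the agreement graph
            while stack:
                u = stack.pop()
                for v in range(n):
                    if cluster[v] == -1 and m[u][v] >= threshold:
                        cluster[v] = next_id
                        stack.append(v)
            next_id += 1
    return cluster

# Two views of five documents: one labeling from statistical features
# (e.g., TF-IDF), one from WordNet-derived semantic features.
stat_labels = [0, 0, 1, 1, 2]
sem_labels  = [0, 0, 0, 1, 1]
print(consensus_clusters([stat_labels, sem_labels]))  # → [0, 0, 1, 2, 3]
```

Only documents 0 and 1 are grouped, because they are the only pair both views place in the same cluster; documents on which the views disagree become singletons under the strict threshold.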
- Bradley P., Bennett K., Demiriz A. Constrained k-means clustering. Microsoft Research Technical Report MSR-TR-2000-65, 2000.
- Hotho A., Staab S., Stumme G. WordNet improves text document clustering. In Proc. of the SIGIR 2003 Semantic Web Workshop, 2003, 541--544.
- Gao J., Tan P. N., Cheng H. Semi-supervised clustering with partial background information. In Proc. of SIAM Int'l Conf. on Data Mining, Bethesda, MD, 2006.
- Mann H. B., Whitney D. R. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18, 1947, 50--60.
- Miller G. A. WordNet: a lexical database for English. Communications of the ACM, 38(11), 1995, 39--41.
- Sedding J., Kazakov D. WordNet-based text document clustering. In Proc. of the 3rd Workshop on Robust Methods in Analysis of Natural Language Data, 2004, 104--113.
- Steinbach M., Karypis G., Kumar V. A comparison of document clustering techniques. In Proc. of KDD Workshop on Text Mining, 2000.
- Termier A., Rousset M.-C., Sebag M. Combining statistics and semantics for word and document clustering. In Proc. of IJCAI, 2001, 49--54.
- Topchy A., Jain A. K., Punch W. A mixture model for clustering ensembles. In Proc. of SIAM Conference on Data Mining, 2004, 379--390.
- Wu Z., Palmer M. Verb semantics and lexical selection. In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994, 133--138.