DOI:10.1145/1529282.1529605
research-article

Combining statistics and semantics via ensemble model for document clustering

Published: 08 March 2009

ABSTRACT

Incorporating background knowledge into data mining algorithms is an important but challenging problem. Current approaches in semi-supervised learning require explicit knowledge provided by domain experts, and that knowledge is specific to the particular data set. In this study, we propose an ensemble model that couples two sources of information: statistical information derived from the data set, and sense information retrieved from WordNet, which is used to build a semantic binary model. We evaluated the efficacy of the combined ensemble model on the Reuters-21578 and 20newsgroups data sets.
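
The abstract does not say how the two sources are combined, so the following is only a minimal sketch of one generic way to merge two clusterings of the same documents, namely a co-association (evidence accumulation) consensus. It is not the authors' ensemble model. The label lists are made-up placeholders for a statistics-based partition and a WordNet-sense-based partition, and the function names and the `threshold` parameter are assumptions introduced for illustration.

```python
from itertools import combinations


def co_association(partitions, n):
    """Fraction of partitions in which each pair of documents shares a cluster."""
    counts = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus(partitions, threshold=1.0):
    """Link documents whose co-association reaches the threshold and
    return the connected components as consensus cluster labels."""
    n = len(partitions[0])
    co = co_association(partitions, n)
    parent = list(range(n))  # union-find forest over documents

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] >= threshold:
            parent[find(j)] = find(i)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hypothetical cluster labels for six documents (illustrative only).
statistical = [0, 0, 0, 1, 1, 1]  # e.g. from clustering term-weight vectors
semantic = [0, 0, 1, 1, 2, 2]     # e.g. from grouping on binary sense features

print(consensus([statistical, semantic]))       # [0, 0, 1, 2, 3, 3]
print(consensus([statistical, semantic], 0.5))  # [0, 0, 0, 0, 0, 0]
```

With `threshold=1.0` two documents stay together only if both partitions agree, which yields fine-grained clusters. Lowering the threshold to 0.5 accepts agreement from either source, and in this toy case that chains all six documents into one cluster.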


  • Published in

    SAC '09: Proceedings of the 2009 ACM symposium on Applied Computing
    March 2009
    2347 pages
    ISBN:9781605581668
    DOI:10.1145/1529282

    Copyright © 2009 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Acceptance Rates

    Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
