DOI: 10.1145/1571941.1571967
research-article

Enhancing cluster labeling using Wikipedia

Published: 19 July 2009

ABSTRACT

This work investigates cluster labeling enhancement by utilizing Wikipedia, the free on-line encyclopedia. We describe a general framework for cluster labeling that extracts candidate labels from Wikipedia in addition to important terms that are extracted directly from the text. The "labeling quality" of each candidate is then evaluated by several independent judges and the top evaluated candidates are recommended for labeling.
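The framework described above can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how terms are extracted, how Wikipedia is queried, or what the judges compute, so `important_terms`, `wikipedia_candidates`, the `lookup` dictionary, both toy judges, and the score averaging are all hypothetical stand-ins.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

# A judge scores one candidate label against the cluster's documents.
Judge = Callable[[str, Sequence[str]], float]


def important_terms(docs: Sequence[str], top_n: int = 5) -> List[str]:
    """Toy term extractor (assumption): most frequent tokens longer than 3 chars."""
    counts = Counter(
        tok for doc in docs for tok in doc.lower().split() if len(tok) > 3
    )
    return [term for term, _ in counts.most_common(top_n)]


def wikipedia_candidates(terms: Sequence[str], lookup: Dict[str, List[str]]) -> List[str]:
    """Hypothetical Wikipedia step: map terms to labels through a local dict.

    A real system would query a Wikipedia index; `lookup` only stands in for it.
    """
    found: List[str] = []
    for term in terms:
        for label in lookup.get(term, []):
            if label not in found:
                found.append(label)
    return found


def coverage_judge(label: str, docs: Sequence[str]) -> float:
    """Toy judge: fraction of documents sharing at least one word with the label."""
    words = set(label.lower().split())
    return sum(1 for d in docs if words & set(d.lower().split())) / len(docs)


def brevity_judge(label: str, docs: Sequence[str]) -> float:
    """Toy judge: prefers shorter labels (placeholder for a real quality signal)."""
    return 1.0 / len(label.split())


def recommend_labels(docs, lookup, judges: Sequence[Judge], k: int = 5) -> List[str]:
    """Pool text terms and Wikipedia labels, average the judges, return the top k."""
    terms = important_terms(docs)
    candidates = list(dict.fromkeys(terms + wikipedia_candidates(terms, lookup)))
    scores = {c: sum(j(c, docs) for j in judges) / len(judges) for c in candidates}
    return sorted(candidates, key=lambda c: scores[c], reverse=True)[:k]


docs = [
    "the python interpreter runs scripts",
    "python scripts need an interpreter",
    "debugging python scripts",
]
lookup = {"python": ["Python (programming language)"], "scripts": ["Scripting language"]}
top = recommend_labels(docs, lookup, [coverage_judge, brevity_judge], k=3)
```

Averaging is just one simple way to combine independent judges; the abstract does not state which aggregation the paper uses.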

Our experimental results reveal that the Wikipedia labels agree with the manual labels that humans associate with a cluster much more than significant terms extracted directly from the text do. We show that in most cases, even when the human-assigned label appears in the text, pure statistical methods have difficulty identifying it as a good descriptor. Furthermore, our experiments show that for more than 85% of the clusters in our test collection, the manual label (or an inflection or synonym of it) appears in the top five labels recommended by our system.
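The "top five" figure above is a hit rate over clusters: a cluster counts as a hit if its manual label is among the first k recommended labels. A minimal sketch of such a measure follows, assuming case-insensitive exact matching plus an optional, hand-supplied synonym table; the function name `match_at_k` and the data layout are hypothetical, and inflection handling is omitted.

```python
from typing import Dict, List, Optional


def match_at_k(
    recommended: Dict[str, List[str]],
    manual: Dict[str, str],
    k: int = 5,
    synonyms: Optional[Dict[str, List[str]]] = None,
) -> float:
    """Fraction of clusters whose manual label appears in the top-k recommendations.

    `synonyms` maps a manual label to alternative strings that also count as a
    match (an assumption standing in for real synonym/inflection resolution).
    """
    synonyms = synonyms or {}
    hits = 0
    for cluster_id, labels in recommended.items():
        gold = manual[cluster_id]
        accepted = {gold.lower()} | {s.lower() for s in synonyms.get(gold, [])}
        if any(label.lower() in accepted for label in labels[:k]):
            hits += 1
    return hits / len(recommended)


recommended = {
    "c1": ["Baseball", "sports", "game"],
    "c2": ["cars", "engine"],
}
manual = {"c1": "baseball", "c2": "Autos"}

strict = match_at_k(recommended, manual, k=5)                              # 0.5
lenient = match_at_k(recommended, manual, k=5, synonyms={"Autos": ["cars"]})  # 1.0
```

The gap between `strict` and `lenient` shows why the choice of what counts as a match (exact string, inflection, synonym) has to be stated alongside any reported number.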


Published in

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
July 2009
896 pages
ISBN: 9781605584836
DOI: 10.1145/1571941

Copyright © 2009 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%
