ABSTRACT
This work investigates how Wikipedia, the free online encyclopedia, can be used to enhance cluster labeling. We describe a general framework for cluster labeling that extracts candidate labels from Wikipedia in addition to important terms that are extracted directly from the text. The "labeling quality" of each candidate is then evaluated by several independent judges, and the top-ranked candidates are recommended for labeling.
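The framework described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `recommend_labels`, the two toy judges, and the mean-score aggregation are all assumptions made for the example, and the Wikipedia candidates are assumed to have been retrieved already by some separate step.

```python
from typing import Callable, List, Sequence

# A judge maps (candidate label, cluster documents) to a score in [0, 1].
# The judges below are hypothetical stand-ins, not the ones used in the paper.
Judge = Callable[[str, Sequence[str]], float]


def doc_frequency_judge(label: str, docs: Sequence[str]) -> float:
    """Toy judge: fraction of documents containing the label as a substring."""
    if not docs:
        return 0.0
    needle = label.lower()
    return sum(needle in doc.lower() for doc in docs) / len(docs)


def word_coverage_judge(label: str, docs: Sequence[str]) -> float:
    """Toy judge: fraction of the label's words found in the cluster vocabulary."""
    words = label.lower().split()
    if not words:
        return 0.0
    vocabulary = {token for doc in docs for token in doc.lower().split()}
    return sum(word in vocabulary for word in words) / len(words)


def recommend_labels(
    text_terms: Sequence[str],
    wikipedia_labels: Sequence[str],
    docs: Sequence[str],
    judges: Sequence[Judge],
    top_k: int = 5,
) -> List[str]:
    """Pool both candidate sources, score each with every judge, keep the top k."""
    if not judges:
        raise ValueError("at least one judge is required")

    # Merge the two candidate sources, dropping exact duplicates in order.
    candidates = list(dict.fromkeys([*text_terms, *wikipedia_labels]))

    # Assumed aggregation: the plain mean of the independent judges' scores.
    def mean_score(label: str) -> float:
        return sum(judge(label, docs) for judge in judges) / len(judges)

    # sorted() is stable, so ties keep their original candidate order.
    return sorted(candidates, key=mean_score, reverse=True)[:top_k]
```

A call such as `recommend_labels(terms, wiki_labels, docs, [doc_frequency_judge, word_coverage_judge])` returns at most five labels. Other aggregation schemes, such as rank-based voting, would fit the same interface.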
Our experimental results reveal that Wikipedia labels agree with the labels that humans manually assign to a cluster much more closely than do significant terms extracted directly from the text. We show that in most cases, even when the human-assigned label appears in the text, purely statistical methods have difficulty identifying it as a good descriptor. Furthermore, our experiments show that for more than 85% of the clusters in our test collection, the manual label (or an inflection or synonym of it) appears among the top five labels recommended by our system.
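The top-five figure reported above corresponds to a simple hit-rate computation, sketched below. The function name `match_at_k` and the `equivalents` mapping (caller-supplied inflections and synonyms for each manual label) are assumptions introduced for illustration; the abstract does not say how inflections and synonyms were identified.

```python
from typing import Dict, Mapping, Optional, Sequence


def match_at_k(
    recommended: Mapping[str, Sequence[str]],
    manual_labels: Mapping[str, str],
    k: int = 5,
    equivalents: Optional[Mapping[str, Sequence[str]]] = None,
) -> float:
    """Fraction of clusters whose manual label appears in the top-k recommendations.

    `equivalents` is a hypothetical lookup from a manual label to the
    inflections and synonyms that should also count as a match.
    """
    if not manual_labels:
        raise ValueError("manual_labels must not be empty")
    equivalents = equivalents or {}

    hits = 0
    for cluster_id, manual in manual_labels.items():
        # Accept the manual label itself plus any listed variants.
        accepted = {manual.lower()}
        accepted.update(v.lower() for v in equivalents.get(manual, ()))

        # Only the first k recommended labels are considered.
        top_k = [label.lower() for label in recommended.get(cluster_id, ())[:k]]
        if any(label in accepted for label in top_k):
            hits += 1

    return hits / len(manual_labels)
```

Comparison here is case-insensitive exact matching, which is a simplification; a real evaluation would likely need stemming or a thesaurus to decide what counts as an inflection or synonym.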
Index Terms
- Enhancing cluster labeling using Wikipedia