research-article

Enhancing cluster labeling using wikipedia

Authors:
David Carmel

IBM Haifa Research Lab, Haifa, Israel

IBM Haifa Research Lab, Haifa, Israel
View Profile

,
Haggai Roitman

IBM Haifa Research Lab, Haifa, Israel

IBM Haifa Research Lab, Haifa, Israel
View Profile

,
Naama Zwerdling

IBM Haifa Research Lab, Haifa, Israel

IBM Haifa Research Lab, Haifa, Israel
View Profile

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrievalJuly 2009Pages 139–146https://doi.org/10.1145/1571941.1571967

Published:19 July 2009Publication History

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

Pages 139–146

ABSTRACT

This work investigates cluster labeling enhancement by utilizing Wikipedia, the free on-line encyclopedia. We describe a general framework for cluster labeling that extracts candidate labels from Wikipedia in addition to important terms that are extracted directly from the text. The "labeling quality" of each candidate is then evaluated by several independent judges and the top evaluated candidates are recommended for labeling.

Our experimental results reveal that the Wikipedia labels agree with manual labels associated by humans to a cluster, much more than with significant terms that are extracted directly from the text. We show that in most cases even when human's associated label appears in the text, pure statistical methods have difficulty in identifying them as good descriptors. Furthermore, our experiments show that for more than 85% of the clusters in our test collection, the manual label (or an inflection, or a synonym of it) appears in the top five labels recommended by our system.

References

20 News Group (20NG) data. http://people.csail.mit.edu/jrennie/20newsgroups.Google Scholar
T. Brants and A. Franz. Web 1T 5-gram Version 1. 2006.Google Scholar
D. Carmel, E. Yom-Tov, A. Darlow, and D. Pelleg. What makes a query difficult? In SIGIR '06, pages 390--397. ACM Press, 2006. Google ScholarDigital Library
O.S. Chin, N. Kulathuramaiyer, and A.W. Yeo. Automatic discovery of concepts from text. In WI '06, pages 1046--1049, Washington, DC, USA, 2006. IEEE Computer Society. Google ScholarDigital Library
R. Cilibrasi and P.M.B. Vitányi. The google similarity distance. IEEE Transactions on Knowledge and Data Engineering, 19(3):370--383, 2007. Google ScholarDigital Library
D.R. Cutting, D.R. Karger, J.O. Pedersen, and J.W. Tukey. Scatter/gather: a cluster-based approach to browsing large document collections. In SIGIR '92, pages 318--329, New York, NY, USA, 1992. ACM. Google ScholarDigital Library
W. de Winter and M. de Rijke. Identifying facets in query-biased sets of blog posts. In ICWSM'07, pages 251--254, 2007.Google Scholar
E. Gabrilovich and S. Markovitch. Overcoming the brittleness bottleneck using wikipedia: Enhancing text categorization with encyclopedic knowledge. In AAAI '06, pages 1301--1306, Boston, MA, 2006. Google ScholarDigital Library
E. Gabrilovich and S. Markovitch. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In IJCAI '07, pages 1606--1611, Hyderabad, India, 2007. Google ScholarDigital Library
F. Geraci, M. Maggini, M. Pellegrini, and F. Sebastiani. Cluster generation and cluster labelling for web snippets:a fast and accurate hierarchical solution. Internet Mathematics, 2007.Google Scholar
E. Glover, D.M. Pennock, S. Lawrence, and R. Krovetz. Inferring hierarchical descriptions. In CIKM '02, pages 507--514, New York, NY, USA, 2002. ACM. Google ScholarDigital Library
J. Hu, L. Fang, Y. Cao, H.-J. Zeng, H. Li, Q. Yang, and Z. Chen. Enhancing text clustering by leveraging wikipedia semantics. In SIGIR '08, pages 179--186, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
C.D. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. Google ScholarDigital Library
Open Directory Project (ODP). http://www.dmoz.org/.Google Scholar
S. Osinski and D. Weiss. A concept-driven algorithm for clustering search results. IEEE Intelligent Systems, 20(3):48--54, 2005. Google ScholarDigital Library
D.R. Radev, H. Jing, M. Styś, and D. Tam. Centroid-based summarization of multiple documents. Information Processing Management, 40(6):919--938, 2004. Google ScholarDigital Library
P. Schönhofen. Identifying document topics using the wikipedia category network. In WI '06, pages 456--462, 2006. Google ScholarDigital Library
M. Strube and S.P. Ponzetto. Wikirelate! computing semantic relatedness using wikipedia. July 2006.Google Scholar
Z.S. Syed, T. Finin, and A. Joshi. Wikipedia as an ontology for describing documents. In ICWSM '08, 2008.Google Scholar
H. Toda and R. Kataoka. A clustering method for news articles retrieval system. In WWW '05, pages 988--989, New York, NY, USA, 2005. ACM. Google ScholarDigital Library
P. Treeratpituk and J. Callan. Automatically labeling hierarchical clusters. In DG.O '06, pages 167--176, New York, NY, USA, 2006. ACM. Google ScholarDigital Library

Index Terms

Enhancing cluster labeling using wikipedia
1. Information systems
  1. Information retrieval

Recommendations

A fusion approach to cluster labeling
SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval

We present a novel approach to the cluster labeling task using fusion methods. The core idea of our approach is to weigh labels, suggested by any labeler, according to the estimated labeler's decisiveness with respect to each of its suggested labels. We ...
Read More
Two-stage approach to named entity recognition using Wikipedia and DBpedia
IMCOM '17: Proceedings of the 11th International Conference on Ubiquitous Information Management and Communication

In natural language understanding, extraction of named entity (NE) mentions in given text and classification of the mentions into pre-defined NE types are important processes. Most NE recognition (NER) relies on resources such as a training corpus or NE ...
Read More
Learning multilingual named entity recognition from Wikipedia

We automatically create enormous, free and multilingual silver-standard training annotations for named entity recognition (ner) by exploiting the text and structure of Wikipedia. Most ner systems rely on statistical models of annotated data to identify ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
July 2009
896 pages
ISBN:9781605584836
DOI:10.1145/1571941
General Chairs:
James Allan
University of Massachusetts Amherst, USA
,
Javed Aslam
Northeastern University, USA
,
Program Chairs:
Mark Sanderson
University of Sheffield, UK
,
ChengXiang Zhai
University of Illinois at Urbana-Champaign, USA
,
Justin Zobel
University of Melbourne, Australia
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 19 July 2009
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
cluster labeling
wikipedia
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate792of3,983submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 102
  Total Citations
  View Citations
- 2,251
  Total Downloads
- Downloads (Last 12 months)24
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Enhancing cluster labeling using wikipedia

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

ABSTRACT

References

Cited By

Index Terms

Recommendations

A fusion approach to cluster labeling

Two-stage approach to named entity recognition using Wikipedia and DBpedia

Learning multilingual named entity recognition from Wikipedia