SemTag and seeker: bootstrapping the semantic web via automated semantic annotation

Authors:
Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, R. Guha, Anant Jhingran, Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John A. Tomlin, Jason Y. Zien

All authors: IBM Almaden Research Center, San Jose, CA

WWW '03: Proceedings of the 12th international conference on World Wide Web, May 2003, Pages 178–186
https://doi.org/10.1145/775152.775178

Published: 20 May 2003

ABSTRACT

This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest-scale semantic tagging effort to date.

We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.
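To make the idea of an "automatically disambiguated semantic tag" concrete, here is a minimal sketch of context-based disambiguation. It is not the paper's algorithm, which the abstract does not specify: the toy `ONTOLOGY`, the entity identifiers, the bag-of-words cosine score, and the `window` and `threshold` parameters are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical mini-ontology (invented for this sketch): each ambiguous
# label maps to candidate entities, each described by a few context words.
ONTOLOGY = {
    "jaguar": {
        "animal/Jaguar": "cat predator jungle species wildlife",
        "car/Jaguar": "car engine luxury sedan dealer",
    },
}


def tokens(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def tag(text, window=10, threshold=0.1):
    """Return (position, label, entity) for each label that can be resolved.

    `window` (words of context on each side) and `threshold` (minimum
    similarity needed to emit a tag) are assumed values, not the paper's.
    """
    words = tokens(text)
    tags = []
    for i, word in enumerate(words):
        candidates = ONTOLOGY.get(word)
        if not candidates:
            continue
        # Context = words around the spotted label, excluding the label itself.
        context = Counter(words[max(0, i - window):i] + words[i + 1:i + 1 + window])
        scored = [
            (cosine(context, Counter(tokens(description))), entity)
            for entity, description in candidates.items()
        ]
        score, entity = max(scored)
        if score >= threshold:  # stay silent when the context is uninformative
            tags.append((i, word, entity))
    return tags
```

Calling `tag("The jaguar dealer sold a luxury sedan with a new engine")` resolves the label to the hypothetical `car/Jaguar` entry, while a sentence with no informative context yields no tag at all. Declining to tag when the evidence is weak is one simple way to trade coverage for precision at large scale.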

Published in
WWW '03: Proceedings of the 12th international conference on World Wide Web
May 2003
772 pages
ISBN: 1581136803
DOI: 10.1145/775152
Conference Chairs: Gusztáv Hencsey (MTA SZTAKI, Hungary), Bebo White (Stanford Linear Accelerator Center, USA)
Program Chairs: Yih-Farn Robin Chen (AT&T Labs -- Research, USA), László Kovács (MTA SZTAKI, Hungary), Steve Lawrence (Google Inc., USA)
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
automated semantic tagging
data mining
information retrieval
large text datasets
text analytics
Qualifiers
- Article

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%