ABSTRACT
This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest-scale semantic tagging effort to date. We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.
Index Terms
- SemTag and seeker: bootstrapping the semantic web via automated semantic annotation