DOI: 10.1145/775152.775178

SemTag and Seeker: bootstrapping the semantic web via automated semantic annotation

Published: 20 May 2003

ABSTRACT

This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest scale semantic tagging effort to date.

We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.

Published in

WWW '03: Proceedings of the 12th international conference on World Wide Web
May 2003
772 pages
ISBN: 1581136803
DOI: 10.1145/775152

Copyright © 2003 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 1,899 of 8,196 submissions (23%)
