DOI: 10.1145/2479832.2479845
research-article
Open Access

Unsupervised wrapper induction using linked data

Published: 23 June 2013

ABSTRACT

This work explores the use of Linked Data for Web-scale Information Extraction and shows encouraging results on the task of Wrapper Induction. We propose a simple knowledge-based method that (i) is highly flexible with respect to different domains and (ii) does not require any training material, but instead exploits Linked Data as a background knowledge source to build essential learning resources. The major contribution of this work is a study of how Linked Data, an imprecise, redundant and large-scale knowledge resource, can be used to support Web-scale Information Extraction effectively and efficiently, together with an identification of the challenges involved. We show that, for domains that are covered, Linked Data serve as a powerful knowledge resource for Information Extraction. Experiments on a publicly available dataset demonstrate that, under certain conditions, this simple unsupervised approach can achieve results competitive with some complex state-of-the-art methods that always depend on training data.

    • Published in

      K-CAP '13: Proceedings of the seventh international conference on Knowledge capture
      June 2013
      160 pages
      ISBN:9781450321020
      DOI:10.1145/2479832

      Copyright © 2013 Owner/Author

      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      K-CAP '13 Paper Acceptance Rate: 13 of 60 submissions, 22%
      Overall Acceptance Rate: 55 of 198 submissions, 28%
