DOI:10.1145/872757.872799
Article

Extracting structured data from Web pages

Published: 09 June 2003

ABSTRACT

Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
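
To make the input/output contract described in the abstract concrete, here is a toy sketch. It is not the paper's algorithm: it assumes that tokens shared, in order, by every page belong to the template and that everything between them is data. The function names (`infer_template`, `extract_values`) and the sample pages are hypothetical, and the token alignment relies on Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def common_tokens(a, b):
    """Return the tokens of `a` that SequenceMatcher aligns with `b`, in order."""
    sm = SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for i, _, n in sm.get_matching_blocks():
        out.extend(a[i:i + n])
    return out


def infer_template(pages):
    """Assumption: the template is whatever token sequence all pages share."""
    template = pages[0].split()
    for page in pages[1:]:
        template = common_tokens(template, page.split())
    return template


def extract_values(template, page):
    """Return the token runs of `page` that fall between template tokens."""
    tokens = page.split()
    sm = SequenceMatcher(None, template, tokens, autojunk=False)
    values, prev_end = [], 0
    for _, j, n in sm.get_matching_blocks():
        if j > prev_end:  # gap in the page that the template does not cover
            values.append(" ".join(tokens[prev_end:j]))
        prev_end = j + n
    return values


# Hypothetical pages generated from one template with two fields.
pages = [
    "<b> Title: </b> Databases <b> Author: </b> Jane Doe",
    "<b> Title: </b> Compilers <b> Author: </b> John Roe",
]

template = infer_template(pages)
for page in pages:
    print(extract_values(template, page))
```

A sketch like this breaks down quickly on real pages: a data value that happens to be identical on every page is mistaken for template text, and optional or repeated fields are not modeled at all.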

      • Published in

        SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data
        June 2003
        702 pages
        ISBN:158113634X
        DOI:10.1145/872757

        Copyright © 2003 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        SIGMOD '03 Paper Acceptance Rate: 53 of 342 submissions, 15%
        Overall Acceptance Rate: 785 of 4,003 submissions, 20%
