DOI: 10.1145/775152.775179
Article

Data extraction and label assignment for web databases

Published: 20 May 2003

ABSTRACT

Many tools have been developed to help users query, extract and integrate data from web pages generated dynamically from databases, i.e., from the Hidden Web. A key prerequisite for such tools is to obtain the schema of the attributes of the retrieved data. In this paper, we describe a system called DeLa, which reconstructs (part of) a "hidden" back-end web database. It does this by sending queries through HTML forms, automatically generating regular expression wrappers to extract data objects from the result pages, and restoring the retrieved data into an annotated (labelled) table. The whole process needs no human involvement and proves to be fast (less than one minute of wrapper induction per site) and accurate (over 90% correctness for data extraction and around 80% correctness for label assignment).
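The extraction and tabulation steps named in the abstract can be illustrated with a minimal Python sketch. It is not DeLa's implementation: the abstract says DeLa induces its wrappers and assigns labels automatically, whereas here the regular expression `WRAPPER`, the label tuple `LABELS`, the sample markup in `RESULT_PAGE`, and the helper names are all hand-written, hypothetical stand-ins chosen only to show what applying a regular expression wrapper to a result page and filling a labelled table might look like.

```python
import re

# Made-up result page; the markup and the two records are illustrative only.
RESULT_PAGE = """
<ul>
<li><b>Book One</b> <i>A. Author</i> $10.00</li>
<li><b>Book Two</b> <i>B. Writer</i> $12.50</li>
</ul>
"""

# Hand-written wrapper (assumption): one capture group per attribute of a data object.
WRAPPER = re.compile(
    r"<li><b>(?P<title>[^<]+)</b>\s*"
    r"<i>(?P<author>[^<]+)</i>\s*"
    r"\$(?P<price>\d+\.\d{2})</li>"
)

# Hand-assigned column labels (assumption): they mirror the group names above.
LABELS = ("title", "author", "price")


def extract_objects(page, wrapper=WRAPPER):
    """Return one dict per data object the wrapper matches in the page."""
    return [match.groupdict() for match in wrapper.finditer(page)]


def to_labelled_table(objects, labels=LABELS):
    """Return a table whose first row holds the labels and whose other rows hold the data."""
    header = list(labels)
    rows = [[obj[label] for label in labels] for obj in objects]
    return [header] + rows


if __name__ == "__main__":
    for row in to_labelled_table(extract_objects(RESULT_PAGE)):
        print(" | ".join(row))
```

Calling `to_labelled_table(extract_objects(RESULT_PAGE))` yields a header row followed by one row per matched record; a page whose markup differs from the assumed pattern would simply produce no data rows.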


        • Published in

          WWW '03: Proceedings of the 12th international conference on World Wide Web
          May 2003
          772 pages
          ISBN: 1581136803
          DOI: 10.1145/775152

          Copyright © 2003 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
