Article

Data extraction and label assignment for web databases

Authors:
Jiying Wang, University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
Fred H. Lochovsky, University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong

WWW '03: Proceedings of the 12th international conference on World Wide Web, May 2003, Pages 187–196, https://doi.org/10.1145/775152.775179

Published: 20 May 2003

ABSTRACT

Many tools have been developed to help users query, extract and integrate data from web pages generated dynamically from databases, i.e., from the Hidden Web. A key prerequisite for such tools is to obtain the schema of the attributes of the retrieved data. In this paper, we describe a system called DeLa, which reconstructs (part of) a "hidden" back-end web database. It does this by sending queries through HTML forms, automatically generating regular expression wrappers to extract data objects from the result pages, and restoring the retrieved data into an annotated (labelled) table. The whole process needs no human involvement and proves to be fast (less than one minute for wrapper induction for each site) and accurate (over 90% correctness for data extraction and around 80% correctness for label assignment).
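
As a rough illustration of the extraction step the abstract describes, the sketch below applies a regular expression wrapper to a result page and collects the matched data objects into a labelled table. It is a minimal sketch under stated assumptions, not DeLa itself: the sample page, the hand-written pattern `WRAPPER`, the field names `f1` to `f3`, and the `LABELS` mapping are all hypothetical, whereas the paper's system induces the wrapper and assigns the labels automatically.

```python
import re

# Hypothetical result page, standing in for what a form query might return.
RESULT_PAGE = """
<ul>
<li><b>Toy Book A</b> <i>A. Author</i> <span>$10.00</span></li>
<li><b>Toy Book B</b> <i>B. Writer</i> <span>$12.50</span></li>
</ul>
"""

# Hand-written wrapper (an assumption: the paper generates wrappers automatically).
# Each named group captures one attribute of a data object.
WRAPPER = re.compile(
    r"<li><b>(?P<f1>[^<]*)</b>\s*"
    r"<i>(?P<f2>[^<]*)</i>\s*"
    r"<span>(?P<f3>[^<]*)</span></li>"
)

# Hypothetical labels for the extracted columns.
LABELS = {"f1": "Title", "f2": "Author", "f3": "Price"}


def extract_table(page, wrapper, labels):
    """Return (header, rows): one row per match, one column per named group."""
    # Order the group names by their position in the pattern.
    names = sorted(wrapper.groupindex, key=wrapper.groupindex.get)
    header = [labels[name] for name in names]
    rows = [[match.group(name) for name in names]
            for match in wrapper.finditer(page)]
    return header, rows


header, rows = extract_table(RESULT_PAGE, WRAPPER, LABELS)
print(header)
for row in rows:
    print(row)
```

Keeping the wrapper and the labels as separate inputs mirrors the two stages named in the abstract: data extraction first, label assignment afterwards.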

Published in
WWW '03: Proceedings of the 12th international conference on World Wide Web
May 2003
772 pages
ISBN: 1581136803
DOI: 10.1145/775152
Conference Chairs: Gusztáv Hencsey (MTA SZTAKI, Hungary), Bebo White (Stanford Linear Accelerator Center, USA)
Program Chairs: Yih-Farn Robin Chen (AT&T Labs -- Research, USA), László Kovács (MTA SZTAKI, Hungary), Steve Lawrence (Google Inc., USA)
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
HTML forms
automatic wrapper induction
data annotation
hidden web
information integration
web information extraction
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%