Extracting structured data from Web pages

Authors:
Arvind Arasu, Stanford University, Palo Alto, CA
Hector Garcia-Molina, Stanford University, Palo Alto, CA

SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 2003, Pages 337–348
https://doi.org/10.1145/872757.872799

Published: 09 June 2003

ABSTRACT

Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title,...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
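
To make the setting in the abstract concrete, here is a minimal Python sketch of template deduction and value extraction on toy pages. It is not the algorithm from the paper: it simply treats the tokens shared, in order, by every page as the template and the token runs between them as values. The function names, the whitespace tokenizer, and the sample pages are all assumptions made for illustration, and the approach would fail whenever a data value happens to repeat across all pages or to coincide with a template token.

```python
from difflib import SequenceMatcher


def tokenize(page):
    # Assumption: whitespace-separated tokens are a good enough unit here.
    return page.split()


def common_tokens(a, b):
    # Tokens of `a` that difflib aligns with `b`, kept in their original order.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    shared = []
    for block in matcher.get_matching_blocks():
        shared.extend(a[block.a:block.a + block.size])
    return shared


def infer_template(pages):
    # Fold the pairwise alignment over all pages; what survives is the guess
    # at the template.
    token_lists = [tokenize(p) for p in pages]
    template = token_lists[0]
    for tokens in token_lists[1:]:
        template = common_tokens(template, tokens)
    return template


def extract(page, template):
    # Walk the page; tokens that do not match the next expected template
    # token are collected as data values.
    values, current, i = [], [], 0
    for tok in tokenize(page):
        if i < len(template) and tok == template[i]:
            if current:
                values.append(" ".join(current))
                current = []
            i += 1
        else:
            current.append(tok)
    if current:
        values.append(" ".join(current))
    return values


# Hypothetical pages generated from one layout with different database rows.
pages = [
    "<h1> Book </h1> <b> Title: </b> Dune <b> Author: </b> Frank Herbert",
    "<h1> Book </h1> <b> Title: </b> Emma <b> Author: </b> Jane Austen",
    "<h1> Book </h1> <b> Title: </b> Ulysses <b> Author: </b> James Joyce",
]

template = infer_template(pages)
rows = [extract(p, template) for p in pages]
print(template)
print(rows)
```

The sketch needs no labeled examples, which mirrors the problem statement, but it assumes a flat template with no optional or repeated fields; handling those is where a real method has to do substantially more work.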

Index Terms

Extracting structured data from Web pages
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining

Published in
SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data
June 2003
702 pages
ISBN: 158113634X
DOI: 10.1145/872757
Conference Chair: Zachary Ives, University of Pennsylvania
General Chair: Yannis Papakonstantinou, University of California, San Diego
Program Chair: Alon Halevy, University of Washington
Copyright © 2003 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- Article
Conference

Acceptance Rates
SIGMOD '03 Paper Acceptance Rate: 53 of 342 submissions, 15%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%