ABSTRACT
Many web sites contain large sets of pages generated from a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way on all its book pages. The values used to generate the pages (e.g., the author and title) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes as input a set of template-generated pages, deduces the unknown template used to generate them, and extracts as output the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
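To make the input/output contract concrete, the following is a toy sketch in Python and not the algorithm of this paper. It rests on a strong simplifying assumption that is ours, not the paper's: a token belongs to the template if it occurs exactly once in every input page, and template tokens appear in the same order on every page. The function names (`tokenize`, `infer_template`, `extract_values`) and the two sample pages are hypothetical and chosen only for illustration.

```python
import re
from typing import List


def tokenize(page: str) -> List[str]:
    # Split a page into HTML tags and whitespace-delimited words.
    return re.findall(r"<[^>]+>|[^<\s]+", page)


def infer_template(pages: List[str]) -> List[str]:
    # ASSUMPTION (illustrative only): a token is part of the template
    # if it occurs exactly once in every page. Order is taken from the
    # first page and assumed identical on all pages.
    token_lists = [tokenize(p) for p in pages]
    return [
        tok for tok in token_lists[0]
        if all(toks.count(tok) == 1 for toks in token_lists)
    ]


def extract_values(page: str, template: List[str]) -> List[str]:
    # Walk the page; tokens that fall between consecutive template
    # tokens are joined into one extracted value. Empty gaps are skipped.
    values: List[str] = []
    current: List[str] = []
    i = 0
    for tok in tokenize(page):
        if i < len(template) and tok == template[i]:
            if current:
                values.append(" ".join(current))
                current = []
            i += 1
        else:
            current.append(tok)
    if current:
        values.append(" ".join(current))
    return values


# Two hypothetical pages generated from the same layout.
pages = [
    "<html><body><h1>Moby Dick</h1><p>by Herman Melville</p></body></html>",
    "<html><body><h1>Emma</h1><p>by Jane Austen</p></body></html>",
]
template = infer_template(pages)
rows = [extract_values(p, template) for p in pages]
# rows == [["Moby Dick", "Herman Melville"], ["Emma", "Jane Austen"]]
```

This heuristic breaks as soon as a data value happens to occur once on every page, a template token repeats within a page, or a field is optional. Cases like these are why a formal definition of a template and of the encoding model is needed.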