ABSTRACT
Many tools have been developed to help users query, extract, and integrate data from web pages generated dynamically from databases, i.e., from the Hidden Web. A key prerequisite for such tools is obtaining the schema of the attributes of the retrieved data. In this paper, we describe a system called DeLa, which reconstructs (part of) a "hidden" back-end web database. It does this by sending queries through HTML forms, automatically generating regular expression wrappers to extract data objects from the result pages, and restoring the retrieved data into an annotated (labelled) table. The whole process requires no human involvement and proves to be fast (less than one minute of wrapper induction per site) and accurate (over 90% correctness for data extraction and around 80% correctness for label assignment).
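To make the extraction and labelling steps concrete, the following is a minimal sketch of applying a regular expression wrapper to a result page and storing the matches as a labelled table. It is not DeLa's method: the result page, the wrapper pattern, the column labels, and the function names `extract_table` and `assign_labels` are all hypothetical, and both the wrapper and the labels are written by hand here, whereas the system described above derives them automatically.

```python
import re

# Hypothetical result page, standing in for what a form query might return.
RESULT_PAGE = """
<html><body>
<ul>
<li><b>Database Systems</b> <i>J. Smith</i> $45.00</li>
<li><b>Web Mining</b> <i>A. Lee</i> $30.50</li>
</ul>
</body></html>
"""

# Hand-written wrapper (an assumption for illustration only).
# Each capture group corresponds to one attribute of a data object.
WRAPPER = re.compile(
    r"<li><b>(?P<col1>[^<]*)</b>\s*<i>(?P<col2>[^<]*)</i>\s*(?P<col3>[^<]*)</li>"
)


def extract_table(page, wrapper):
    """Return one row (a tuple of strings) per data object the wrapper matches."""
    return [
        tuple(field.strip() for field in match.groups())
        for match in wrapper.finditer(page)
    ]


def assign_labels(rows, labels):
    """Attach a label to each column, producing an annotated table."""
    return [dict(zip(labels, row)) for row in rows]


rows = extract_table(RESULT_PAGE, WRAPPER)
# Labels are supplied by hand in this sketch.
table = assign_labels(rows, ["title", "author", "price"])

for record in table:
    print(record)
```

Each regular expression match yields one row, and each capture group yields one column, so the labelled result has the shape of a relational table.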
Index Terms
- Data extraction and label assignment for web databases