ABSTRACT
The Web, the world's largest unstructured database, has greatly improved access to documents. However, documents on the Web are largely disorganized, and the distributed nature of the World Wide Web makes it difficult to use as a tool for information and knowledge management. Users facing the difficult task of exploring the Web therefore need intelligent support. This paper proposes an approach to document discovery that builds on a comprehensive framework for ontology-focused crawling of Web documents. The framework includes means for using a complex ontology and its associated instance elements, and it defines several relevance computation strategies. An empirical evaluation of the framework has shown promising results.