ABSTRACT
This work aims to provide a page segmentation algorithm that uses both visual and content information to extract the semantic structure of a web page. Visual information is exploited through the VIPS algorithm, and content information through a pre-trained Naive Bayes classifier. The output of the algorithm is a semantic structure tree whose leaves represent segments that each cover a single topic. The contents of a leaf segment may, however, be physically distributed across the web page. This structure can be useful in many web applications, such as information retrieval, information extraction, and automatic web page adaptation. Because it uses both content and visual information, the algorithm is expected to outperform existing page segmentation algorithms.
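To make the idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the visual step (VIPS in the abstract) has already produced a list of blocks, represented here as hypothetical `(block_id, text)` pairs, and it substitutes a toy multinomial Naive Bayes classifier trained on four made-up sentences for the pre-trained classifier. The names `TinyNaiveBayes` and `group_blocks_by_topic` are illustrative, and the result is a flat topic-to-blocks mapping rather than the full semantic structure tree.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Toy multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(docs, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def group_blocks_by_topic(blocks, classifier):
    """Map each predicted topic to the ids of the visual blocks assigned to it."""
    groups = defaultdict(list)
    for block_id, text in blocks:
        groups[classifier.predict(text)].append(block_id)
    return dict(groups)


# Hypothetical training data standing in for a pre-trained classifier.
train_docs = [
    "football match goal team league",
    "tennis player match tournament",
    "stock market shares price investors",
    "bank interest rates loan market",
]
train_labels = ["sports", "sports", "finance", "finance"]
classifier = TinyNaiveBayes().fit(train_docs, train_labels)

# Hypothetical visual blocks, in page order, as a VIPS-like step might emit them.
blocks = [
    ("b1", "team wins league match"),
    ("b2", "shares price falls"),
    ("b3", "player scores goal"),
]
topic_tree = group_blocks_by_topic(blocks, classifier)
print(topic_tree)
```

In this toy run, blocks `b1` and `b3` end up under the same topic even though `b2` sits between them on the page, which mirrors the abstract's remark that a leaf segment's contents may be physically distributed.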
- Beeferman, D., Berger, A., and Lafferty, J. Statistical models for text segmentation. Machine Learning, 34(1-3):177--210, 1999.
- Cai, D., Yu, S., Wen, J.-R., and Ma, W.-Y. Extracting content structure for web pages based on visual representation. Proc. 5th Asia Pacific Web Conf., Xi'an, China, 2003.
- Mitchell, T. Machine Learning. McGraw-Hill, NY, 1997.
- Open directory project. http://dmoz.org/
Index Terms
- Extracting semantic structure of web documents using content and visual information