ABSTRACT
The continued development and maturation of advanced HTML features such as Cascading Style Sheets (CSS), JavaScript, and AJAX, as well as their widespread adoption by browsers, have enabled web pages to flourish with sophistication and interactivity. Unfortunately, this presents challenges to the web search community, as a web page's representation in the browser (i.e., what users see) can diverge dramatically from its raw HTML content (i.e., what search engines index and retrieve). For example, interactive pages may contain content in regions that are not visible before a user action, such as focusing a tab, but that are nonetheless contained within the raw HTML. We study this divergence by comparing raw HTML to its fully rendered form across a number of metrics spanning presentation, geometry, and content, using a large, representative sample of popular web pages. We find that a large divergence currently exists, and we show via a historical analysis that this divergence has grown more pronounced over the last decade. The general finding of our study is that continuing to index the web via simple HTML parsing will diminish the effectiveness of retrieval on the modern web, and that the IR community should work toward more sophisticated web page processing in indexing technology.
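The hidden-tab example in the abstract can be illustrated with a small sketch. It is not the study's methodology: it only contrasts naive text extraction from raw HTML with a crude approximation of visibility that skips elements whose inline style sets `display: none`. A faithful comparison would need a full rendering engine, since content can also be hidden through stylesheets or scripts. The names `TextExtractor`, `extract_text`, and `PAGE`, as well as the sample markup, are hypothetical and introduced purely for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextExtractor(HTMLParser):
    """Collects text, optionally skipping subtrees hidden by an inline style.

    Assumption: only inline `style="display: none"` is treated as hidden.
    """

    def __init__(self, skip_hidden):
        super().__init__()
        self.skip_hidden = skip_hidden
        self.hidden_depth = 0  # greater than 0 while inside a hidden subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        hides = self.skip_hidden and "display:none" in style
        if self.hidden_depth > 0 or hides:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.hidden_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html, skip_hidden):
    """Return the page text, with or without inline-hidden regions."""
    parser = TextExtractor(skip_hidden)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page with two tabs; the second is hidden until the user selects it.
PAGE = """
<div id="tabs">
  <div id="tab-1">Shipping is free on all orders.</div>
  <div id="tab-2" style="display: none">Returns accepted within 30 days.</div>
</div>
"""

raw_text = extract_text(PAGE, skip_hidden=False)     # what a simple indexer sees
visible_text = extract_text(PAGE, skip_hidden=True)  # rough proxy for what a user sees

print(raw_text)
print(visible_text)
```

In this sketch the raw extraction contains the returns policy while the visibility-aware extraction does not, which is a toy instance of the gap between indexed content and rendered content that the abstract describes.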
Index Terms
- The downside of markup: examining the harmful effects of CSS and javascript on indexing today's web