skip to main content
10.1145/2396761.2398558acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
short-paper

The downside of markup: examining the harmful effects of CSS and javascript on indexing today's web

Published:29 October 2012Publication History

ABSTRACT

The continued development and maturation of advanced HTML features such as Cascading style sheets (CSS), Javascript, and AJAX, as well as their widespread adoption by browsers, has enabled web pages to flourish with sophistication and interactivity. Unfortunately, this presents challenges to the web search community, as a web page's representation in the browser (i.e., what users see) can diverge dramatically from its raw HTML content (i.e., what search engines index and retrieve). For example, interactive pages may contain content in regions that are not visible before a user action, such as focusing a tab, but which are nonetheless still contained within the raw HTML. We study this divergence by comparing raw HTML to its fully rendered form across a number of metrics spanning presentation, geometry, and content, using a large, representative sample of popular web pages. We find that a large divergence currently exists, and we show via a historical analysis that this divergence has grown more pronounced over the last decade. The general finding of our study is that continuing to index the web via simple HTML parsing will diminish the effectiveness of retrieval on the modern web, and that the IR community should work toward more sophisticated web page processing in indexing technology.

References

  1. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW '98, pages 107--117, Amsterdam, The Netherlands, 1998. Elsevier Science Publishers B. V. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. C. Clarke, N. Craswell, and I. Soboroff. Overview of the trec 2009 web track. In Proceedings of TREC 2009, 2010.Google ScholarGoogle Scholar
  3. D. Fernandes, E. S. de Moura, A. S. da Silva, B. Ribeiro-Neto, and E. Braga. A site oriented method for segmenting web pages. In SIGIR '11, pages 215--224, New York, NY, USA, 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. F. Sun, D. Song, and L. Liao. Dom based content extraction via text density. In SIGIR '11, pages 245--254, New York, NY, USA, 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. K. Wang, X. Li, and J. Gao. Multi-style language model for web scale information retrieval. In SIGIR '10, pages 467--474, New York, NY, USA, 2010. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. The downside of markup: examining the harmful effects of CSS and javascript on indexing today's web

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
      October 2012
      2840 pages
      ISBN:9781450311564
      DOI:10.1145/2396761

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 29 October 2012

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • short-paper

      Acceptance Rates

      Overall Acceptance Rate1,861of8,427submissions,22%

      Upcoming Conference

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader