DOI: 10.1145/2911451.2914757
short-paper

Understanding Website Behavior based on User Agent

Published: 07 July 2016

ABSTRACT

Web sites have adopted a variety of adversarial techniques to prevent web crawlers from retrieving their content. While it is possible to simulate user behavior with a browser to crawl such sites, this approach does not scale. Understanding existing adversarial techniques is therefore important for designing crawling strategies that adapt and retrieve content as efficiently as possible. Ideally, a web crawler should detect the nature of the adversarial policies and select the most cost-effective means to defeat them.

In this paper, we discuss the results of a large-scale study of web site behavior based on their responses to different user-agents. We issued over 9 million HTTP GET requests to 1.3 million unique web sites from DMOZ using six different user-agents and the TOR network as an anonymous proxy. We observed that web sites do change their responses depending on user-agents and IP addresses. This suggests that probing sites for these features can be an effective means to detect adversarial techniques.
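The probing idea described above can be illustrated with a minimal sketch: request the same URL under several user-agents and check whether the responses differ. This is not the authors' implementation. The user-agent strings, the function names (`fetch`, `probe`, `differs_by_user_agent`), and the choice of comparing status codes and body digests are all assumptions made for illustration, and routing requests through Tor is omitted.

```python
import hashlib
import urllib.error
import urllib.request
from typing import Callable, Dict, Tuple

# Hypothetical user-agent strings; the six used in the study are not listed here.
USER_AGENTS: Dict[str, str] = {
    "browser": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "crawler": "ExampleBot/1.0 (+http://example.org/bot)",
}

Response = Tuple[int, bytes]      # (HTTP status code, response body)
Fingerprint = Tuple[int, str]     # (HTTP status code, SHA-256 of body)


def fetch(url: str, user_agent: str, timeout: float = 10.0) -> Response:
    """Issue one HTTP GET with the given User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as reply:
            return reply.status, reply.read()
    except urllib.error.HTTPError as error:
        # 4xx/5xx replies are still observations, so record them too.
        return error.code, error.read()


def probe(
    url: str,
    fetcher: Callable[[str, str], Response] = fetch,
    user_agents: Dict[str, str] = USER_AGENTS,
) -> Dict[str, Fingerprint]:
    """Return one (status, body digest) fingerprint per user-agent label."""
    fingerprints = {}
    for label, agent in user_agents.items():
        status, body = fetcher(url, agent)
        fingerprints[label] = (status, hashlib.sha256(body).hexdigest())
    return fingerprints


def differs_by_user_agent(fingerprints: Dict[str, Fingerprint]) -> bool:
    """True if at least two user-agents received different fingerprints."""
    return len(set(fingerprints.values())) > 1
```

Calling `differs_by_user_agent(probe("http://example.org/"))` would run the comparison against a live site. Exact digest comparison is a deliberately crude choice: pages with dynamic content can differ between two requests even under the same user-agent, so a practical probe would likely need a more tolerant similarity measure. Connection failures (`urllib.error.URLError`) are left unhandled in this sketch.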


        Published in

        SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
        July 2016
        1296 pages
        ISBN: 9781450340694
        DOI: 10.1145/2911451

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        SIGIR '16 Paper Acceptance Rate: 62 of 341 submissions, 18%
        Overall Acceptance Rate: 792 of 3,983 submissions, 20%
