Understanding Website Behavior based on User Agent

ABSTRACT
Web sites have adopted a variety of adversarial techniques to prevent web crawlers from retrieving their content. While it is possible to simulate user behavior with a full browser when crawling such sites, this approach does not scale. Understanding existing adversarial techniques is therefore important for designing crawling strategies that can adapt to retrieve content as efficiently as possible. Ideally, a web crawler should detect the nature of a site's adversarial policies and select the most cost-effective means of defeating them.
In this paper, we present the results of a large-scale study of web site behavior based on responses to different user-agents. We issued over 9 million HTTP GET requests to 1.3 million unique web sites from DMOZ using six different user-agents and the Tor network as an anonymous proxy. We observed that web sites do change their responses depending on the user-agent and IP address. This suggests that probing sites for these features can be an effective means of detecting adversarial techniques.
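The probing methodology described above can be sketched roughly as follows: request the same URL under several user-agent strings and compare response fingerprints to see whether the server discriminates. This is a minimal illustration, not the authors' actual measurement harness; the user-agent strings and the fingerprint fields (status, length, body hash) are assumptions chosen for the example.

```python
import hashlib
import urllib.request

# Hypothetical set of user-agent strings to probe with; the paper's
# six actual user-agents are not listed in the abstract.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",        # desktop browser
    "Googlebot/2.1 (+http://www.google.com/bot.html)",  # search crawler
    "curl/8.0.1",                                       # command-line client
]

def fingerprint(status, body):
    """Summarize a response so that differing variants are easy to spot."""
    return (status, len(body), hashlib.sha256(body).hexdigest()[:12])

def serves_uniformly(fingerprints):
    """True if every user-agent received an identical response."""
    return len(set(fingerprints)) <= 1

def probe(url, timeout=10):
    """Fetch `url` once per user-agent and return a fingerprint for each."""
    results = {}
    for ua in USER_AGENTS:
        req = urllib.request.Request(url, headers={"User-Agent": ua})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            results[ua] = fingerprint(resp.status, resp.read())
    return results
```

In practice, `probe("http://example.com/")` followed by `serves_uniformly(probe_results.values())` flags sites whose responses vary by user-agent; routing the same requests through a Tor proxy would additionally test for IP-based discrimination.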