Understanding Website Behavior based on User Agent

ABSTRACT
Web sites have adopted a variety of adversarial techniques to prevent web crawlers from retrieving their content. While it is possible to simulate user behavior with a full browser when crawling such sites, this approach does not scale. Understanding existing adversarial techniques is therefore important for designing crawling strategies that can adapt to retrieve content as efficiently as possible. Ideally, a web crawler should detect the nature of a site's adversarial policies and select the most cost-effective means of defeating them.
In this paper, we present the results of a large-scale study of web site behavior based on responses to different user-agents. We issued over 9 million HTTP GET requests to 1.3 million unique web sites from DMOZ using six different user-agents and the Tor network as an anonymous proxy. We observed that web sites do change their responses depending on the user-agent and IP address. This suggests that probing sites for these features can be an effective means of detecting adversarial techniques.
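The probing methodology described above can be sketched roughly as follows: request the same URL under several user-agent strings and compare response fingerprints to see whether the server discriminates. This is a minimal illustration, not the authors' actual measurement harness; the user-agent strings and the fingerprint fields (status, length, body hash) are assumptions chosen for the example.

```python
import hashlib
import urllib.request

# Hypothetical set of user-agent strings to probe with; the paper's
# six actual user-agents are not listed in the abstract.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",        # desktop browser
    "Googlebot/2.1 (+http://www.google.com/bot.html)",  # search crawler
    "curl/8.0.1",                                       # command-line client
]

def fingerprint(status, body):
    """Summarize a response so that differing variants are easy to spot."""
    return (status, len(body), hashlib.sha256(body).hexdigest()[:12])

def serves_uniformly(fingerprints):
    """True if every user-agent received an identical response."""
    return len(set(fingerprints)) <= 1

def probe(url, timeout=10):
    """Fetch `url` once per user-agent and return a fingerprint for each."""
    results = {}
    for ua in USER_AGENTS:
        req = urllib.request.Request(url, headers={"User-Agent": ua})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            results[ua] = fingerprint(resp.status, resp.read())
    return results
```

In practice, `probe("http://example.com/")` followed by `serves_uniformly(probe_results.values())` flags sites whose responses vary by user-agent; routing the same requests through a Tor proxy would additionally test for IP-based discrimination.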