DOI: 10.1145/1963405.1963436
research-article

Prophiler: a fast filter for the large-scale detection of malicious web pages

Published: 28 March 2011

ABSTRACT

Malicious web pages that host drive-by-download exploits have become a popular means for compromising hosts on the Internet and, subsequently, for creating large-scale botnets. In a drive-by-download exploit, an attacker embeds a malicious script (typically written in JavaScript) into a web page. When a victim visits this page, the script is executed and attempts to compromise the browser or one of its plugins. To detect drive-by-download exploits, researchers have developed a number of systems that analyze web pages for the presence of malicious code. Most of these systems use dynamic analysis. That is, they run the scripts associated with a web page either directly in a real browser (running in a virtualized environment) or in an emulated browser, and they monitor the scripts' executions for malicious activity. While the tools are quite precise, the analysis process is costly, often requiring on the order of tens of seconds for a single page. Therefore, performing this analysis on a large set of web pages containing hundreds of millions of samples can be prohibitive.

One approach to reduce the resources required for performing large-scale analysis of malicious web pages is to develop a fast and reliable filter that can quickly discard pages that are benign, forwarding to the costly analysis tools only the pages that are likely to contain malicious code. In this paper, we describe the design and implementation of such a filter. Our filter, called Prophiler, uses static analysis techniques to quickly examine a web page for malicious content. This analysis takes into account features derived from the HTML contents of a page, from the associated JavaScript code, and from the corresponding URL. We automatically derive detection models that use these features using machine-learning techniques applied to labeled datasets.
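The static examination described above can be illustrated with a minimal sketch. It assumes a handful of illustrative features (iframe count, script count, `eval` calls, script length, URL length, numeric-IP host) chosen for this example; they are not taken from the paper's feature set, and the function names `extract_features` and `needs_dynamic_analysis` are hypothetical. The hand-written rule at the end is only a placeholder for the detection models that the paper derives with machine learning from labeled datasets.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class _TagCounter(HTMLParser):
    """Collects simple structural counts and the text of inline scripts."""

    def __init__(self):
        super().__init__()
        self.iframes = 0
        self.scripts = 0
        self.script_chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            self.iframes += 1
        elif tag == "script":
            self.scripts += 1
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_chunks.append(data)


def extract_features(url, html):
    """Return illustrative static features from the HTML, JavaScript, and URL."""
    parser = _TagCounter()
    parser.feed(html)
    parser.close()
    js = "\n".join(parser.script_chunks)
    host = urlparse(url).hostname or ""
    return {
        # HTML-derived (assumed features, for illustration only)
        "html_iframe_count": parser.iframes,
        "html_script_count": parser.scripts,
        # JavaScript-derived, computed without executing the script
        "js_eval_count": len(re.findall(r"\beval\s*\(", js)),
        "js_length": len(js),
        # URL-derived
        "url_length": len(url),
        "url_host_is_ip": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
    }


def needs_dynamic_analysis(features):
    """Placeholder rule standing in for a learned model (an assumption)."""
    return (
        features["js_eval_count"] > 0
        or features["html_iframe_count"] > 0
        or features["url_host_is_ip"]
    )
```

In a real filter, the feature dictionary would be passed to a trained classifier rather than a fixed rule. Pages for which the decision is negative are discarded, and only the remainder is forwarded to the costly dynamic analysis tools.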

To demonstrate the effectiveness and efficiency of Prophiler, we crawled and collected millions of pages, which we analyzed for malicious behavior. Our results show that our filter is able to reduce the load on the more costly dynamic analysis tools by more than 85%, with a negligible amount of missed malicious pages.
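To give a sense of scale, the load reduction can be turned into a rough back-of-the-envelope calculation. The sketch below uses assumed inputs: one million pages and 10 seconds of dynamic analysis per page are illustrative numbers, not measurements from the paper, and 85% is used as the lower bound on the discard rate. The cost of running the filter itself is ignored, and the function name `dynamic_analysis_seconds` is hypothetical.

```python
def dynamic_analysis_seconds(pages, seconds_per_page, discard_percent=0):
    """Total dynamic-analysis time after a filter discards a share of pages.

    discard_percent is an integer percentage (0-100) of pages the filter
    drops before they reach the dynamic analysis tools.
    """
    forwarded = pages * (100 - discard_percent) // 100
    return forwarded * seconds_per_page


# Illustrative inputs only (assumptions, not figures from the paper).
PAGES = 1_000_000
SECONDS_PER_PAGE = 10

baseline = dynamic_analysis_seconds(PAGES, SECONDS_PER_PAGE)
filtered = dynamic_analysis_seconds(PAGES, SECONDS_PER_PAGE, discard_percent=85)

print(baseline)  # 10000000 seconds without a filter
print(filtered)  # 1500000 seconds when 85% of pages are discarded
```

Under these assumptions, the dynamic analysis workload drops from 10,000,000 seconds to 1,500,000 seconds. Because the reported reduction is "more than 85%", the filtered figure is an upper bound for this scenario.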

  • Published in

    WWW '11: Proceedings of the 20th international conference on World wide web
    March 2011
    840 pages
    ISBN:9781450306324
    DOI:10.1145/1963405

    Copyright © 2011 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
