DOI: 10.1145/1014052.1014105

Learning to detect malicious executables in the wild

Published: 22 August 2004

ABSTRACT

In this paper, we describe the development of a fielded application for detecting malicious executables in the wild. We gathered 1971 benign and 1651 malicious executables and encoded each as a training example using n-grams of byte codes as features. Such processing resulted in more than 255 million distinct n-grams. After selecting the most relevant n-grams for prediction, we evaluated a variety of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting. Ultimately, boosted decision trees outperformed other methods with an area under the ROC curve of 0.996. Results also suggest that our methodology will scale to larger collections of executables. To the best of our knowledge, ours is the only fielded application for this task developed using techniques from machine learning and data mining.
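
As a rough illustration of the encoding step described in the abstract, the following Python sketch extracts overlapping byte n-grams from an executable's raw bytes and turns them into Boolean presence features. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not state the value of n, so `n=4` is an illustrative choice, and the function names `byte_ngrams` and `encode` and the example vocabulary are hypothetical.

```python
from typing import List, Set


def byte_ngrams(data: bytes, n: int = 4) -> Set[bytes]:
    """Return the distinct overlapping n-byte sequences found in data.

    n=4 is an illustrative default; the abstract does not specify n.
    """
    # Slide a window of width n over the byte string, one byte at a time.
    # If data is shorter than n, the range is empty and the set is empty.
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def encode(data: bytes, vocabulary: List[bytes], n: int = 4) -> List[bool]:
    """Encode one executable as presence/absence of each selected n-gram.

    `vocabulary` stands in for the "most relevant n-grams" the abstract
    mentions; how they are selected is not shown here.
    """
    present = byte_ngrams(data, n)
    return [gram in present for gram in vocabulary]


if __name__ == "__main__":
    # Hypothetical five-byte input, used only to exercise the functions.
    sample = b"\x4d\x5a\x90\x00\x03"
    vocab = [b"\x4d\x5a\x90\x00", b"\xff\xff\xff\xff"]
    print(sorted(byte_ngrams(sample)))
    print(encode(sample, vocab))
```

Each executable becomes one fixed-length Boolean vector, which is the kind of input that learners such as naive Bayes, decision trees, and support vector machines can consume directly.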

Published in

KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2004
874 pages
ISBN: 1581138881
DOI: 10.1145/1014052

Copyright © 2004 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall acceptance rate: 1,133 of 8,635 submissions (13%)
