ABSTRACT
In this paper, we describe the development of a fielded application for detecting malicious executables in the wild. We gathered 1,971 benign and 1,651 malicious executables and encoded each as a training example using n-grams of byte codes as features. This processing produced more than 255 million distinct n-grams. After selecting the most relevant n-grams for prediction, we evaluated a variety of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting. Ultimately, boosted decision trees outperformed the other methods, with an area under the ROC curve of 0.996. Results also suggest that our methodology will scale to larger collections of executables. To the best of our knowledge, ours is the only fielded application for this task developed using techniques from machine learning and data mining.
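The encoding and selection steps summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the n-gram length, the relevance measure, or the number of n-grams retained, so the sketch assumes overlapping byte n-grams, Boolean presence/absence features, and information gain as the relevance score. The function names (`byte_ngrams`, `information_gain`, `select_ngrams`, `encode`) and the defaults `n=4` and `k=500` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def byte_ngrams(data, n=4):
    """Return the set of distinct overlapping byte n-grams in `data`."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def entropy(pos, neg):
    """Binary entropy (in bits) of a class split given by two counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(present_pos, present_neg, total_pos, total_neg):
    """Information gain of a Boolean feature (n-gram present or absent).

    present_pos / present_neg: positive / negative examples containing it.
    total_pos / total_neg: class totals over the whole training set.
    """
    total = total_pos + total_neg
    present = present_pos + present_neg
    absent = total - present
    conditional = 0.0
    if present:
        conditional += present / total * entropy(present_pos, present_neg)
    if absent:
        conditional += absent / total * entropy(
            total_pos - present_pos, total_neg - present_neg)
    return entropy(total_pos, total_neg) - conditional


def select_ngrams(samples, labels, n=4, k=500):
    """Rank every distinct n-gram by information gain and keep the top k.

    `labels` holds 1 for malicious and 0 for benign (an assumed convention).
    Ties are broken by byte order so the result is deterministic.
    """
    total_pos = sum(labels)
    total_neg = len(labels) - total_pos
    pos_counts, neg_counts = Counter(), Counter()
    for sample, label in zip(samples, labels):
        (pos_counts if label else neg_counts).update(byte_ngrams(sample, n))
    vocabulary = set(pos_counts) | set(neg_counts)
    ranked = sorted(
        vocabulary,
        key=lambda g: (-information_gain(pos_counts[g], neg_counts[g],
                                         total_pos, total_neg), g))
    return ranked[:k]


def encode(data, selected, n=4):
    """Encode one executable as Boolean features over the selected n-grams."""
    present = byte_ngrams(data, n)
    return [gram in present for gram in selected]
```

Each vector returned by `encode` can then be passed to any of the learners named in the abstract. Counting every distinct n-gram in memory, as this sketch does, would not be practical at the scale of 255 million n-grams without further engineering.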