Automated Model Learning for Accurate Detection of Malicious Digital Documents

Authors:
Daniel Scofield, Assured Information Security, Suite, Portland, OR
Craig Miles, Assured Information Security, Suite, Portland, OR (ORCID: 0000-0002-8648-803X)
Stephen Kuhn, Air Force Research Laboratory, Rome, NY

Digital Threats: Research and Practice, Volume 1, Issue 3, Article No. 15, pp. 1–21. https://doi.org/10.1145/3379505

Published: 04 August 2020


Abstract

Modern cyber attacks are often conducted by distributing digital documents that contain malware. The approach detailed herein consists of a classifier that uses features derived from dynamic analysis of a document viewer as it renders the document in question. It classifies the disposition of digital documents with greater than 98% accuracy even when its model is trained on only small amounts of data. To keep the classification model small, and thereby scalable, we employ an entity resolution strategy that merges syntactically disparate features that are thought to be semantically equivalent but vary due to programmatic randomness. Entity resolution enables construction of a comprehensive model of benign functionality from relatively few training documents, and the model does not improve significantly with additional training data. In particular, we describe and quantitatively evaluate a fully automated, document-format-agnostic approach for learning a classification model that provides efficacious malicious document detection.
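The entity resolution idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure the authors use, so the normalized edit distance, the 0.2 threshold, the greedy grouping, and the feature strings below are all assumptions chosen for illustration, not the paper's actual method. The sketch merges near-identical features observed while rendering benign documents into a small set of representatives, then reports features from a new document that no representative explains.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similar(a: str, b: str, threshold: float) -> bool:
    """Assumed equivalence test: edit distance relative to the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a, b) / longest <= threshold


def resolve_entities(features: Iterable[str], threshold: float = 0.2) -> List[str]:
    """Greedily keep one representative per group of near-identical features."""
    representatives: List[str] = []
    for feat in features:
        if not any(similar(feat, rep, threshold) for rep in representatives):
            representatives.append(feat)
    return representatives


def unexplained(observed: Iterable[str], model: List[str],
                threshold: float = 0.2) -> List[str]:
    """Return observed features that match no representative in the benign model."""
    return [f for f in observed
            if not any(similar(f, rep, threshold) for rep in model)]


# Hypothetical features; the temp-file names differ only in a random suffix.
benign = [
    r"file_write C:\Temp\acr1a2b.tmp",
    r"file_write C:\Temp\acr9f3c.tmp",
    r"reg_read HKCU\Software\Viewer\Settings",
]
model = resolve_entities(benign)

suspect = [
    r"file_write C:\Temp\acr77aa.tmp",
    r"process_create cmd.exe /c payload.exe",
]
print(len(model), unexplained(suspect, model))
```

In this sketch the two temp-file writes collapse into a single representative, so the model stays small, and a third temp-file write with a fresh random suffix is still recognized as benign behavior. How unexplained features would be turned into a final classification is left open here, since the abstract does not describe it.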


Index Terms

Automated Model Learning for Accurate Detection of Malicious Digital Documents
1. Security and privacy
  1. Intrusion/anomaly detection and malware mitigation
    1. Malware and its mitigation


Published in
Digital Threats: Research and Practice Volume 1, Issue 3
Field Notes
September 2020
93 pages
EISSN:2576-5337
DOI:10.1145/3415596
Editors:
Arun Lakhotia, University of Louisiana at Lafayette and Cythereal, USA
Leigh Metcalf, CERT, USA
Copyright © 2020 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 August 2020
- Online AM: 7 May 2020
- Revised: 1 January 2020
- Accepted: 1 January 2020
- Received: 1 September 2019
Author Tags
Malware detection
anomaly detection
dynamic analysis
Qualifiers
- research-article
- Research
- Refereed
