research-article

Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity

Authors:
Blake Anderson, Cisco Systems, Inc., San Jose, CA, USA
David McGrew, Cisco Systems, Inc., San Jose, CA, USA

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2017, Pages 1723–1732. https://doi.org/10.1145/3097983.3098163

Published: 13 August 2017

ABSTRACT

The application of machine learning for the detection of malicious network traffic has been well researched over the past several decades; it is particularly appealing when the traffic is encrypted because traditional pattern-matching approaches cannot be used. Unfortunately, the promise of machine learning has been slow to materialize in the network security domain. In this paper, we highlight two primary reasons why this is the case: inaccurate ground truth and a highly non-stationary data distribution. To demonstrate and understand the effect that these pitfalls have on popular machine learning algorithms, we design and carry out experiments that show how six common algorithms perform when confronted with real network data. With our experimental results, we identify the situations in which certain classes of algorithms underperform on the task of encrypted malware traffic classification. We offer concrete recommendations for practitioners given the real-world constraints outlined. From an algorithmic perspective, we find that the random forest ensemble method outperformed competing methods. More importantly, feature engineering was decisive; we found that iterating on the initial feature set, and including features suggested by domain experts, had a much greater impact on the performance of the classification system than the choice of algorithm. For example, linear regression using the more expressive feature set easily outperformed the random forest method using a standard network traffic representation on all criteria considered. Our analysis is based on millions of TLS-encrypted sessions collected over 12 months from a commercial malware sandbox and two geographically distinct, large enterprise networks.
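
The two pitfalls named in the abstract, inaccurate ground truth and a non-stationary data distribution, can be illustrated with a toy experiment. The sketch below is not the authors' method, features, or data. It assumes a hypothetical one-feature synthetic dataset and uses a nearest-centroid classifier as a stand-in for the algorithms compared in the paper. It flips a fraction of the training labels to mimic noisy labels, and it trains on earlier months and tests on later ones to mimic drift. All names and parameters (`flip_labels`, `temporal_split`, `make_records`, `noise_rate`, `drift`) are introduced here for illustration only.

```python
import random


def flip_labels(labels, noise_rate, rng):
    # Simulate inaccurate ground truth: flip each binary label
    # independently with probability noise_rate.
    return [1 - y if rng.random() < noise_rate else y for y in labels]


def temporal_split(records, cutoff):
    # Train on records observed before the cutoff month and test on the
    # rest, so the evaluation is exposed to drift (a shuffled split is not).
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test


def fit_centroids(xs, ys):
    # Toy stand-in classifier: the per-class mean of a single feature.
    means = {}
    for c in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == c]
        means[c] = sum(vals) / len(vals)
    return means


def predict(means, x):
    # Assign the class whose centroid is closest to x.
    return min(means, key=lambda c: abs(x - means[c]))


def accuracy(means, xs, ys):
    return sum(predict(means, x) == y for x, y in zip(xs, ys)) / len(ys)


def make_records(n, drift, rng):
    # Hypothetical synthetic data: (month, feature, label) tuples.
    # The feature mean shifts by `drift` per month to mimic non-stationarity.
    records = []
    for _ in range(n):
        month = rng.randrange(12)
        label = rng.randrange(2)
        x = rng.gauss(2.0 * label + drift * month, 1.0)
        records.append((month, x, label))
    return records


def run(noise_rate, drift, seed=0):
    # Train on months 0-5 with noisy labels; test on months 6-11 with clean labels.
    rng = random.Random(seed)
    train, test = temporal_split(make_records(2000, drift, rng), cutoff=6)
    noisy = flip_labels([r[2] for r in train], noise_rate, rng)
    means = fit_centroids([r[1] for r in train], noisy)
    return accuracy(means, [r[1] for r in test], [r[2] for r in test])


if __name__ == "__main__":
    for noise_rate in (0.0, 0.2, 0.4):
        for drift in (0.0, 0.1, 0.3):
            print(noise_rate, drift, round(run(noise_rate, drift), 3))
```

Sweeping `noise_rate` and `drift` separately shows how each factor affects held-out accuracy in this toy setting. The time-ordered split is the important design choice: a random shuffle would mix early and late records and hide the effect of drift.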

Index Terms

1. Computing methodologies
  1. Machine learning
2. Security and privacy
  1. Intrusion/anomaly detection and malware mitigation
    1. Malware and its mitigation
  2. Network security
    1. Web protocol security

Published in
KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN:9781450348874
DOI:10.1145/3097983
General Chairs: Stan Matwin (Dalhousie University), Shipeng Yu (LinkedIn), Faisal Farooq (IBM)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
machine learning
malware detection
network security
tls
Acceptance Rates
KDD '17 Paper Acceptance Rate: 64 of 748 submissions, 9%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%