DOI: 10.1145/2835776.2835834
research-article
Public Access

Ensemble Models for Data-driven Prediction of Malware Infections

Published: 08 February 2016

ABSTRACT

Given a history of detected malware attacks, can we predict the number of malware infections in a country? Can we do this for different malware and countries? This is an important question with numerous implications for cyber security, from designing better anti-virus software, to designing and implementing targeted patches, to measuring the economic impact of breaches more accurately. The problem is compounded by the fact that, as external observers, we can detect only a fraction of actual malware infections. In this paper we address this problem using data from Symantec covering more than 1.4 million hosts and 50 malware spread across 2 years and multiple countries. First, we carefully design domain-based features from both the malware and the machine-host perspectives. Second, inspired by epidemiological and information diffusion models, we design a novel temporal non-linear model for malware spread and detection. Finally, we present ESM, an ensemble-based approach that combines both methods to construct a more accurate algorithm. Using extensive experiments spanning multiple malware and countries, we show that ESM can effectively predict malware infection ratios over time (both the actual number and the trend) up to 4 times better than several baselines on various metrics. Furthermore, ESM's performance is stable and robust even when the number of detected infections is low.


      Published in

      WSDM '16: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining
      February 2016
      746 pages
      ISBN: 9781450337168
      DOI: 10.1145/2835776

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher: Association for Computing Machinery, New York, NY, United States



      Acceptance Rates

      WSDM '16 paper acceptance rate: 67 of 368 submissions, 18%. Overall acceptance rate: 498 of 2,863 submissions, 17%.
