DOI: 10.1145/3292006.3300032
research-article

Adversarial Authorship Attribution in Open-Source Projects

Published: 13 March 2019

ABSTRACT

Open-source software is open to anyone by design, whether it is a community of developers, hackers, or malicious users. Authors of open-source software typically hide their identity behind nicknames and avatars. However, they have no protection against authorship attribution techniques that can build software author profiles just by analyzing software characteristics. In this paper we present an author imitation attack that deceives current authorship attribution systems by mimicking the coding style of a target developer. Within this context we explore the potential of existing attribution techniques to be deceived. Our results show that we are able to imitate the coding style of developers based on data collected from the popular source code repository GitHub. To subvert the author imitation attack, we propose a novel author obfuscation approach that allows us to hide the coding style of the author. Unlike existing obfuscation tools, this new obfuscation technique uses transformations that preserve code readability. We assess the effectiveness of our attacks on several datasets produced by actual developers from GitHub and by participants of the Google Code Jam competition. Throughout our experiments we show that author hiding can be achieved by making sensible transformations that reduce the likelihood of current authorship attribution systems identifying the author's style to 0%.

    • Published in

      CODASPY '19: Proceedings of the Ninth ACM Conference on Data and Application Security and Privacy
      March 2019
      373 pages
      ISBN: 9781450360999
      DOI: 10.1145/3292006

      Copyright © 2019 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 149 of 789 submissions, 19%
