ABSTRACT
Open-source software is open to anyone by design, whether a community of developers, hackers, or malicious users. Authors of open-source software typically hide their identity behind nicknames and avatars. However, they have no protection against authorship attribution techniques that can build software author profiles just by analyzing software characteristics. In this paper we present an author imitation attack that deceives current authorship attribution systems by mimicking the coding style of a target developer. Within this context, we explore how susceptible existing attribution techniques are to deception. Our results show that we are able to imitate the coding style of developers based on data collected from the popular source code repository GitHub. To counter the author imitation attack, we propose a novel author obfuscation approach that hides the coding style of the author. Unlike existing obfuscation tools, this new obfuscation technique uses transformations that preserve code readability. We assess the effectiveness of our attacks on several datasets produced by actual developers from GitHub and by participants of the Google Code Jam competition. Throughout our experiments we show that author hiding can be achieved with sensible transformations that reduce the likelihood of current authorship attribution systems identifying the author's style to 0%.
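The idea of a readability-preserving style transformation can be illustrated with a minimal sketch. It is not the paper's implementation: the three layout features, the target convention, and the function names `layout_features` and `normalize_layout` are all assumptions made for illustration. The regex-based rewriting ignores comments and string literals, so a real tool would work on a parsed syntax tree instead.

```python
import re

def layout_features(source: str) -> dict:
    """Toy stylometric profile built from three layout habits (hypothetical feature set)."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    n = len(lines)
    return {
        # share of non-empty lines indented with a tab
        "tab_indent_ratio": sum(ln.startswith("\t") for ln in lines) / n,
        # share of non-empty lines holding only an opening brace
        "brace_own_line_ratio": sum(ln.strip() == "{" for ln in lines) / n,
        # number of control keywords followed by exactly one space
        "space_after_keyword": len(re.findall(r"\b(?:if|for|while) \(", source)),
    }

def normalize_layout(source: str) -> str:
    """Rewrite layout into one fixed convention (an assumed target style)."""
    # Leading tabs become four spaces each.
    text = re.sub(r"^\t+", lambda m: "    " * len(m.group()), source, flags=re.MULTILINE)
    # An opening brace on its own line moves to the end of the previous line.
    text = re.sub(r"\s*\n\s*\{", " {", text)
    # Exactly one space between a control keyword and its parenthesis.
    text = re.sub(r"\b(if|for|while)\s*\(", r"\1 (", text)
    return text

# Two hypothetical authors writing the same function with different layout habits.
author_a = "int main()\n{\n\tif(x > 0)\n\t{\n\t\treturn 1;\n\t}\n\treturn 0;\n}\n"
author_b = "int main() {\n    if (x > 0) {\n        return 1;\n    }\n    return 0;\n}\n"

before_differs = layout_features(author_a) != layout_features(author_b)
hidden_a = normalize_layout(author_a)
hidden_b = normalize_layout(author_b)
after_matches = layout_features(hidden_a) == layout_features(hidden_b)
print(before_differs, after_matches)
```

Here only whitespace and brace placement change, so the code stays readable while the layout features of the two samples become indistinguishable. Lexical and syntactic features, which attribution systems also use, are outside the scope of this sketch.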
Index Terms
- Adversarial Authorship Attribution in Open-Source Projects