Adversarial Authorship Attribution in Open-Source Projects

Authors:
Alina Matyukhina, University of New Brunswick, Fredericton, Canada
Natalia Stakhanova, University of Saskatchewan, Saskatoon, Canada
Mila Dalla Preda, University of Verona, Verona, Italy
Celine Perley, University of New Brunswick, Fredericton, Canada

CODASPY '19: Proceedings of the Ninth ACM Conference on Data and Application Security and Privacy, March 2019, Pages 291–302. https://doi.org/10.1145/3292006.3300032

Published: 13 March 2019

ABSTRACT

Open-source software is open to anyone by design, whether a community of developers, hackers, or malicious users. Authors of open-source software typically hide their identity behind nicknames and avatars. However, they have no protection against authorship attribution techniques that can build software author profiles just by analyzing software characteristics. In this paper we present an author imitation attack that deceives current authorship attribution systems by mimicking the coding style of a target developer. Within this context we explore how susceptible existing attribution techniques are to deception. Our results show that we are able to imitate the coding style of developers based on data collected from the popular source code repository GitHub. To subvert the author imitation attack, we propose a novel author obfuscation approach that allows us to hide the coding style of the author. Unlike existing obfuscation tools, this new obfuscation technique uses transformations that preserve code readability. We assess the effectiveness of our attacks on several datasets produced by actual developers from GitHub and by participants of the Google Code Jam competition. Throughout our experiments we show that author hiding can be achieved by making sensible transformations that reduce the likelihood of identifying the author's style by current authorship attribution systems to 0%.


Index Terms

Adversarial Authorship Attribution in Open-Source Projects
1. Security and privacy
  1. Software and application security
    1. Software security engineering


Published in
CODASPY '19: Proceedings of the Ninth ACM Conference on Data and Application Security and Privacy
March 2019
373 pages
ISBN: 9781450360999
DOI: 10.1145/3292006
General Chairs: Gail-Joon Ahn (Arizona State University, USA), Bhavani Thuraisingham (University of Texas at Dallas, USA)
Program Chairs: Murat Kantarcioglu (University of Texas at Dallas, USA), Ram Krishnan (University of Texas at San Antonio, USA)
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 March 2019
Author Tags
adversarial
attacks
authorship attribution
imitation
obfuscation
open-source software
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 149 of 789 submissions, 19%
