research-article
Open Access
Artifacts Available
Artifacts Evaluated & Functional

DéjàVu: a map of code duplicates on GitHub

Published: 12 October 2017

Abstract

Previous studies have shown that there is a non-trivial amount of duplication in source code. This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub, representing over 428 million files written in Java, C++, Python, and JavaScript. We found that this corpus has a mere 85 million unique files. In other words, 70% of the code on GitHub consists of clones of previously created files. There is considerable variation between language ecosystems. JavaScript has the highest rate of file duplication: only 6% of its files are distinct. Java, on the other hand, has the least duplication: 60% of its files are distinct. Lastly, a project-level analysis shows that between 9% and 31% of the projects contain at least 80% of files that can be found elsewhere. These rates of duplication have implications for systems built on open source software as well as for researchers interested in analyzing large code bases. As a concrete artifact of this study, we have created DéjàVu, a publicly available map of code duplicates in GitHub repositories.
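To make the notion of file-level duplication concrete, the following is a minimal sketch of how a share of duplicated files could be measured by hashing file contents. It is an illustration only, not the study's actual pipeline: the function name `duplication_stats`, the use of SHA-256, and the toy corpus are all assumptions introduced here.

```python
import hashlib


def duplication_stats(files):
    """Measure exact file-level duplication in a corpus.

    `files` maps a file path to its raw bytes. Two files count as
    duplicates only when their contents are byte-for-byte identical
    (an assumption of this sketch; near-duplicates are not detected).

    Returns (total_files, distinct_files, duplicate_share).
    """
    # One hash per distinct file content.
    seen = set()
    for content in files.values():
        seen.add(hashlib.sha256(content).hexdigest())

    total = len(files)
    distinct = len(seen)
    # Share of files that repeat content already present in the corpus.
    duplicate_share = (total - distinct) / total if total else 0.0
    return total, distinct, duplicate_share


# Hypothetical toy corpus: one helper file copied into three projects.
corpus = {
    "proj_a/util.js": b"function id(x) { return x; }\n",
    "proj_b/vendor/util.js": b"function id(x) { return x; }\n",
    "proj_c/lib/util.js": b"function id(x) { return x; }\n",
    "proj_c/main.js": b"console.log('hello');\n",
}

total, distinct, share = duplication_stats(corpus)
print(total, distinct, share)
```

In this toy corpus, four files reduce to two distinct contents, so half of the files are copies of content that already exists elsewhere. Exact hashing of raw bytes misses copies that differ only in whitespace or comments, so a measure of this kind is a lower bound on duplication.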


