DOI: 10.1145/2884781.2884877

SourcererCC: scaling code clone detection to big-code

Published: 14 May 2016

ABSTRACT

Despite a decade of active research, there has been a marked lack of clone detection techniques that scale to large repositories for detecting near-miss clones. In this paper, we present a token-based clone detector, SourcererCC, that can detect both exact and near-miss clones from large inter-project repositories using a standard workstation. It exploits an optimized inverted index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, and the number of token comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall, and precision of SourcererCC, and compare it to four publicly available, state-of-the-art tools. To measure recall, we use two recent benchmarks: (1) a big benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (25K projects, 250MLOC) using a standard workstation.
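
The abstract names three ingredients: a token-based comparison, an inverted index for querying candidates, and a filter based on token ordering. The following is a minimal sketch of how these could fit together, not the authors' implementation. It assumes details the abstract does not state: two blocks count as clones when their shared-token count reaches ceil(theta * max(size_a, size_b)), tokens are ordered rarest-first by global frequency, and only a short prefix of each ordered block is indexed. The names `tokenize`, `overlap`, `detect_clones`, and the parameter `theta` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(code):
    """Turn a code block into a bag (multiset) of identifier, number, and symbol tokens."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def overlap(bag_a, bag_b):
    """Count the tokens shared by two bags, including repeated tokens."""
    return sum((bag_a & bag_b).values())


def detect_clones(blocks, theta=0.8):
    """Return sorted index pairs (i, j), i < j, whose token overlap is at
    least ceil(theta * max(size_i, size_j)). The threshold rule is an assumption."""
    bags = [tokenize(block) for block in blocks]
    sizes = [sum(bag.values()) for bag in bags]

    # Global token frequencies define one fixed ordering for every block.
    freq = Counter()
    for bag in bags:
        freq.update(bag)

    def ordered(bag):
        # Rarest tokens first; ties broken by the token text itself.
        return sorted(bag.elements(), key=lambda tok: (freq[tok], tok))

    index = defaultdict(set)  # token -> ids of blocks whose prefix contains it
    clones = set()
    for i, bag in enumerate(bags):
        if sizes[i] == 0:
            continue
        need = math.ceil(theta * sizes[i])
        # Two blocks that meet the threshold must share a token in these prefixes,
        # so only the prefix is indexed and queried.
        prefix = ordered(bag)[: sizes[i] - need + 1]

        candidates = set()
        for tok in prefix:
            candidates |= index[tok]

        # Verify each candidate with the full bag comparison.
        for j in candidates:
            threshold = math.ceil(theta * max(sizes[i], sizes[j]))
            if overlap(bag, bags[j]) >= threshold:
                clones.add((j, i))

        for tok in prefix:
            index[tok].add(i)
    return sorted(clones)


if __name__ == "__main__":
    a = "int sum(int a, int b) { return a + b; }"
    b = "int sum(int x, int b) { return x + b; }"
    c = "void log(String m) { System.out.println(m); }"
    print(detect_clones([a, b, c]))
```

In this sketch the index holds only a few tokens per block rather than all of them, which illustrates how an ordering-based filter can shrink both the index and the number of candidate comparisons while the final verification keeps the result exact for the assumed threshold.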


  • Published in

    ICSE '16: Proceedings of the 38th International Conference on Software Engineering
    May 2016
    1235 pages
    ISBN: 9781450339001
    DOI: 10.1145/2884781

    Copyright © 2016 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • research-article

    Acceptance Rates

    Overall Acceptance Rate: 276 of 1,856 submissions, 15%
