ABSTRACT
Despite a decade of active research, there is a marked lack of clone detection techniques that scale to large repositories when detecting near-miss clones. In this paper, we present SourcererCC, a token-based clone detector that can detect both exact and near-miss clones in large inter-project repositories using a standard workstation. It exploits an optimized inverted index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, and the number of token comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall, and precision of SourcererCC, and compare it to four publicly available, state-of-the-art tools. To measure recall, we use two recent benchmarks: (1) a large benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find that SourcererCC has both high recall and high precision, and that it scales to a large inter-project repository (25K projects, 250 MLOC) using a standard workstation.
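To make the idea of an inverted index combined with a token-ordering filter more concrete, the following is a minimal sketch and not SourcererCC's actual implementation. It assumes that two code blocks count as clones when their token overlap reaches a fraction `theta` of the larger block; the abstract does not state the similarity function or threshold, so `theta`, `overlap`, and `detect_clones` are hypothetical names chosen for illustration. Tokens in each block are sorted by global frequency (rare tokens first), and only a short prefix of each sorted block is indexed and queried.

```python
from collections import Counter, defaultdict
from math import ceil


def overlap(a, b):
    """Size of the multiset intersection of two token lists."""
    return sum((Counter(a) & Counter(b)).values())


def detect_clones(blocks, theta=0.8):
    """Return pairs of block ids whose token overlap is at least
    theta * max(len(a), len(b)). The threshold rule is an assumption."""
    # Global token frequency; sorting by it puts rare tokens first.
    freq = Counter(tok for toks in blocks.values() for tok in toks)
    ordered = {
        bid: sorted(toks, key=lambda t: (freq[t], t))
        for bid, toks in blocks.items()
    }

    index = defaultdict(set)  # token -> ids of blocks with that token in their prefix
    clones = set()
    for bid, toks in ordered.items():
        n = len(toks)
        # Only this prefix has to be indexed and queried: two blocks that
        # meet the threshold must share at least one token in their prefixes.
        prefix = toks[: n - ceil(theta * n) + 1]

        candidates = set()
        for tok in prefix:
            candidates |= index[tok]

        # Verify each candidate with a full token comparison.
        for other in candidates:
            need = ceil(theta * max(n, len(ordered[other])))
            if overlap(toks, ordered[other]) >= need:
                clones.add(tuple(sorted((bid, other))))

        # Index the block only after querying, so each pair is checked once.
        for tok in prefix:
            index[tok].add(bid)
    return clones


if __name__ == "__main__":
    blocks = {
        "a": "int total = 0 for x in xs total += x return total".split(),
        "b": "long total = 0 for x in xs total += x return total".split(),
        "c": "print hello world".split(),
    }
    print(detect_clones(blocks))
```

In this sketch the index holds only the prefix tokens of each block rather than all of its tokens, which is the sense in which a token-ordering filter can shrink both the index and the candidate set. Blocks that share no prefix token are never compared at all.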