ABSTRACT
Code clones are similar code fragments that occur at multiple locations in a software system. Detection of code clones provides useful information for maintenance, reengineering, program understanding and reuse. Several techniques have been proposed to detect code clones. These techniques differ in the code representation used for analysis of clones, ranging from plain text to parse trees and program dependence graphs. Clone detection based on lexical tokens involves minimal code transformation and gives good results, but is computationally expensive because of the large number of tokens that need to be compared. We explored string algorithms to find suitable data structures and algorithms for efficient token based clone detection and implemented them in our tool Repeated Tokens Finder (RTF). Instead of using suffix tree for string matching, we use more memory efficient suffix array. RTF incorporates a suffix array based linear time algorithm to detect string matches. It also provides a simple and customizable tokenization mechanism. Initial analysis and experiments show that our clone detection is simple, scalable, and performs better than the previous well-known tools.
- Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E., "Replacing suffix trees with suffix arrays", Journal of Discrete Algorithms, vol. 2(1), 2004, pp. 53--86. Google ScholarDigital Library
- Baker, B. S., "A Program for Identifying Duplicated Code", Computing Science and Statistics, vol. 24, 1992, pp. 49--57.Google Scholar
- Basit, H. A., Rajapakse, D. C., and Jarzabek, S. "Beyond Templates: a Study of Clones in the STL and Some General Implications," In Proc. Int. Conf. on Software Engineering, (ICSE'05), St. Louis, USA, May 2005, pp. 451--459. Google ScholarDigital Library
- Basit, H. A., and Jarzabek, S. "Detecting Higher-level Similarity Patterns in Programs", In Proc. European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE'05), ACM Press, Lisbon, Portugal, September 2005, pp. 156--165. Google ScholarDigital Library
- Baxter, I. D., Yahin, A., Moura, L., Anna, M. S. and Bier, L., "Clone detection using abstract syntax trees," In Proc. Intl. Conference on Software Maintenance (ICSM '98), 1998, pp. 368--377. Google ScholarDigital Library
- Bellon, S., "Vergleich von Techniken zur Erkennung duplizierten Quellcodes", Master's Thesis, Institut fur Softwaretechnologie, Universitat Stuttgart, Stuttgart, Germany, 2002.Google Scholar
- Burd, E., and Bailey, J., "Evaluating Clone Detection Tools for Use during Preventative Maintenance", In Proc. 2nd IEEE Intl. Workshop on Source Code Analysis and Manipulation (SCAM'02), 2002, pp. 36--43. Google ScholarDigital Library
- Ducasse, S, Rieger, M., and Demeyer, S., "A language independent approach for detecting duplicated code," In Proc. Intl. Conference on Software Maintenance (ICSM '99), 1999. pp. 109--118. Google ScholarDigital Library
- Jarzabek, S. and Shubiao, L., "Eliminating Redundancies with a 'Composition with Adaptation' Meta-programming Technique", In Proc. European Soft. Eng. Conf. and ACM SIGSOFT Symp. on the Foundations of Soft. Eng. (ESEC-FSE'03), Helsinki, Sept. 2003, pp. 237--246. Google ScholarDigital Library
- Johnson, J. H., "Identifying redundancy in source code using fingerprints", In Proc. Conf. of the Centre for Advanced Studies on Collaborative research: software engineering (CASCON'93), 1993, pp 171--183. Google ScholarDigital Library
- Kamiya, T., Kusumoto, S, and Inoue, K., "CCFinder: A multi-linguistic token based code clone detection system for large scale source code," IEEE Trans. Software Engineering, vol. 28(7), July 2002, pp. 654 -- 670. Google ScholarDigital Library
- Kolpakov, R., Bana, G., and Kucherov, G., "mreps: efficient and flexible detection of tandem repeats in DNA", Nucleic Acids Research, vol. 31(13), Oxford University Press, 2003, 3672--3678.Google Scholar
- Komondoor, R., and Horwitz, S., "Using slicing to identify duplication in source code," In Proc. 8th International Symposium on Static Analysis, 2001, pp. 40--56. Google ScholarDigital Library
- Krinke, J., "Identifying Similar Code with Program Dependence Graphs", In Proc. 8th Working Conference on Reverse Engineering, Stuttgart, Germany, October 2001, pp. 301--309. Google ScholarDigital Library
- Koschke, R., Falke, R., and Frenzel, P. Clone Detection Using Abstract Syntax Suffix Trees. In Proceedings of the 13th Working Conference on Reverse Engineering (WCRE), pages 253--262, 2006. Google ScholarDigital Library
- Lague, B., Proulx, D., Merlo, E., Mayrand, J., and Hudepohl, J., "Assessing the benefits of incorporating function clone detection in a development process," Experience Report, Intl. Conference on Software Maintenance (ICSM '97), 1997, pp. 314--321. Google ScholarDigital Library
- Li, Z., Lu, S., Myagmar, S., Zhou, Y., CP-Miner: Finding copy-paste and related bugs in large-scale software code, IEEE Transactions on Software Engineering, vol. 32(3), 2006, pp. 176--192. Google ScholarDigital Library
- Linux Online, http://www.linux.org/, 2006.Google Scholar
- Manber, U., and Myers, G. W., "Suffix arrays: a new model for on-line string searches", SIAM Journal of Computing, vol. 22(5), 1993, pp. 935--948. Google ScholarDigital Library
- Mayrand J., Leblanc C., and Merlo E., "Experiment on the automatic detection of function clones in a software system using metrics", In Proc. Intl. Conference on Software Maintenance (ICSM '96), 1996, pp. 244--254. Google ScholarDigital Library
- McCreight E. M., "A space-economical suffix tree construction algorithm", Journal of the ACM, vol. 23(2), 1976, pp. 262--272. Google ScholarDigital Library
- XML-Based Variant Configuration Language- Technology for Reuse, http://xvcl.comp.nus.edu.sg, 2006.Google Scholar
Index Terms
- Efficient token based clone detection with flexible tokenization
Recommendations
Efficient token based clone detection with flexible tokenization
ESEC-FSE '07: Proceedings of the the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineeringCode clones are similar code fragments that occur at multiple locations in a software system. Detection of code clones provides useful information for maintenance, reengineering, program understanding and reuse. Several techniques have been proposed to ...
Survey on Software Clone Detection Research
ICMSS 2019: Proceedings of the 2019 3rd International Conference on Management Engineering, Software Engineering and Service SciencesIn order to improve the efficiency of software development, developers often copy-paste code. It is found that the clone code may affect the quality of the software system, especially the maintenance and comprehension of the software, so it is necessary ...
Intelligent token-based code clone detection system for large scale source code
RACS '19: Proceedings of the Conference on Research in Adaptive and Convergent SystemsA code clone refers to code fragments in the source code that are identical or similar to each other. Code clones lead difficulties in software maintenance, bug fixing, present poor design and increase the system size. Code clone detection techniques ...
Comments