skip to main content
10.1145/1295014.1295029acmconferencesArticle/Chapter ViewAbstractPublication PagesfseConference Proceedingsconference-collections
Article

Efficient token based clone detection with flexible tokenization

Published:03 September 2007Publication History

ABSTRACT

Code clones are similar code fragments that occur at multiple locations in a software system. Detection of code clones provides useful information for maintenance, reengineering, program understanding and reuse. Several techniques have been proposed to detect code clones. These techniques differ in the code representation used for analysis of clones, ranging from plain text to parse trees and program dependence graphs. Clone detection based on lexical tokens involves minimal code transformation and gives good results, but is computationally expensive because of the large number of tokens that need to be compared. We explored string algorithms to find suitable data structures and algorithms for efficient token based clone detection and implemented them in our tool Repeated Tokens Finder (RTF). Instead of using suffix tree for string matching, we use more memory efficient suffix array. RTF incorporates a suffix array based linear time algorithm to detect string matches. It also provides a simple and customizable tokenization mechanism. Initial analysis and experiments show that our clone detection is simple, scalable, and performs better than the previous well-known tools.

References

  1. Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E., "Replacing suffix trees with suffix arrays", Journal of Discrete Algorithms, vol. 2(1), 2004, pp. 53--86. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Baker, B. S., "A Program for Identifying Duplicated Code", Computing Science and Statistics, vol. 24, 1992, pp. 49--57.Google ScholarGoogle Scholar
  3. Basit, H. A., Rajapakse, D. C., and Jarzabek, S. "Beyond Templates: a Study of Clones in the STL and Some General Implications," In Proc. Int. Conf. on Software Engineering, (ICSE'05), St. Louis, USA, May 2005, pp. 451--459. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Basit, H. A., and Jarzabek, S. "Detecting Higher-level Similarity Patterns in Programs", In Proc. European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE'05), ACM Press, Lisbon, Portugal, September 2005, pp. 156--165. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Baxter, I. D., Yahin, A., Moura, L., Anna, M. S. and Bier, L., "Clone detection using abstract syntax trees," In Proc. Intl. Conference on Software Maintenance (ICSM '98), 1998, pp. 368--377. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Bellon, S., "Vergleich von Techniken zur Erkennung duplizierten Quellcodes", Master's Thesis, Institut fur Softwaretechnologie, Universitat Stuttgart, Stuttgart, Germany, 2002.Google ScholarGoogle Scholar
  7. Burd, E., and Bailey, J., "Evaluating Clone Detection Tools for Use during Preventative Maintenance", In Proc. 2nd IEEE Intl. Workshop on Source Code Analysis and Manipulation (SCAM'02), 2002, pp. 36--43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Ducasse, S, Rieger, M., and Demeyer, S., "A language independent approach for detecting duplicated code," In Proc. Intl. Conference on Software Maintenance (ICSM '99), 1999. pp. 109--118. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Jarzabek, S. and Shubiao, L., "Eliminating Redundancies with a 'Composition with Adaptation' Meta-programming Technique", In Proc. European Soft. Eng. Conf. and ACM SIGSOFT Symp. on the Foundations of Soft. Eng. (ESEC-FSE'03), Helsinki, Sept. 2003, pp. 237--246. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Johnson, J. H., "Identifying redundancy in source code using fingerprints", In Proc. Conf. of the Centre for Advanced Studies on Collaborative research: software engineering (CASCON'93), 1993, pp 171--183. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Kamiya, T., Kusumoto, S, and Inoue, K., "CCFinder: A multi-linguistic token based code clone detection system for large scale source code," IEEE Trans. Software Engineering, vol. 28(7), July 2002, pp. 654 -- 670. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Kolpakov, R., Bana, G., and Kucherov, G., "mreps: efficient and flexible detection of tandem repeats in DNA", Nucleic Acids Research, vol. 31(13), Oxford University Press, 2003, 3672--3678.Google ScholarGoogle Scholar
  13. Komondoor, R., and Horwitz, S., "Using slicing to identify duplication in source code," In Proc. 8th International Symposium on Static Analysis, 2001, pp. 40--56. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Krinke, J., "Identifying Similar Code with Program Dependence Graphs", In Proc. 8th Working Conference on Reverse Engineering, Stuttgart, Germany, October 2001, pp. 301--309. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Koschke, R., Falke, R., and Frenzel, P. Clone Detection Using Abstract Syntax Suffix Trees. In Proceedings of the 13th Working Conference on Reverse Engineering (WCRE), pages 253--262, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Lague, B., Proulx, D., Merlo, E., Mayrand, J., and Hudepohl, J., "Assessing the benefits of incorporating function clone detection in a development process," Experience Report, Intl. Conference on Software Maintenance (ICSM '97), 1997, pp. 314--321. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Li, Z., Lu, S., Myagmar, S., Zhou, Y., CP-Miner: Finding copy-paste and related bugs in large-scale software code, IEEE Transactions on Software Engineering, vol. 32(3), 2006, pp. 176--192. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Linux Online, http://www.linux.org/, 2006.Google ScholarGoogle Scholar
  19. Manber, U., and Myers, G. W., "Suffix arrays: a new model for on-line string searches", SIAM Journal of Computing, vol. 22(5), 1993, pp. 935--948. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Mayrand J., Leblanc C., and Merlo E., "Experiment on the automatic detection of function clones in a software system using metrics", In Proc. Intl. Conference on Software Maintenance (ICSM '96), 1996, pp. 244--254. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. McCreight E. M., "A space-economical suffix tree construction algorithm", Journal of the ACM, vol. 23(2), 1976, pp. 262--272. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. XML-Based Variant Configuration Language- Technology for Reuse, http://xvcl.comp.nus.edu.sg, 2006.Google ScholarGoogle Scholar

Index Terms

  1. Efficient token based clone detection with flexible tokenization

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      ESEC-FSE companion '07: The 6th Joint Meeting on European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering: companion papers
      September 2007
      189 pages
      ISBN:9781595938121
      DOI:10.1145/1295014

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 3 September 2007

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate112of543submissions,21%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader