Duplicate code detection using Clone Digger

1
Duplicate code detection using Clone Digger
  • Peter Bulychev
  • Lomonosov Moscow State University
  • CS department

2
Outline
  • Theoretical part
      • Clone detection problem in general
      • The theory behind the tool
  • Practical part
      • Clone Digger and the results of applying it
        to several Python open-source projects
  • Other ongoing projects

3
What is a software clone?
  • Two fragments of code form a clone if they are
    similar enough (according to a given measure of
    similarity)

    for i in range(5):
        for j in range(i):
            print(i + j)

    for k in range(6):
        for m in range(k):
            print(k + m)

4
Why is it important to detect code clones?
  • 5-20% of the code in software systems is cloned [1]
  • Why do programmers produce clones? [2]
      • Development strategy
      • Maintenance benefits
      • Overcoming underlying limitations
      • Cloning by accident
  • Why is the presence of code clones bad?
      • Errors in the original must be fixed in every
        clone
  [1] I.D. Baxter et al. Clone Detection Using
      Abstract Syntax Trees, 1998.
  [2] C.K. Roy and J.R. Cordy. A Survey on Software
      Clone Detection Research, 2007.

5
Our definition of clone
  • Different clone definitions can be classified
    according to the level of granularity
      • List of strings
      • Sequence of tokens
      • Abstract syntax trees (AST)
      • Semantic information
  • We work on the AST level
  • We consider two sequences of statements as a
    clone if one of them can be obtained from the
    other by replacing some subtrees
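
As a rough illustration of working at the AST level, the sketch below uses Python's standard ast module to test the simplest case of this definition: two fragments whose trees differ only in leaf subtrees (names and literals). The class EraseLeaves, the function shape, and the sample fragments are hypothetical and are not taken from Clone Digger, which also handles larger replaced subtrees.

```python
import ast


class EraseLeaves(ast.NodeTransformer):
    """Replace every identifier and literal with a fixed placeholder."""

    def visit_Name(self, node):
        return ast.Name(id="_", ctx=node.ctx)

    def visit_Constant(self, node):
        return ast.Constant(value=0)


def shape(source):
    """Return a textual dump of the AST with leaf details erased."""
    return ast.dump(EraseLeaves().visit(ast.parse(source)))


# Illustrative fragments: the first two differ only in names and a constant.
first = "for i in range(5):\n    for j in range(i):\n        print(i + j)\n"
second = "for k in range(6):\n    for m in range(k):\n        print(k + m)\n"
third = "for k in range(6):\n    print(k)\n"

first_second_match = shape(first) == shape(second)
first_third_match = shape(first) == shape(third)
```

Erasing all names is deliberately crude: it also hides the difference between calls to different functions, so a real tool needs a finer similarity measure.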

6
Example
    x = a
    y = f(x, i)
    print(y)

    x = a + b
    y = f(x, j)
    print(y)

[Figure: abstract syntax trees of the two fragments, each rooted at a block node]

7
The sketch of the algorithm
  • Partition similar statements into clusters
  • Find pairs of identical cluster sequences
  • Refine by examining identified code sequences for
    structural similarity

[Figure: sample statements (i = 0, i = 1, f(i), f(k), k = 1, k = 0) illustrating the steps above]

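
The first two steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: statements are clustered by exact AST shape (the sequence of node type names), a stand-in for the similarity-based clustering the slides describe, and the third step (structural refinement) is left out. The functions cluster_ids and identical_runs and the sample sequences are hypothetical.

```python
import ast


def cluster_ids(source, table):
    """Map each top-level statement to the id of its cluster.

    The cluster key is the tuple of AST node type names, so statements
    that differ only in identifiers and constants share a cluster.
    """
    ids = []
    for stmt in ast.parse(source).body:
        key = tuple(type(node).__name__ for node in ast.walk(stmt))
        ids.append(table.setdefault(key, len(table)))
    return ids


def identical_runs(ids_a, ids_b, length):
    """Return (start_a, start_b) pairs where `length` consecutive ids coincide."""
    windows = {}
    for i in range(len(ids_a) - length + 1):
        windows.setdefault(tuple(ids_a[i:i + length]), []).append(i)
    pairs = []
    for j in range(len(ids_b) - length + 1):
        for i in windows.get(tuple(ids_b[j:j + length]), []):
            pairs.append((i, j))
    return pairs


table = {}  # shared so that both sequences use the same cluster ids
ids_a = cluster_ids("i = 0\nf(i)\ni = 1\n", table)
ids_b = cluster_ids("k = 0\nf(k)\nk = 1\ng(k, 2)\n", table)
candidates = identical_runs(ids_a, ids_b, length=3)
```

Each candidate pair marks two statement sequences that would then be compared in detail during the refinement step.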
8
Main problems
  • How to compute similarity between two trees?
      • Use edit distance
  • How to compute similarity between a new tree and
    an existing tree cluster?
      • Comparing with each tree in the cluster is
        expensive
      • Compare the new tree with an average value
        stored for the cluster

9
Anti-unification
  • Anti-unifier of two trees is the most specific
    generalization that matches both of them

[Figure: two trees rooted at f and their anti-unifier, in which the differing subtrees are replaced by placeholders (?)]

10
Anti-unification features
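  • The anti-unifier of a set of trees keeps their
    common features: the common upper part
  • Anti-unification can be used to compute the edit
    distance between two trees
      • $\sigma_1$ and $\sigma_2$ are substitutions such that
        $E_0 \sigma_1 = E_1$ and $E_0 \sigma_2 = E_2$, where
        $E_0$ is the anti-unifier of $E_1$ and $E_2$
      • distance $= |\sigma_1| + |\sigma_2|$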
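
A minimal sketch of anti-unification and the distance derived from it, assuming trees are written as nested tuples (label, child, ...) with strings as leaves. The size of a substitution is taken here to be the total node count of the subtrees it substitutes, which is an assumption about the size measure; the helper names and the sample trees are hypothetical.

```python
def size(tree):
    """Count the nodes of a tree given as a string leaf or a (label, child, ...) tuple."""
    if isinstance(tree, str):
        return 1
    return 1 + sum(size(child) for child in tree[1:])


def anti_unify(a, b, pairs):
    """Return the most specific common generalization of trees a and b.

    Differing subtrees are replaced by placeholders "?0", "?1", ...;
    `pairs` maps each replaced (subtree_of_a, subtree_of_b) pair to its
    placeholder, so a repeated pair reuses the same placeholder.
    """
    if a == b:
        return a
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] and len(a) == len(b)):
        children = tuple(anti_unify(x, y, pairs) for x, y in zip(a[1:], b[1:]))
        return (a[0],) + children
    return pairs.setdefault((a, b), "?%d" % len(pairs))


def distance(a, b):
    """Return (anti-unifier, summed size of the two substitutions)."""
    pairs = {}
    general = anti_unify(a, b, pairs)
    cost = sum(size(x) + size(y) for x, y in pairs)
    return general, cost


tree_a = ("f", "x", ("/", "y", "2"))   # f(x, y / 2)
tree_b = ("f", "x", "z")               # f(x, z)
general, cost = distance(tree_a, tree_b)
```

Here the anti-unifier keeps the shared upper part f(x, ?0), and the cost counts the three nodes of y / 2 plus the single node z.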

11
Clone Digger
  • Is the first clone detection tool focused on
    Python (except Pylint)
  • Is provided under the GPL license
  • Writes information about the clones it finds to
    an HTML report in two-column format with the
    differences highlighted
  • http://clonedigger.sourceforge.net

12
Comparison with existing tools working with ASTs
  • CloneDR by Semantic Designs, I. Baxter, 1998
      • Hash functions on subtrees, some kind of edit
        distance
  • Asta by Microsoft Research, S. Evans et al.,
    2007
      • Subtree patterns (similar to anti-unification),
        hash functions on subtrees

13
Quick Start
  • easy_install clonedigger
  • clonedigger --recursive source_tree
  • firefox output.html
  • Additional parameters such as thresholds can
    also be set (use --help to learn more)

14
Running on real-life open-source projects
    BioPython   12.19%
    NLTK        11.85%
    Zope        27.41%
    Plone       29.89%
  • These numbers mean nothing, except that every
    large project has clones and they should be
    detected

15
What to do with found clones?
  • Remove clones by refactoring: the Extract Method
    and Pull Up Method refactorings can be used
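
As a small illustration of Extract Method, two cloned nested loops that differ only in one constant can be replaced by a single function whose parameter stands for the differing subtree. The function pair_sums and the loop body are hypothetical, chosen only to show the shape of the refactoring.

```python
def pair_sums(limit):
    """Shared body extracted from the two clones; `limit` is the part that differed."""
    sums = []
    for i in range(limit):
        for j in range(i):
            sums.append(i + j)
    return sums


# Each former clone becomes a call with its own constant.
first_result = pair_sums(5)
second_result = pair_sums(6)
```

A bug fixed in pair_sums is now fixed for every caller, which addresses the maintenance problem that clones create.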
  • Detect library candidates
  • Search for bugs

16
  • Any questions?