Title: Duplicate code detection using anti-unification
1. Duplicate code detection using anti-unification
- Peter Bulychev
- Moscow State University
- Marius Minea
- Institute eAustria, Timisoara
2. Outline
- Code duplication problem
- Our anti-unification based algorithm
- Comparison with existing methods
- Clone Digger, the tool for finding software clones
3. What is a software clone?
- Two fragments of code form a clone if they are similar enough (according to a given measure of similarity)
    for (int i = 0; i < 5; i++)
        for (j = 0; j < i; j++)
            cout << i + j;

    for (int k = 0; k < 6; k++)
        for (m = 0; m < k; m++)
            cout << k + m;
4. Why is it important to detect code clones?
- 5-20% of the code in software systems consists of clones [1]
- Why do programmers produce clones? [2]
- Development strategy
- Maintenance benefits
- Overcoming underlying limitations
- Cloning by accident
- Why is the presence of code clones bad?
- Errors in the original must be fixed in every clone
- [1] I.D. Baxter, et al. Clone Detection Using Abstract Syntax Trees, 1998.
- [2] C.K. Roy and J.R. Cordy. A Survey on Software Clone Detection Research, 2007.
5. Our clone definition
- Different clone definitions can be classified according to the level of granularity
- List of strings
- Sequence of tokens
- Abstract syntax trees (AST)
- Semantic information
- We work on the AST level
- We consider two sequences of statements a clone if one of them can be obtained from the other by replacing some subtrees
6. Example
    x = a;      y = f(x, i);  cout << y;
    x = a + b;  y = f(x, j);  cout << y;
[Figure: abstract syntax trees of the two fragments; they differ only in the subtrees a vs. a + b and i vs. j]
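The idea of comparing fragments at the AST level can be tried out with Python's standard `ast` module. The sketch below uses Python transliterations of the two fragments (the transliteration, including the `a + b` expression, is an assumption made for illustration) and shows that both have the same statement skeleton while differing in one right-hand-side subtree.

```python
import ast

# Python transliterations of the two fragments (assumed for illustration).
FIRST = "x = a\ny = f(x, i)\nprint(y)\n"
SECOND = "x = a + b\ny = f(x, j)\nprint(y)\n"


def statement_kinds(source):
    """Return the AST node type of each top-level statement."""
    return [type(stmt).__name__ for stmt in ast.parse(source).body]


def assigned_value_kinds(source):
    """Return the node type of the right-hand side of each assignment."""
    return [type(stmt.value).__name__
            for stmt in ast.parse(source).body
            if isinstance(stmt, ast.Assign)]


print(statement_kinds(FIRST))        # statement skeleton of the first fragment
print(statement_kinds(SECOND))       # statement skeleton of the second fragment
print(assigned_value_kinds(FIRST))   # right-hand sides of its assignments
print(assigned_value_kinds(SECOND))  # one right-hand side is a different subtree
```

The skeletons match, so one fragment can be obtained from the other by swapping subtrees, which is the clone definition used here.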
7. Automatic clone detection tool
- Detect occurrences of similar code
- Applications
- Refactoring into new functions or base classes
- Number of clones can be used as a measure of code quality
- Several tools exist [1]
- [1] S. Bellon, et al. Comparison and Evaluation of Clone Detection Tools, 2007.
8. The sketch of the algorithm
- Partition similar statements into clusters
- Find pairs of identical cluster sequences
- Refine by examining identified code sequences for structural similarity
[Figure: example statement sequences (i = 0, f(i), k = 0, f(k), ...) partitioned into clusters]
9. Main problems
- How to compute similarity between two trees?
- Use editing distance
- How to compute similarity between a new tree and an existing tree cluster?
- Comparing with each tree in the cluster is expensive
- Compare the new tree with an average value stored for the cluster
10. Anti-unification
- The anti-unifier of two trees is the most specific generalization that matches both
[Figure: two trees rooted at f and their anti-unifier, in which the differing subtrees are replaced by placeholders (?)]
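A minimal sketch of anti-unification on small trees, assuming trees are nested tuples `(label, child, ...)` with plain strings as leaves. The representation, the function name `anti_unify` and the placeholder names `?0`, `?1`, ... are illustrative choices, not taken from Clone Digger.

```python
def anti_unify(t1, t2, subst1, subst2):
    """Return the most specific generalization of t1 and t2.

    Differing subtrees are replaced by placeholders "?0", "?1", ...
    subst1 / subst2 record what each placeholder stands for in t1 / t2.
    """
    if t1 == t2:
        return t1
    both_nodes = isinstance(t1, tuple) and isinstance(t2, tuple)
    if both_nodes and t1[0] == t2[0] and len(t1) == len(t2):
        # Same label and arity: keep the node, generalize the children.
        children = [anti_unify(a, b, subst1, subst2)
                    for a, b in zip(t1[1:], t2[1:])]
        return (t1[0], *children)
    # Reuse a placeholder if this exact pair of subtrees was seen before;
    # this keeps the result as specific as possible.
    for var in subst1:
        if subst1[var] == t1 and subst2[var] == t2:
            return var
    var = f"?{len(subst1)}"
    subst1[var] = t1
    subst2[var] = t2
    return var


tree1 = ("f", "x", ("/", "y", "2"))
tree2 = ("f", "x", ("/", "z", "2"))
s1, s2 = {}, {}
pattern = anti_unify(tree1, tree2, s1, s2)
print(pattern, s1, s2)
```

Applying `s1` to the pattern gives back `tree1`, and applying `s2` gives back `tree2`.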
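11. Anti-unification features
- The anti-unifier of a set of trees keeps their common features: tree structure and common labels
- Anti-unification can be used to compute the editing distance between two trees
- $\sigma_1$ and $\sigma_2$ are substitutions such that $E_0\sigma_1 = E_1$ and $E_0\sigma_2 = E_2$, where $E_0$ is the anti-unifier of $E_1$ and $E_2$
- distance $= |\sigma_1| + |\sigma_2|$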
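A small worked instance may help. It reads the distance as the sum of the sizes of the two substitutions and assumes that the size of a substitution is the number of tree nodes it substitutes; both the example trees and that size measure are assumptions for illustration.

```latex
\begin{align*}
E_1 &= f(x,\; y/2), \qquad E_2 = f(x,\; z/2) \\
E_0 &= f(x,\; ?_1/2) \\
\sigma_1 &= \{\,?_1 \mapsto y\,\}, \qquad \sigma_2 = \{\,?_1 \mapsto z\,\} \\
\text{distance} &= |\sigma_1| + |\sigma_2| = 1 + 1 = 2
\end{align*}
```

Identical trees need empty substitutions, so their distance is 0.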
12. The first phase: building clusters of statements
- We use a simple one-pass clustering algorithm
    for each tree in statement_trees:
        bestcluster = argmin over clusters of cluster.add_cost(tree)
        if bestcluster.add_cost(tree) < threshold:
            bestcluster.append(tree)
        else:
            clusters.append(new Cluster(tree))
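The one-pass loop can be sketched in runnable form as follows. The best cluster is taken to be the one with the lowest add cost, since a tree is only added when that cost is below the threshold. `ToyCluster` and its mismatch-counting cost are hypothetical stand-ins so the loop can run on its own; they are not the anti-unification cost used by the tool.

```python
class ToyCluster:
    """Stand-in cluster: keeps its first member as the representative."""

    def __init__(self, tree):
        self.representative = tree
        self.members = [tree]

    def add_cost(self, tree):
        # Toy cost: number of differing positions between flat token
        # tuples, plus the difference in length.
        rep = self.representative
        mismatches = sum(a != b for a, b in zip(rep, tree))
        return mismatches + abs(len(rep) - len(tree))

    def append(self, tree):
        self.members.append(tree)


def build_clusters(statement_trees, threshold):
    """Assign each tree to the cheapest cluster, or open a new one."""
    clusters = []
    for tree in statement_trees:
        best = min(clusters, key=lambda c: c.add_cost(tree), default=None)
        if best is not None and best.add_cost(tree) < threshold:
            best.append(tree)
        else:
            clusters.append(ToyCluster(tree))
    return clusters


statements = [
    ("i", "=", "0"),
    ("k", "=", "0"),
    ("f", "(", "i", ")"),
    ("f", "(", "k", ")"),
    ("k", "=", "3"),
]
clusters = build_clusters(statements, threshold=2)
print([len(c.members) for c in clusters])
```

Because the pass is greedy, the result depends on the order in which statements arrive; that is the price of visiting each statement only once.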
13. Finding the best cluster
- What add_cost function should we use? The cost value should be high in these cases:
- The cluster is large and joining the new tree changes the cluster's average value significantly
- The average value of the new cluster is far away from the tree
- add_cost = n * (au - au') + (tree - au')
- n: the old size of the cluster
- au: the old anti-unifier of the cluster
- au': the new anti-unifier of the cluster
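One possible reading of this cost, sketched under stated assumptions: the difference between two trees is measured as the number of nodes that the new anti-unifier replaces by a hole, the first term is weighted by the old cluster size n, and the second term is the part of the new tree that falls into holes. The tuple tree representation, the single `HOLE` marker and the `Cluster` class are illustrative, not the tool's actual code.

```python
HOLE = "?"  # assumed marker for a generalized subtree


def size(tree):
    """Number of nodes in a tuple tree (label, child, ...) or leaf."""
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1


def anti_unify(t1, t2, diffs):
    """Generalize t1 and t2; collect the replaced subtree pairs in diffs."""
    if t1 == t2:
        return t1
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        return (t1[0], *(anti_unify(a, b, diffs)
                         for a, b in zip(t1[1:], t2[1:])))
    diffs.append((t1, t2))
    return HOLE


class Cluster:
    def __init__(self, tree):
        self.n = 1       # number of trees in the cluster
        self.au = tree   # anti-unifier of all trees in the cluster

    def add_cost(self, tree):
        diffs = []
        anti_unify(self.au, tree, diffs)
        # How much of the old anti-unifier is lost (existing holes are free).
        moved = sum(size(old) for old, _ in diffs if old != HOLE)
        # How far the new tree is from the new anti-unifier.
        gap = sum(size(new) for _, new in diffs)
        return self.n * moved + gap

    def append(self, tree):
        self.au = anti_unify(self.au, tree, [])
        self.n += 1


cluster = Cluster(("=", "i", "0"))
first_cost = cluster.add_cost(("=", "k", "0"))
cluster.append(("=", "k", "0"))
cheap_cost = cluster.add_cost(("=", "n", "0"))
costly = cluster.add_cost(("=", "n", ("+", "a", "b")))
print(first_cost, cluster.au, cheap_cost, costly)
```

Storing only the anti-unifier and the size means a new tree is compared against one summary tree instead of every member of the cluster.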
14. Increasing efficiency
- To avoid comparing each AST with every other AST, we use hashing: the upper parts of the trees are hashed.
[Figure: example trees whose upper parts are hashed]
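A rough sketch of bucketing trees by their upper part, so that only trees in the same bucket need to be compared. The cut-off depth, the `"*"` marker for the discarded lower part and the function names are assumptions for illustration; the hashing itself is done by using the truncated tree as a dictionary key.

```python
from collections import defaultdict


def upper_part(tree, depth):
    """Keep the labels of the top `depth` levels; cut everything below."""
    if depth == 0:
        return "*"  # everything below the cut is ignored
    if isinstance(tree, tuple):
        return (tree[0], *(upper_part(child, depth - 1)
                           for child in tree[1:]))
    return tree


def bucket_by_upper_part(trees, depth=1):
    """Group trees whose upper parts are identical."""
    buckets = defaultdict(list)
    for tree in trees:
        buckets[upper_part(tree, depth)].append(tree)
    return dict(buckets)


trees = [
    ("=", "a", ("+", "x", "0")),
    ("=", "b", "c"),
    ("call", "f", "x"),
]
buckets = bucket_by_upper_part(trees, depth=1)
for key, members in buckets.items():
    print(key, len(members))
```

A larger depth gives smaller buckets and fewer comparisons, at the risk of separating trees that differ only near the top.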
15. Why is this not enough?
- By considering pairs from the same cluster only individually, we miss sequences of statements
- We should find all pairs of identical cluster sequences and then check them for similarity
    void f()              // cluster #1
        cin >> i;         // cluster #2
        int j = i + 100;  // cluster #3
        cout << i << j;   // cluster #4

    void f(int j)         // cluster #5
        cin >> i;         // cluster #2
        int j = i + 100;  // cluster #3
        cout << j;        // cluster #6
16. The second phase: finding all common subsequences
- After the first phase, each statement node is marked with the ID of its cluster
- We want to find all pairs of similar sequences of cluster IDs
- We do it using suffix trees
- Only long common subsequences are considered
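To make the goal of this phase concrete, the sketch below finds repeated runs of cluster IDs with a plain quadratic scan. This is a deliberately simpler stand-in for the suffix tree, which produces the same kind of result much faster on long sequences; the function name and the `min_len` parameter are hypothetical.

```python
def common_runs(ids, min_len):
    """Return (start_a, start_b, length) for repeated runs of cluster IDs.

    Only non-overlapping runs of at least min_len statements are reported.
    """
    n = len(ids)
    runs = []
    for a in range(n):
        for b in range(a + 1, n):
            # Skip pairs that merely continue a run that started earlier.
            if a > 0 and ids[a - 1] == ids[b - 1]:
                continue
            length = 0
            while (b + length < n and a + length < b
                   and ids[a + length] == ids[b + length]):
                length += 1
            if length >= min_len:
                runs.append((a, b, length))
    return runs


# Cluster IDs of eight consecutive statements, two of which repeat as a pair.
cluster_ids = [1, 2, 3, 4, 5, 2, 3, 6]
runs = common_runs(cluster_ids, min_len=2)
print(runs)
```

Each reported pair of runs is only a candidate; the statements behind it still have to be checked for structural similarity.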
17. The third phase: finding similar sequences of statements
[Figure: two statement sequences compared for structural similarity: i = 0; k = 3; f(i, k) and k = 0; n = 3; f(k, n)]
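A minimal sketch of the final check, assuming that two candidate sequences are accepted when the total distance between their aligned statements stays within a threshold. The distance here is the total size of the subtrees an anti-unifier would replace, computed directly and counting every occurrence; the tuple trees, the `call` label and the threshold value are illustrative assumptions.

```python
def size(tree):
    """Number of nodes in a tuple tree (label, child, ...) or leaf."""
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1


def distance(t1, t2):
    """Total size of the subtrees that an anti-unifier would replace."""
    if t1 == t2:
        return 0
    same_shape = (isinstance(t1, tuple) and isinstance(t2, tuple)
                  and t1[0] == t2[0] and len(t1) == len(t2))
    if same_shape:
        return sum(distance(a, b) for a, b in zip(t1[1:], t2[1:]))
    return size(t1) + size(t2)


def sequence_distance(seq1, seq2):
    """Sum of statement-by-statement distances of two aligned sequences."""
    return sum(distance(a, b) for a, b in zip(seq1, seq2))


def sequences_are_clones(seq1, seq2, threshold):
    """Accept equally long sequences whose total distance is small enough."""
    return (len(seq1) == len(seq2)
            and sequence_distance(seq1, seq2) <= threshold)


seq_a = [("=", "i", "0"), ("=", "k", "3"), ("call", "f", "i", "k")]
seq_b = [("=", "k", "0"), ("=", "n", "3"), ("call", "f", "k", "n")]
seq_c = [("=", "i", "0"), ("=", "k", "3"), ("call", "g", ("+", "i", "k"))]

print(sequence_distance(seq_a, seq_b), sequences_are_clones(seq_a, seq_b, 8))
print(sequence_distance(seq_a, seq_c), sequences_are_clones(seq_a, seq_c, 8))
```

The first pair differs only by renamed variables and passes; the second pair has a structurally different call and is rejected.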
18. Comparison with existing AST methods
- W. Yang, 1991
- Editing distance between two trees
- I. Baxter, et al., 1998
- Hash functions on subtrees, some kind of editing distance
- V. Wahler, 2004
- Feature vectors comparison
- S. Evans, et al., 2007
- Subtree patterns (similar to anti-unification), hash functions on subtrees
19. Clone Digger
- The tool is written in Python
- Supported languages
- Python (ASTs are built using the standard package compiler)
- Java 1.5 (parser generator ANTLR)
- The information on found clones is written to HTML with highlighting of differences
- Its application to the open-source projects NLTK and BioPython showed that 12% of their code is clones
20. Clone Digger
- Provided under the GPL license and can be downloaded from the site
- http://clonedigger.sourceforge.net