Duplicate code detection using anti-unification

Provided by: syrcose

1
Duplicate code detection using anti-unification

Peter Bulychev
Moscow State University
Marius Minea
Institute eAustria, Timisoara

2
Outline

Code duplication problem
Our anti-unification based algorithm
Comparison with existing methods
Clone Digger, the tool for finding software clones

3
What is a software clone?

Two fragments of code form a clone if they are
similar enough (according to a given measure of
similarity)

for (int i = 0; i < 5; i++)
    for (j = 0; j < i; j++)
        cout << i + j;

for (int k = 0; k < 6; k++)
    for (m = 0; m < k; m++)
        cout << k + m;

4
Why is it important to detect code clones?

5-20% of the code in software systems is cloned [1]
Why do programmers produce clones? [2]
Development strategy
Maintenance benefits
Overcoming underlying limitations
Cloning by accident
Why is the presence of code clones bad?
Errors in the original must be fixed in every clone

[1] I. D. Baxter et al. Clone Detection Using Abstract Syntax Trees, 1998.
[2] C. K. Roy and J. R. Cordy. A Survey on Software Clone Detection Research, 2007.

5
Our clone definition

Different clone definitions can be classified
according to the level of granularity
List of strings
Sequence of tokens
Abstract syntax trees (AST)
Semantic information
We work on the AST level
We consider two sequences of statements a clone
if one of them can be obtained from the other by
replacing some subtrees

6
Example
x = a; y = f(x, i); cout << y;
x = a + b; y = f(x, j); cout << y;

[Figure: ASTs of the two statement sequences; they differ only in the subtrees a vs. a + b and i vs. j]

7
Automatic clone detection tool

Detect occurrences of similar code
Applications
Refactoring into new functions or base classes
Number of clones can be used as a measure of code
quality
Several tools exist [1]

[1] S. Bellon et al. Comparison and Evaluation of Clone Detection Tools, 2007.

8
The sketch of the algorithm

Partition similar statements into clusters
Find pairs of identical cluster sequences
Refine by examining identified code sequences for
structural similarity

[Figure: example statements such as i = 0, f(i), k = 0 and f(k) grouped into clusters]

9
Main problems

How to compute similarity between two trees?
Use editing distance
How to compute similarity between a new tree and
an existing tree cluster?
Comparing with each tree in the cluster is expensive
Compare new tree with an average value stored for
a cluster

10
Anti-unification

The anti-unifier of two trees is the most specific
generalization that matches both
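
As an illustration of the definition above, the following Python sketch computes an anti-unifier of two trees. The tuple-based tree encoding, the function name anti_unify and the placeholder names ?1, ?2, ... are assumptions made for this example; they are not taken from the slides or from Clone Digger.

```python
from itertools import count


def anti_unify(t1, t2, _fresh=None, _subs=None):
    """Return (pattern, subs) for two trees.

    A tree is a leaf string or a tuple (label, child, ...).
    subs maps each placeholder to the pair of subtrees it stands for.
    """
    if _fresh is None:
        _fresh, _subs = count(1), {}
    if t1 == t2:
        return t1, _subs
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        # Same label and arity: keep the node, generalize the children.
        children = [anti_unify(a, b, _fresh, _subs)[0]
                    for a, b in zip(t1[1:], t2[1:])]
        return (t1[0], *children), _subs
    # The subtrees differ: reuse the placeholder for this pair if there is one,
    # so that the result stays as specific as possible.
    for name, pair in _subs.items():
        if pair == (t1, t2):
            return name, _subs
    name = f"?{next(_fresh)}"
    _subs[name] = (t1, t2)
    return name, _subs


left = ("f", ("+", "x", "y"), "x")                # f(x + y, x)
right = ("f", ("+", "x", ("/", "z", "2")), "x")   # f(x + z/2, x)
pattern, subs = anti_unify(left, right)
# pattern == ("f", ("+", "x", "?1"), "x")
# subs == {"?1": ("y", ("/", "z", "2"))}
```

The common structure and labels survive in the pattern, and each placeholder records the two subtrees it replaced.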

[Figure: two example trees rooted at f and their anti-unifier, in which the differing subtrees are replaced by placeholders (?)]

11
Anti-unification features

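The anti-unifier of a set of trees keeps their common
features: tree structure and common labels
Anti-unification can be used to compute the editing
distance between two trees
$E_0$ is the anti-unifier of $E_1$ and $E_2$; $\sigma_1$ and $\sigma_2$ are substitutions such that $E_0\sigma_1 = E_1 \wedge E_0\sigma_2 = E_2$
$\mathrm{distance} = |\sigma_1| + |\sigma_2|$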
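
A small worked example may help here. The two trees are made up for illustration, and the size of a substitution is assumed to be the number of nodes in the subtrees it inserts; the slides do not say how the size is measured.

```latex
\begin{align*}
E_1 &= f(x + y,\ x), \qquad E_2 = f(x + z/2,\ x) \\
E_0 &= f(x + {?_1},\ x) \\
\sigma_1 &= \{{?_1} \mapsto y\}, \qquad \sigma_2 = \{{?_1} \mapsto z/2\} \\
E_0\sigma_1 &= E_1, \qquad E_0\sigma_2 = E_2 \\
\mathrm{distance} &= |\sigma_1| + |\sigma_2| = 1 + 3 = 4
\end{align*}
```

Here $y$ is a single node, while $z/2$ has three nodes (the division and its two operands).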

12
The first phase: building clusters of statements

We use a simple one-pass clustering algorithm
    for tree in statement_trees:
        # add_cost is a cost, so the best cluster is the one that minimizes it
        bestcluster = min(clusters, key=lambda c: c.add_cost(tree), default=None)
        if bestcluster is not None and bestcluster.add_cost(tree) < threshold:
            bestcluster.append(tree)
        else:
            clusters.append(Cluster(tree))

13
Finding the best cluster

What add_cost function should we use? The cost value
should be high in these cases
If the cluster is large and joining the new tree
changes the cluster's average value significantly
If the average value of the new cluster is far
away from the tree
$\mathrm{add\_cost} = n \cdot |au' - au| + |tree - au'|$
$n$: the old size of the cluster
$au$: the old anti-unifier of the cluster
$au'$: the new anti-unifier of the cluster
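
The cost function and the one-pass clustering can be sketched together in Python. This sketch reads the cost as n times |au' - au| plus |tree - au'| and measures each |...| term as the number of nodes that the new anti-unifier au' replaces with a placeholder. That measure, the tuple tree encoding, the single anonymous placeholder "?" and the names size, generalize and build_clusters are all assumptions for illustration, not Clone Digger's implementation.

```python
HOLE = "?"  # anonymous placeholder (a simplification: placeholders are not named)


def size(tree):
    """Number of nodes in a tree given as a leaf string or (label, child, ...)."""
    if isinstance(tree, tuple):
        return 1 + sum(size(c) for c in tree[1:])
    return 1


def generalize(a, b):
    """Return (pattern, cost_a, cost_b) for a simplified anti-unifier of a and b.

    cost_a and cost_b are the total sizes of the subtrees of a and b that the
    pattern replaces with HOLE; a subtree that already is HOLE costs nothing.
    """
    if a == b:
        return a, 0, 0
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] and len(a) == len(b)):
        parts = [generalize(x, y) for x, y in zip(a[1:], b[1:])]
        pattern = (a[0], *(p for p, _, _ in parts))
        return (pattern,
                sum(ca for _, ca, _ in parts),
                sum(cb for _, _, cb in parts))
    return HOLE, (0 if a == HOLE else size(a)), (0 if b == HOLE else size(b))


class Cluster:
    def __init__(self, tree):
        self.au = tree  # anti-unifier of all member trees so far
        self.n = 1      # number of member trees

    def add_cost(self, tree):
        # n * |au' - au| + |tree - au'|
        _, d_au, d_tree = generalize(self.au, tree)
        return self.n * d_au + d_tree

    def append(self, tree):
        self.au, _, _ = generalize(self.au, tree)
        self.n += 1


def build_clusters(statement_trees, threshold):
    clusters = []
    for tree in statement_trees:
        # The cheapest cluster to extend is the best candidate.
        best = min(clusters, key=lambda c: c.add_cost(tree), default=None)
        if best is not None and best.add_cost(tree) < threshold:
            best.append(tree)
        else:
            clusters.append(Cluster(tree))
    return clusters


trees = [("=", "i", "0"), ("call", "f", "i"),
         ("=", "k", "0"), ("call", "f", "k")]
clusters = build_clusters(trees, threshold=3)
# Two clusters result, with anti-unifiers ("=", "?", "0") and ("call", "f", "?").
```

Because each cluster stores only its anti-unifier and its size, a new tree is compared against one pattern per cluster rather than against every member.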

14
Improving efficiency

In order not to compare each AST with each other
AST we use hashing. The upper parts of the trees
are hashed.
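
A minimal Python sketch of this idea follows. It truncates every tree below a fixed depth and groups trees by the truncated top, so that only trees in the same group need to be compared in detail. The depth parameter, the "*" marker for cut-off subtrees and the function names are assumptions; the slides do not say how much of each tree is hashed.

```python
from collections import defaultdict


def top(tree, depth):
    """Return the upper `depth` levels of a tree; everything below becomes '*'."""
    if depth == 0:
        return "*"
    if not isinstance(tree, tuple):
        return tree
    return (tree[0], *(top(child, depth - 1) for child in tree[1:]))


def bucket_by_top(trees, depth=1):
    """Group trees whose upper parts coincide.

    The truncated tree is used directly as a dictionary key, so Python's
    built-in tuple hashing does the hashing.
    """
    buckets = defaultdict(list)
    for tree in trees:
        buckets[top(tree, depth)].append(tree)
    return buckets


trees = [("=", "x", "a"),
         ("=", "x", ("+", "a", "b")),
         ("call", "f", "x")]
buckets = bucket_by_top(trees, depth=1)
# The two assignments share the top ("=", "*", "*"); the call gets its own bucket.
```

A larger depth gives smaller buckets and fewer comparisons, at the risk of separating statements that differ only near the root.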

[Figure: example trees; only their upper parts are hashed]

15
Why is this not enough?

By considering only individual pairs of statements from the
same cluster, we miss sequences of statements
We should find all pairs of identical cluster
sequences and then check them for similarity

void f() {              // cluster 1
    cin >> i;           // cluster 2
    int j = i + 100;    // cluster 3
    cout << i << j;     // cluster 4
}

void f(int j) {         // cluster 5
    cin >> i;           // cluster 2
    int j = i + 100;    // cluster 3
    cout << j;          // cluster 6
}

16
The second phase: finding all common subsequences

After the first phase each statement node is
marked with the ID of its cluster
We want to find all pairs of similar sequences of
cluster IDs
We do it using suffix trees
Only long common subsequences are considered
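
To make this step concrete, here is a Python sketch that finds common runs of cluster IDs in two sequences. It deliberately swaps the suffix tree mentioned above for a simple quadratic dynamic-programming table, which is easier to read but slower on large inputs. The function name common_runs and the min_len parameter are assumptions for illustration.

```python
def common_runs(a, b, min_len):
    """Return (start_a, start_b, length) for every maximal run of equal
    cluster IDs shared by a and b that is at least min_len long."""
    results = []
    # prev[j] = length of the common run ending at a[i - 2] and b[j - 1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                # Report the run only where it cannot be extended further.
                at_end = i == len(a) or j == len(b) or a[i] != b[j]
                if at_end and cur[j] >= min_len:
                    results.append((i - cur[j], j - cur[j], cur[j]))
        prev = cur
    return results


# Cluster IDs of the statements in two functions.
first = [1, 2, 3, 4]
second = [5, 2, 3, 6]
runs = common_runs(first, second, min_len=2)
# runs == [(1, 1, 2)]: both functions contain the sequence 2, 3 at index 1
```

Each reported run is only a candidate; the statements behind it still have to be checked for structural similarity.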

17
The third phase: finding similar sequences of statements

[Figure: the statement sequences i = 0; k = 3; f(i, k) and k = 0; n = 3; f(k, n) are compared for structural similarity]

18
Comparison with existing AST methods

W. Yang, 1991
Editing distance between two trees
I. Baxter et al., 1998
Hash functions on subtrees, some kind of editing distance
V. Wahler, 2004
Feature vectors comparison
S. Evans et al., 2007
Subtree patterns (similar to anti-unification), hash functions on subtrees

19
Clone Digger

The tool is written in Python
Supported languages
Python (ASTs are built using the standard compiler package)
Java 1.5 (ANTLR parser generator)
Information on the clones found is written to
HTML with the differences highlighted
Its application to the open-source projects NLTK and
BioPython showed that 12% of their code consists of clones

20
Clone Digger