Duplicate code detection using anti-unification

Provided by: syrcose

1
Duplicate code detection using anti-unification
  • Peter Bulychev
  • Moscow State University
  • Marius Minea
  • Institute eAustria, Timisoara

2
Outline
  • Code duplication problem
  • Our anti-unification based algorithm
  • Comparison with existing methods
  • Clone Digger, the tool for finding software clones

3
What is a software clone?
  • Two fragments of code form a clone if they are
    similar enough (according to a given measure of
    similarity)

    for (int i = 0; i < 5; i++)
        for (j = 0; j < i; j++)
            cout << i + j;

    for (int k = 0; k < 6; k++)
        for (m = 0; m < k; m++)
            cout << k + m;

4
Why is it important to detect code clones?
  • 5-20% of the code in software systems consists of
    clones [1]
  • Why do programmers produce clones? [2]
      • Development strategy
      • Maintenance benefits
      • Overcoming underlying limitations
      • Cloning by accident
  • Why is the presence of code clones bad?
      • Errors in the original must be fixed in every
        clone
  • [1] I.D. Baxter et al. Clone Detection Using
    Abstract Syntax Trees, 1998.
  • [2] C.K. Roy and J.R. Cordy. A Survey on Software
    Clone Detection Research, 2007.

5
Our clone definition
  • Different clone definitions can be classified
    according to the level of granularity
      • List of strings
      • Sequence of tokens
      • Abstract syntax trees (AST)
      • Semantic information
  • We work on the AST level
  • We consider two sequences of statements a clone
    if one of them can be obtained from the other by
    replacing some subtrees

6
Example
    x = a;
    y = f(x, i);
    cout << y;

    x = a + b;
    y = f(x, j);
    cout << y;

[Figure: abstract syntax trees of the two statement sequences]

7
Automatic clone detection tool
  • Detect occurrences of similar code
  • Applications
      • Refactoring into new functions or base classes
      • Number of clones can be used as a measure of code
        quality
  • Several tools exist [1]
  • [1] S. Bellon et al. Comparison and Evaluation of
    Clone Detection Tools, 2007.

8
The sketch of the algorithm
  • Partition similar statements into clusters
  • Find pairs of identical cluster sequences
  • Refine by examining identified code sequences for
    structural similarity

[Figure: example statements (i = 0, f(i), k = 0, f(k)) partitioned into clusters]

9
Main problems
  • How to compute similarity between two trees?
      • Use edit distance
  • How to compute similarity between a new tree and
    an existing tree cluster?
      • Comparing with each tree in the cluster is
        expensive
      • Compare the new tree with an average value stored
        for the cluster

10
Anti-unification
  • Anti-unifier of two trees is the most specific
    generalization that matches both

[Figure: two example trees rooted at f and their anti-unifier, in which the differing subtrees are replaced by placeholders (?)]

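As an illustration of the definition above, here is a minimal Python sketch of anti-unification for two trees. The tree encoding (a leaf is a string, an inner node is a tuple `(label, child, ...)`), the function name `anti_unify` and the placeholder names `?0`, `?1`, ... are assumptions made for this example; they are not taken from the slides or from Clone Digger.

```python
def anti_unify(t1, t2):
    """Return (pattern, s1, s2) for trees given as leaves or (label, *children).

    Applying substitution s1 (resp. s2) to the placeholders in `pattern`
    gives back t1 (resp. t2).
    """
    s1, s2 = {}, {}
    seen = {}  # (subtree of t1, subtree of t2) -> placeholder name

    def walk(a, b):
        if a == b:
            return a
        same_shape = (isinstance(a, tuple) and isinstance(b, tuple)
                      and a[0] == b[0] and len(a) == len(b))
        if same_shape:
            return (a[0],) + tuple(walk(x, y) for x, y in zip(a[1:], b[1:]))
        # The subtrees differ: replace them with a placeholder, reusing the
        # same placeholder when the same pair of subtrees occurs again.
        if (a, b) not in seen:
            name = f"?{len(seen)}"
            seen[(a, b)] = name
            s1[name], s2[name] = a, b
        return seen[(a, b)]

    return walk(t1, t2), s1, s2


first = ("=", "y", ("f", "x", "i"))    # hypothetical encoding of y = f(x, i)
second = ("=", "y", ("f", "x", "j"))   # hypothetical encoding of y = f(x, j)
pattern, s1, s2 = anti_unify(first, second)
print(pattern)  # ('=', 'y', ('f', 'x', '?0'))
print(s1, s2)   # {'?0': 'i'} {'?0': 'j'}
```

Reusing one placeholder for a repeated pair of differing subtrees is what keeps the result as specific as possible: `f(a, a)` and `f(b, b)` generalize to `f(?0, ?0)` rather than `f(?0, ?1)`.
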
11
Anti-unification features
  • Anti-unifier of a set of trees keeps their common
    features: tree structure and common labels
  • Anti-unification can be used to compute the edit
    distance between two trees
  • σ1 and σ2 are substitutions such that
    E0 σ1 = E1 and E0 σ2 = E2, where E0 is the
    anti-unifier of E1 and E2
  • distance = |σ1| + |σ2|
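
The distance described above can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: it assumes trees encoded as strings (leaves) or tuples `(label, child, ...)`, and it assumes that the size of a substitution is the total number of nodes in the subtrees it substitutes. The names `tree_size` and `au_distance` are hypothetical.

```python
def tree_size(tree):
    """Number of nodes in a tree given as a leaf or (label, *children)."""
    if isinstance(tree, tuple):
        return 1 + sum(tree_size(child) for child in tree[1:])
    return 1


def au_distance(t1, t2):
    """Sum of the sizes of the two substitutions that turn the anti-unifier
    of t1 and t2 back into t1 and t2 (size = node count, an assumption)."""
    mismatches = set()  # each distinct pair of differing subtrees, counted once

    def walk(a, b):
        if a == b:
            return
        if (isinstance(a, tuple) and isinstance(b, tuple)
                and a[0] == b[0] and len(a) == len(b)):
            for x, y in zip(a[1:], b[1:]):
                walk(x, y)
        else:
            mismatches.add((a, b))

    walk(t1, t2)
    return sum(tree_size(a) + tree_size(b) for a, b in mismatches)


left = ("=", "x", "a")                     # hypothetical encoding of x = a
right = ("=", "x", ("+", "a", "b"))        # hypothetical encoding of x = a + b
print(au_distance(left, right))            # 4: one node on the left, three on the right
print(au_distance(left, left))             # 0: identical trees
```
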

12
The first phase: building clusters of statements
  • We use a simple one-pass clustering algorithm

    for tree in statement_trees:
        # pick the cluster that is cheapest to extend with this tree
        best_cluster = min(clusters, key=lambda c: c.add_cost(tree), default=None)
        if best_cluster is not None and best_cluster.add_cost(tree) < threshold:
            best_cluster.append(tree)
        else:
            clusters.append(Cluster(tree))

13
Finding the best cluster
  • What add_cost function should we use? The cost
    should be high in these cases:
      • If the cluster is large and joining the new tree
        changes the cluster's average value significantly
      • If the average value of the new cluster is far
        away from the tree
  • add_cost = n · (au - au') + (tree - au')
      • n: the old size of the cluster
      • au: the old anti-unifier of the cluster
      • au': the new anti-unifier of the cluster
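
To make the cost function and the one-pass clustering concrete, here is a small Python sketch under stated assumptions. Trees are encoded as strings (leaves) or tuples `(label, child, ...)`; the differences in the cost formula are read as differences in node count, with placeholders counting as zero; the best cluster is taken to be the one with the lowest cost, since the cost is then compared against an upper threshold. The `Hole` class, the `size` and `build_clusters` helpers and the threshold value are hypothetical choices for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hole:
    """Placeholder for a subtree on which the generalized trees disagree."""
    index: int


def anti_unify(a, b, seen=None):
    """Pattern generalizing a and b; trees are leaves or (label, *children)."""
    seen = {} if seen is None else seen
    if a == b:
        return a
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] and len(a) == len(b)):
        return (a[0],) + tuple(anti_unify(x, y, seen) for x, y in zip(a[1:], b[1:]))
    return seen.setdefault((a, b), Hole(len(seen)))


def size(tree):
    """Number of nodes, not counting placeholders (an assumption)."""
    if isinstance(tree, Hole):
        return 0
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1


class Cluster:
    def __init__(self, tree):
        self.pattern = tree   # anti-unifier of all member trees
        self.trees = [tree]

    def add_cost(self, tree):
        n = len(self.trees)
        new_pattern = anti_unify(self.pattern, tree)
        return (n * (size(self.pattern) - size(new_pattern))
                + (size(tree) - size(new_pattern)))

    def append(self, tree):
        self.pattern = anti_unify(self.pattern, tree)
        self.trees.append(tree)


def build_clusters(statement_trees, threshold):
    clusters = []
    for tree in statement_trees:
        best = min(clusters, key=lambda c: c.add_cost(tree), default=None)
        if best is not None and best.add_cost(tree) < threshold:
            best.append(tree)
        else:
            clusters.append(Cluster(tree))
    return clusters


statements = [("=", "i", "0"), ("call", "f", "i"),
              ("=", "k", "0"), ("call", "f", "k")]
clusters = build_clusters(statements, threshold=3)
print([len(c.trees) for c in clusters])  # [2, 2]
print(clusters[0].pattern)               # ('=', Hole(index=0), '0')
```

In this example the two assignments end up in one cluster and the two calls in another, because joining them costs 2 while mixing an assignment with a call costs at least 6.
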

14
Improving efficiency
  • In order not to compare each AST with each other
    AST we use hashing. The upper parts of the trees
    are hashed.

[Figure: example trees whose upper parts are hashed]

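One possible way to hash only the upper part of each tree is sketched below in Python. The slides do not say how deep the hashed part is or how the hash is computed, so the cut-off `depth`, the `"*"` marker for pruned subtrees, and the helper names `upper_part` and `bucket_by_upper_part` are all assumptions. Trees are encoded as strings (leaves) or tuples `(label, child, ...)`.

```python
from collections import defaultdict


def upper_part(tree, depth):
    """Copy of the tree cut off below `depth` levels; deeper subtrees become '*'."""
    if depth == 0:
        return "*"
    if isinstance(tree, tuple):
        return (tree[0],) + tuple(upper_part(child, depth - 1) for child in tree[1:])
    return tree


def bucket_by_upper_part(trees, depth=2):
    """Group trees whose upper parts coincide; the dict hashes the pruned tree."""
    buckets = defaultdict(list)
    for tree in trees:
        buckets[upper_part(tree, depth)].append(tree)
    return buckets


trees = [
    ("=", "x", ("+", "a", "b")),
    ("=", "x", ("+", "a", "c")),
    ("call", "f", "x"),
]
buckets = bucket_by_upper_part(trees, depth=2)
print(sorted(len(group) for group in buckets.values()))  # [1, 2]
```

Only trees that land in the same bucket then need to be compared in detail, which avoids comparing every tree with every other tree.
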
15
Why is this not enough?
  • If we only consider individual pairs of statements
    from the same cluster, we miss cloned sequences
    of statements
  • We should find all pairs of identical cluster
    sequences and then check them for similarity

    void f()                 // cluster 1
        cin >> i;            // cluster 2
        int j = i + 100;     // cluster 3
        cout << i << j;      // cluster 4

    void f(int j)            // cluster 5
        cin >> i;            // cluster 2
        int j = i + 100;     // cluster 3
        cout << j;           // cluster 6

16
The second phase: finding all common subsequences
  • After the first phase each statement node is
    marked with the ID of its cluster
  • We want to find all pairs of similar sequences of
    cluster IDs
  • We do it using suffix trees
  • Only long common subsequences are considered
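
The slides use suffix trees for this step. As a simpler stand-in that is easier to show in a few lines, the Python sketch below finds the same kind of result, maximal common runs of cluster IDs in two sequences, with a quadratic dynamic-programming table instead of a suffix tree. The function name `common_runs` and the `min_length` parameter are hypothetical.

```python
def common_runs(ids_a, ids_b, min_length=2):
    """Maximal common contiguous runs as (start_a, start_b, length) triples."""
    rows, cols = len(ids_a), len(ids_b)
    # table[i][j] = length of the longest common run ending at
    # ids_a[i - 1] and ids_b[j - 1]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if ids_a[i - 1] == ids_b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1

    runs = []
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            length = table[i][j]
            # A run is maximal if it cannot be extended to the right.
            ends_here = i == rows or j == cols or ids_a[i] != ids_b[j]
            if length >= min_length and ends_here:
                runs.append((i - length, j - length, length))
    return runs


# Cluster IDs of the statements in two example functions.
first_function = [1, 2, 3, 4]
second_function = [5, 2, 3, 6]
print(common_runs(first_function, second_function))  # [(1, 1, 2)]
```

The single result says that a run of two statements starting at index 1 in both functions has the same cluster IDs (2, 3), which makes it a candidate for the structural check in the next phase.
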

17
The third phase: finding similar sequences of
statements

[Figure: the statement sequences i = 0; k = 3; f(i, k) and k = 0; n = 3; f(k, n), compared for structural similarity]

18
Comparison with existing AST methods
  • W. Yang, 1991
      • Edit distance between two trees
  • I. Baxter et al., 1998
      • Hash functions on subtrees, some kind of edit
        distance
  • V. Wahler, 2004
      • Comparison of feature vectors
  • S. Evans et al., 2007
      • Subtree patterns (similar to anti-unification),
        hash functions on subtrees

19
Clone Digger
  • The tool is written in Python
  • Supported languages
      • Python (ASTs are built using the standard
        compiler package)
      • Java 1.5 (using the ANTLR parser generator)
  • Information on the clones found is written to
    HTML with the differences highlighted
  • Applying it to the open-source projects NLTK and
    BioPython showed that 12% of their code consists
    of clones

20
Clone Digger
  • Provided under the GPL license and can be
    downloaded from the site
  • http://clonedigger.sourceforge.net

21
  • Thank you!