Duplicate code detection using Clone Digger Peter Bulychev Lomonosov Moscow State University CS department


Post on 26-Mar-2015


Page 1

Duplicate code detection using Clone Digger

Peter Bulychev
Lomonosov Moscow State University
CS department

Page 2

Outline

Theoretical part
  Clone detection problem in general
  The theory behind the tool
Practical part
  Clone Digger and the results of its application to several Python open-source projects
Other ongoing projects

Page 3

What is a software clone?

Two fragments of code form a clone if they are similar enough (according to a given measure of similarity)

    for i in range(5):
        for j in range(i):
            print(i + j)

    for k in range(6):
        for m in range(k):
            print(k + m)
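To make "a given measure of similarity" concrete, here is a minimal Python 3 sketch of one possible measure. It is an illustration only and not the measure Clone Digger uses: it compares token sequences with the standard `difflib.SequenceMatcher`. The helper names `tokens` and `similarity` are hypothetical. The example also shows how strongly the result depends on the chosen measure: once identifiers and numbers are abstracted away, the two fragments above become indistinguishable.

```python
import io
import keyword
import tokenize
from difflib import SequenceMatcher

FRAGMENT_A = """\
for i in range(5):
    for j in range(i):
        print(i + j)
"""

FRAGMENT_B = """\
for k in range(6):
    for m in range(k):
        print(k + m)
"""


def tokens(source, normalize=False):
    """Return names, numbers and operators of `source` as a list of strings.

    With normalize=True every identifier becomes "ID" and every number "NUM".
    """
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            is_identifier = not keyword.iskeyword(tok.string)
            result.append("ID" if normalize and is_identifier else tok.string)
        elif tok.type == tokenize.NUMBER:
            result.append("NUM" if normalize else tok.string)
        elif tok.type == tokenize.OP:
            result.append(tok.string)
    return result


def similarity(a, b, normalize=False):
    """Similarity in [0, 1] between the token sequences of two fragments."""
    return SequenceMatcher(None, tokens(a, normalize), tokens(b, normalize)).ratio()


raw_similarity = similarity(FRAGMENT_A, FRAGMENT_B)
normalized_similarity = similarity(FRAGMENT_A, FRAGMENT_B, normalize=True)
print(raw_similarity, normalized_similarity)
```

Whether the pair counts as a clone then depends on a threshold applied to the chosen measure.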

Page 4

Why is it important to detect code clones?
  5% to 20% of the code in software systems consists of clones [1]

Why do programmers produce clones? [2]
  Development strategy
  Maintenance benefits
  Overcoming underlying limitations
  Cloning by accident

Why is the presence of code clones bad?
  Errors in the original must be fixed in every clone

[1] I.D. Baxter et al. Clone Detection Using Abstract Syntax Trees, 1998.
[2] C.K. Roy and J.R. Cordy. A Survey on Software Clone Detection Research, 2007.

Page 5

Our definition of clone

Different clone definitions can be classified according to the level of granularity:
  List of strings
  Sequence of tokens
  Abstract syntax trees (AST)
  Semantic information

We work on the AST level

We consider two sequences of statements as a clone if one of them can be obtained from the other by replacing some subtrees

Page 6

Example

    x = a
    y = f(x, i)
    print(y)

    x = a + b
    y = f(x, j)
    print(y)

[Figure: abstract syntax trees of the two blocks. Each block node has three children: the assignment to x, the assignment to y, and the print statement.]
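The subtrees in which the two example blocks differ can be found mechanically. The following is a minimal Python 3 sketch that uses the standard `ast` module (`ast.unparse` requires Python 3.9 or later). It is an illustration under simplifying assumptions, not Clone Digger's implementation; `differing_subtrees` is a hypothetical helper that walks both trees in parallel and reports the topmost nodes that do not match.

```python
import ast

BLOCK_A = "x = a\ny = f(x, i)\nprint(y)\n"
BLOCK_B = "x = a + b\ny = f(x, j)\nprint(y)\n"


def differing_subtrees(left, right):
    """Return pairs of AST nodes (left, right) at which the two trees differ."""
    if type(left) is not type(right):
        return [(left, right)]
    pairs = []
    for field in left._fields:
        lval, rval = getattr(left, field, None), getattr(right, field, None)
        lkids = lval if isinstance(lval, list) else [lval]
        rkids = rval if isinstance(rval, list) else [rval]
        if len(lkids) != len(rkids):
            return [(left, right)]  # different number of children
        for lkid, rkid in zip(lkids, rkids):
            if isinstance(lkid, ast.AST) and isinstance(rkid, ast.AST):
                pairs += differing_subtrees(lkid, rkid)
            elif lkid != rkid:
                return [(left, right)]  # a leaf value differs: report the node
    return pairs


pairs = differing_subtrees(ast.parse(BLOCK_A), ast.parse(BLOCK_B))
differences = [(ast.unparse(l), ast.unparse(r)) for l, r in pairs]
print(differences)
```

Replacing the reported subtrees of the first block with their counterparts yields the second block, which is what the clone definition asks for.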

Page 7

The sketch of the algorithm

1. Partition similar statements into clusters
2. Find pairs of identical cluster sequences
3. Refine by examining identified code sequences for structural similarity

[Figure: example statements i=0, i+=1, f(i), k=0, k+=1, f(k)]
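The first two steps can be sketched in a few lines of Python 3. This is a toy illustration under loudly stated assumptions, not the tool's algorithm: statements are put in the same cluster only when their ASTs are identical after abstracting names and constants (the tool clusters by similarity instead), and the refinement step is omitted. The helpers `shape`, `cluster_ids` and `repeated_runs` are hypothetical names.

```python
import ast


def shape(statement):
    """Cluster key: the statement's AST with names and constants abstracted."""
    tree = ast.parse(statement)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.Constant):
            node.value = 0
    return ast.dump(tree)


def cluster_ids(statements):
    """Step 1: map every statement to the number of its cluster."""
    known = {}
    return [known.setdefault(shape(s), len(known)) for s in statements]


def repeated_runs(ids, min_length=2):
    """Step 2: find (start_a, start_b, length) of repeated, non-overlapping runs."""
    runs = []
    n = len(ids)
    for a in range(n):
        for b in range(a + 1, n):
            if a > 0 and ids[a - 1] == ids[b - 1]:
                continue  # this run is already covered by an earlier start
            length = 0
            while (b + length < n and a + length < b
                   and ids[a + length] == ids[b + length]):
                length += 1
            if length >= min_length:
                runs.append((a, b, length))
    return runs


program = ["i = 0", "i += 1", "f(i)", "pass", "k = 0", "k += 1", "f(k)"]
ids = cluster_ids(program)
runs = repeated_runs(ids)
print(ids, runs)
```

Each reported run is only a clone candidate; the third step would compare the two statement sequences structurally before accepting them.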

Page 8

Main problems

How to compute similarity between two trees?
  Use edit distance

How to compute similarity between a new tree and an existing tree cluster?
  Comparing with each tree in the cluster is expensive
  Compare the new tree with an average value stored for the cluster

Page 9

Anti-unification

The anti-unifier of two trees is the most specific generalization that matches both of them

[Figure: the trees f(x + y, x * 2) and f(x + z, x / 2) and their anti-unifier f(x + ?, ?)]
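A minimal Python 3 sketch of anti-unification follows. It assumes a toy tree representation that is not taken from the tool: a tree is either a leaf (a string or number) or a tuple `(label, child, ...)`. Placeholders are written `?1`, `?2`, and so on, and the same placeholder is reused when the same pair of mismatching subtrees occurs again, which keeps the result as specific as possible.

```python
def anti_unify(left, right, variables=None):
    """Return the most specific pattern that matches both trees."""
    if variables is None:
        variables = {}
    if left == right:
        return left
    if (isinstance(left, tuple) and isinstance(right, tuple)
            and left[0] == right[0] and len(left) == len(right)):
        # Same label and arity: keep the node and generalize the children.
        children = [anti_unify(l, r, variables)
                    for l, r in zip(left[1:], right[1:])]
        return (left[0],) + tuple(children)
    # Mismatch: replace the pair of subtrees with a placeholder variable.
    return variables.setdefault((left, right), "?%d" % (len(variables) + 1))


tree_a = ("f", ("+", "x", "y"), ("*", "x", 2))
tree_b = ("f", ("+", "x", "z"), ("/", "x", 2))
pattern = anti_unify(tree_a, tree_b)
print(pattern)
```

For the two trees in the figure this yields f(x + ?1, ?2): the shared upper part survives and only the differing subtrees are replaced.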

Page 10

Anti-unification features

The anti-unifier of a set of trees keeps their common features: the common upper part

Anti-unification can be used to compute the edit distance between two trees:

Let E0 be the anti-unifier of E1 and E2, and let θ1 and θ2 be substitutions such that E0 θ1 = E1 and E0 θ2 = E2. Then

distance = |θ1| + |θ2|
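The distance formula can be tried out with a short Python 3 sketch. Two assumptions are made here that the slides do not state: trees are leaves or tuples `(label, child, ...)`, and the size |θ| of a substitution is the total number of nodes in the subtrees it substitutes. The tool may measure substitution size differently.

```python
def size(tree):
    """Number of nodes in a tree given as a leaf or a (label, child, ...) tuple."""
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1


def anti_unify(left, right, theta1, theta2):
    """Return the anti-unifier E0 and record the substitutions theta1, theta2."""
    if left == right:
        return left
    if (isinstance(left, tuple) and isinstance(right, tuple)
            and left[0] == right[0] and len(left) == len(right)):
        children = [anti_unify(l, r, theta1, theta2)
                    for l, r in zip(left[1:], right[1:])]
        return (left[0],) + tuple(children)
    for name in theta1:
        if theta1[name] == left and theta2[name] == right:
            return name  # reuse the variable for a repeated mismatch
    name = "?%d" % (len(theta1) + 1)
    theta1[name], theta2[name] = left, right
    return name


def edit_distance(left, right):
    """distance = |theta1| + |theta2| under the size assumption above."""
    theta1, theta2 = {}, {}
    anti_unify(left, right, theta1, theta2)
    return (sum(size(t) for t in theta1.values())
            + sum(size(t) for t in theta2.values()))


E1 = ("f", ("+", "x", "y"), ("*", "x", 2))
E2 = ("f", ("+", "x", "z"), ("/", "x", 2))
distance = edit_distance(E1, E2)
print(distance)
```

Here θ1 maps the placeholders to y and x * 2, θ2 maps them to z and x / 2, so each substitution contributes 1 + 3 nodes.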

Page 11

Clone Digger

  Is the first clone detection tool focused on Python (except Pylint)
  Is provided under the GPL license
  Writes the information on found clones to HTML in a two-column format with highlighting of differences

http://clonedigger.sourceforge.net

Page 12

Comparison with existing tools working with ASTs

CloneDR by Semantic Designs, I. Baxter, 1998
  Hash functions on subtrees, some kind of edit distance

Asta by Microsoft Research, S. Evans et al., 2007
  Subtree patterns (similar to anti-unification), hash functions on subtrees

Page 13

Quick Start

1. $ easy_install clonedigger
2. $ clonedigger --recursive source_tree
3. $ firefox output.html

Additional parameters, such as thresholds, can also be set (use --help to learn more)

Page 14

Running on real-life open-source projects

BioPython 12.19%

NLTK 11.85%

Zope 27.41%

Plone 29.89%

These numbers mean nothing, except that every large project has clones and they should be detected

Page 15

What to do with found clones?

  Remove clones by refactoring. Extract Method and Pull Up Method can be used
  Detect library candidates
  Search for bugs
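As an illustration of removing a clone with Extract Method, here is a minimal Python 3 sketch. It takes a pair of nested loops that differ only in variable names and in the outer bound, and turns the differing bound into a parameter. The function name `pair_sums` and the parameter `limit` are hypothetical.

```python
def pair_sums(limit):
    """Yield i + j for every pair 0 <= j < i < limit."""
    for i in range(limit):
        for j in range(i):
            yield i + j


# The two cloned loops become two calls to the extracted function.
for total in pair_sums(5):
    print(total)

for total in pair_sums(6):
    print(total)
```

After the refactoring, an error in the shared logic needs to be fixed in one place only.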

Page 16

Any questions?