GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis
Chao Liu, Chen Chen,
Jiawei Han, Philip S. Yu
University of Illinois at Urbana-Champaign
IBM T.J. Watson Research Center
Presented by Chao Liu

Motivations
- Blossom of open-source projects
  - SourceForge.net: 125,090 projects as of July 2006
- Convenience for software plagiarism?
  - You can always find something online
- Core-part plagiarism
  - Ripping off GUIs and irrelevant parts
  - (Illegally) reusing the implementations of core algorithms
- Our goal
  - Efficient detection of core-part plagiarism

Challenges
- Effectiveness
  - Professional plagiarists
  - Automated plagiarism
- Efficiency
  - Only a small part of the code is plagiarized; how can it be detected efficiently?

Outline
- Plagiarism Disguises
- Review of Plagiarism Detection
- GPLAG: PDG-based Plagiarism Detection
- Efficiency and Scalability
- Experiments
- Conclusions

Original Program

    static void
    make_blank (struct line *blank, int count)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      blank->nfields = count;
      blank->buf.size = blank->buf.length = count + 1;
      blank->buf.buffer = (char*) xmalloc (blank->buf.size);
      buffer = (unsigned char *) blank->buf.buffer;
      blank->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * count);
      for (i = 0; i < count; i++){
        ...
      }
    }

A procedure in a program called join

Disguise 1: Format Alteration

    static void
    make_blank (struct line *blank, int count)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      blank->nfields = count; // initialization
      blank->buf.size = blank->buf.length = count + 1;
      blank->buf.buffer = (char*) xmalloc (blank->buf.size);
      buffer = (unsigned char *) blank->buf.buffer;
      blank->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * count);
      for (i = 0; i < count; i++){
        ...
      }
    }

Insert comments and blanks

Disguise 2: Identifier Renaming

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->nfields = num; // initialization
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * num);
      for (i = 0; i < num; i++){
        ...
      }
    }

Rename variables consistently

Disguise 3: Statement Reordering

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * num);
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->nfields = num; // initialization
      for (i = 0; i < num; i++){
        ...
      }
    }

Reorder non-dependent statements

Disguise 4: Control Replacement

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * num);
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->nfields = num; // initialization
      i = 0;
      while (i < num){
        ...
        i++;
      }
    }

Use equivalent control structure

Disguise 5: Code Insertion

    static void
    fill_content (struct line *fill, int num)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      fill->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * num);
      fill->buf.size = fill->buf.length = num + 1;
      fill->buf.buffer = (char*) xmalloc (fill->buf.size);
      buffer = (unsigned char *) fill->buf.buffer;
      fill->nfields = num; // initialization
      i = 0;
      while (i < num){
        ... for (int j = 0; j < i; j++);
        i++;
      }
    }

Insert immaterial code

Fully Disguised

Original Code

    static void
    make_blank (struct line *blank, int count)
    {
      int i;
      unsigned char *buffer;
      struct field *fields;
      blank->nfields = count;
      blank->buf.size = blank->buf.length = count + 1;
      blank->buf.buffer = (char*) xmalloc (blank->buf.size);
      buffer = (unsigned char *) blank->buf.buffer;
      blank->fields = fields =
        (struct field *) xmalloc (sizeof (struct field) * count);
      for (i = 0; i < count; i++){
        ...
      }
    }

Plagiarized Code

    static void
    fill_content(int num, struct line* fill)
    {
      (*fill).store.size = fill->store.length = num + 1;
      struct field *tabs;
      (*fill).fields = tabs = (struct field *) xmalloc (sizeof (struct field) * num);
      (*fill).store.buffer = (char*) xmalloc (fill->store.size);
      (*fill).ntabs = num;
      unsigned char *pb;
      pb = (unsigned char *) (*fill).store.buffer;
      int idx = 0;
      while(idx < num){ // fill in the storage
        ...
        for(int j = 0; j < idx; j++)
          ...
        idx++;
      }
    }

Review of Plagiarism Detection
- String-based [Baker 1995]
  - A program is represented as a string
  - Blanks and comments are ignored
- AST-based [Baxter et al. 1998, Kontogiannis et al. 1995]
  - A program is represented as an Abstract Syntax Tree (AST)
  - Fragile to statement reordering, control replacement and code insertion
- Token-based [Kamiya et al. 2002, Prechelt et al. 2002]
  - Variables of the same type are mapped to the same token
  - A program is represented as a token string
  - Fingerprints of token strings are used for robustness [Schleimer et al. 2003]
  - Partially robust to statement reordering, control replacement and code insertion
  - Representatives: Moss and JPlag

Graphic representation of source code

    int sum(int array[], int count)
    {
      int i, sum;
      sum = 0;
      for(i = 0; i < count; i++){
        sum = add(sum, array[i]);
      }
      return sum;
    }

    int add(int a, int b)
    {
      return a + b;
    }

Control Dependency
[Figure: the sum/add code, annotated with its control-dependency edges; diagram not recoverable]

Data Dependency
[Figure: the sum/add code, annotated with its data-dependency edges; diagram not recoverable]
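The two kinds of dependency can be written down explicitly for sum(). The sketch below is a hand-built illustration, not CodeSurfer output: the type names (DepKind, DepEdge), the statement numbering, and the choice of edges are all assumptions made for this example, and declarations, parameters, and the call into add() are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical edge kinds for a program dependence graph (PDG). */
typedef enum { DEP_CONTROL, DEP_DATA } DepKind;

typedef struct {
    int from;      /* index of the statement depended upon */
    int to;        /* index of the dependent statement */
    DepKind kind;
} DepEdge;

/* Statements of sum(), numbered only for this illustration. */
const char *const sum_vertices[] = {
    "sum = 0",                    /* 0 */
    "i = 0",                      /* 1 */
    "i < count",                  /* 2 */
    "sum = add(sum, array[i])",   /* 3 */
    "i++",                        /* 4 */
    "return sum",                 /* 5 */
};

/* A hand-picked subset of the dependencies in sum(). */
const DepEdge sum_edges[] = {
    {2, 3, DEP_CONTROL},  /* the loop test decides whether the body runs */
    {2, 4, DEP_CONTROL},  /* ... and whether i is incremented */
    {0, 3, DEP_DATA},     /* sum = 0 is read by the first add() call */
    {3, 3, DEP_DATA},     /* sum is carried across iterations */
    {1, 2, DEP_DATA},     /* i = 0 is read by the loop test */
    {4, 2, DEP_DATA},     /* i++ is read by the next loop test */
    {1, 3, DEP_DATA},     /* i = 0 indexes array[i] */
    {4, 3, DEP_DATA},     /* i++ indexes array[i] */
    {1, 4, DEP_DATA},     /* the first i++ reads i = 0 */
    {4, 4, DEP_DATA},     /* later i++ read the previous i++ */
    {0, 5, DEP_DATA},     /* return sees sum = 0 if the loop never runs */
    {3, 5, DEP_DATA},     /* return sees the last add() result */
};

const size_t sum_edge_count = sizeof sum_edges / sizeof sum_edges[0];

/* Count the edges of one kind in an edge list. */
size_t count_edges(const DepEdge *edges, size_t n, DepKind kind)
{
    size_t total = 0;
    assert(edges != NULL || n == 0);
    for (size_t e = 0; e < n; e++) {
        if (edges[e].kind == kind) {
            total++;
        }
    }
    return total;
}
```

Reordering the two initializations, or replacing the for loop with an equivalent while loop, would leave this edge list unchanged, which is the property the PDG-based approach relies on.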

Plagiarism Detectable?
- Original Code: make_blank, as on the Fully Disguised slide
- Plagiarized Code: fill_content, as on the Fully Disguised slide

Corresponding PDGs
[Figure: two PDGs side by side; only the vertex labels are recoverable, the edges are not]

PDG for the Original Code
- 0: assign, blank->fields = fields = ...
- 1: assign, fields = (struct ...
- 2: call-site, xmalloc()
- 3: decl., line* blank
- 4: assign, blank->buf.buffer = (char*) xm..
- 5: decl., struct field* fields
- 6: call-site, xmalloc()
- 7: assign, blank->nfields = count
- 8: decl., int count
- 9: assign, blank->buf.size = blank->...
- 10: assign, buffer = (unsigned) ...
- 11: decl., char* buffer
- 12: decl., int i
- 13: assign, i = 0
- 14: inc., i++
- 15: control, i < count

PDG for the Plagiarized Code
- 0: assign, (*field).fields = tab = ...
- 1: assign, tabs = (struct ...
- 2: call-site, xmalloc()
- 3: decl., line* fill
- 4: assign, (*fill).store.buf = (char*) ...
- 5: decl., struct field* tabs
- 6: call-site, xmalloc()
- 7: assign, (*fill).ntabs = num
- 8: decl., int num
- 9: assign, (*fill).store.size = ...
- 10: assign, pb = (unsigned char*) (*fill)...
- 11: decl., char* pb
- 12: decl., int idx
- 13: assign, idx = 0
- 14: inc., idx++
- 15: control, while(idx < num)
- 16: decl., int j
- 17: assign, j = 0
- 18: inc., j++
- 19: control, j < idx

PDG-based Plagiarism Detection
- A program is represented as a set of PDGs
  - Let g be the PDG of procedure P in the original program
  - Let g’ be the PDG of procedure P’ in the plagiarism suspect
- Subgraph isomorphism implies plagiarism
  - If g is subgraph isomorphic to g’, P’ is likely plagiarized from P
- γ-isomorphism: graph g is γ-isomorphic to g’ if there exists a subgraph s of g such that s is subgraph isomorphic to g’ and |s| ≥ γ|g|, for a threshold γ in (0, 1]
- If g is γ-isomorphic to g’, the pair (g, g’) is regarded as a plagiarized PDG pair and is returned to a human for examination
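The γ-isomorphism test can be sketched as a brute-force search on small graphs. Everything here is an assumption made for illustration: the Pdg struct, the MAX_V cap, reading |s| as a vertex count, and taking s to be a vertex-induced subgraph of g whose edges must all appear in g’ (which may have extra edges). GPLAG itself delegates isomorphism testing to VFLib; this exponential backtracking is only meant to make the definition concrete.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_V 16  /* illustrative cap; the search below is exponential */

typedef struct {
    int n;                     /* number of vertices */
    int kind[MAX_V];           /* vertex kind, e.g. 0 = decl., 1 = assign */
    bool edge[MAX_V][MAX_V];   /* edge[u][v]: dependency from u to v */
} Pdg;

/* Can vertex u of g be mapped to vertex w of h, given map[0..u-1]? */
static bool compatible(const Pdg *g, const Pdg *h, const int *map, int u, int w)
{
    if (g->kind[u] != h->kind[w]) return false;
    if (g->edge[u][u] && !h->edge[w][w]) return false;
    for (int p = 0; p < u; p++) {
        if (map[p] < 0) continue;          /* p was left out of s */
        if (map[p] == w) return false;     /* w is already taken */
        if (g->edge[p][u] && !h->edge[map[p]][w]) return false;
        if (g->edge[u][p] && !h->edge[w][map[p]]) return false;
    }
    return true;
}

/* Backtracking over vertices of g: map each one into h or leave it out. */
static void search(const Pdg *g, const Pdg *h, int *map,
                   int u, int mapped, int *best)
{
    if (mapped + (g->n - u) <= *best) return;   /* cannot beat the best */
    if (u == g->n) { *best = mapped; return; }
    for (int w = 0; w < h->n; w++) {
        if (compatible(g, h, map, u, w)) {
            map[u] = w;
            search(g, h, map, u + 1, mapped + 1, best);
        }
    }
    map[u] = -1;                                /* leave u out of s */
    search(g, h, map, u + 1, mapped, best);
}

/* Size of the largest vertex subset of g that embeds into h. */
int max_embedded(const Pdg *g, const Pdg *h)
{
    int map[MAX_V];
    int best = 0;
    assert(g->n >= 0 && g->n <= MAX_V && h->n >= 0 && h->n <= MAX_V);
    search(g, h, map, 0, 0, &best);
    return best;
}

/* g is gamma-isomorphic to h if some s with |s| >= gamma * |g| embeds. */
bool gamma_isomorphic(const Pdg *g, const Pdg *h, double gamma)
{
    assert(gamma > 0.0 && gamma <= 1.0);
    return max_embedded(g, h) >= gamma * g->n;
}
```

With γ below 1, a pair can still be reported when a few vertices of g have no counterpart in g’, which gives some slack for small modifications to the copied code.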

Advantages
- Robust because it is hard to overhaul PDGs
  - Dependencies encode program logic
  - Incentive of plagiarism: the plagiarist wants to reuse the logic, not rewrite it

Efficiency and Scalability
- Search space
  - If the original program has n procedures and the plagiarism suspect has m procedures, n*m subgraph isomorphism tests are needed
- Pruning the search space
  - Lossless filter
  - Statistical lossy filter

Lossless filter
- Interestingness
  - PDGs smaller than an interesting size K are excluded from both sides
- γ-isomorphism definition
  - A PDG pair (g, g’) is discarded if |g’| < γ|g|, because no subgraph s of g with |s| ≥ γ|g| can then fit inside g’
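Both lossless checks need only the sizes of the two graphs, so they can run before any isomorphism test. A minimal sketch, assuming sizes are vertex counts; the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the pair (g, g') must still be tested for isomorphism. */
bool passes_lossless_filter(size_t size_g, size_t size_g_prime,
                            size_t min_size_k, double gamma)
{
    assert(gamma > 0.0 && gamma <= 1.0);

    /* Interestingness: drop PDGs smaller than K on either side. */
    if (size_g < min_size_k || size_g_prime < min_size_k) {
        return false;
    }
    /* Size bound: a subgraph with at least gamma*|g| vertices
       cannot embed into a graph with fewer vertices than that. */
    if ((double)size_g_prime < gamma * (double)size_g) {
        return false;
    }
    return true;
}
```

For example, with K = 10 and γ = 0.9, a pair of sizes (20, 16) is discarded because 16 < 18, while (16, 20) is kept.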

Lossy Filter
- Observation
  - If procedure P’ is plagiarized from procedure P, its PDG g’ should look similar to g
  - So discard dissimilar PDG pairs
- Requirement
  - This filter must be lightweight

Vertex Histogram
- Represent PDG g by h(g) = (n1, n2, …, nk), where ni is the frequency of the ith kind of vertex
- Similarly, represent PDG g’ by h(g’) = (m1, m2, …, mk)
- Direct similarity measurement?
  - How to define a proper similarity threshold?
  - Is a threshold defined this way program-independent?
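Building h(g) is a single pass over the vertices. The sketch below assumes five vertex kinds, named after the labels visible on the Corresponding PDGs slide (decl., assign, inc., control, call-site); the enum and function names are hypothetical, and a real PDG generator may distinguish more kinds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vertex kinds; KIND_COUNT plays the role of k. */
enum {
    KIND_DECL,
    KIND_ASSIGN,
    KIND_INCREMENT,
    KIND_CONTROL,
    KIND_CALL_SITE,
    KIND_COUNT
};

/* Fill hist[i] with the number of vertices of kind i. */
void vertex_histogram(const int *kinds, size_t nvertices, int hist[KIND_COUNT])
{
    for (int i = 0; i < KIND_COUNT; i++) {
        hist[i] = 0;
    }
    for (size_t v = 0; v < nvertices; v++) {
        assert(kinds[v] >= 0 && kinds[v] < KIND_COUNT);
        hist[kinds[v]]++;
    }
}
```

Applied to the vertex labels listed on the Corresponding PDGs slide, this gives (5, 7, 1, 1, 2) for the original code and (6, 8, 2, 2, 2) for the plagiarized code, in the order decl., assign, inc., control, call-site.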

Hypothesis Testing-based Approach
- Basic idea
  - Estimate a k-dimensional multinomial distribution from h(g)
  - Test whether h(g’) is likely an observation from that distribution
  - If it is, g’ looks similar to g, and an isomorphism test is needed
  - Otherwise, (g, g’) is discarded
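The exact test statistic is on the Technical Details slides, which did not survive in this transcript. As a stand-in, the sketch below uses Pearson's chi-square goodness-of-fit statistic, a standard way to test counts against a multinomial distribution. The add-one smoothing, the function names, and the caller-supplied critical value are all assumptions of this example, not the authors' stated method.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Pearson statistic for counts m[] against the multinomial estimated from n[]. */
double pearson_statistic(const int *n, const int *m, size_t k)
{
    double total_n = 0.0, total_m = 0.0, stat = 0.0;

    for (size_t i = 0; i < k; i++) {
        assert(n[i] >= 0 && m[i] >= 0);
        total_n += n[i];
        total_m += m[i];
    }
    for (size_t i = 0; i < k; i++) {
        /* Add-one smoothing (an assumption) keeps every p_i positive. */
        double p = (n[i] + 1.0) / (total_n + (double)k);
        double expected = total_m * p;
        double diff = m[i] - expected;
        if (expected > 0.0) {
            stat += diff * diff / expected;
        }
    }
    return stat;
}

/* Keep the pair for isomorphism testing unless h(g') is an unlikely sample. */
bool looks_similar(const int *n, const int *m, size_t k, double critical)
{
    return pearson_statistic(n, m, k) <= critical;
}
```

The critical value comes from the chi-square distribution with k - 1 degrees of freedom; for k = 3 at the 5% level it is 5.991. A fixed significance level stands in for a hand-tuned similarity threshold, which speaks to the program-independence question raised on the Vertex Histogram slide.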

Technical Details
[Two slides of formulas; content not recoverable from this transcript]

Work-flow of GPLAG
- PDGs are generated with CodeSurfer
- Isomorphism testing is implemented with VFLib

Experiment Design
- Subject programs
- Effectiveness
- Filter efficiency
- Core-part plagiarism detection

Effectiveness
- 2-hour manual plagiarism, but can it be automated?
- GPLAG detects all plagiarized PDG pairs within 1 second
- PDG isomorphism also reveals which plagiarism disguises were applied

Efficiency
- Subject programs
  - bc, less and tar
  - Exact copy as plagiarism
- Lossless and lossy filters
  - Pruning PDG pairs
  - Implication for overall time cost

Pruning Uninteresting PDG-pairs
[Figure: two panels, "Lossless only" and "Lossless and lossy"; plot data not recoverable]

Implication for Overall Time Cost
- A time-out is set for subgraph isomorphism testing; tests that hit it are time hogs
- The lossless filter does not save much time
- The lossy filter significantly reduces the time cost
- The major time saving comes from avoiding time hogs

Detection of Core-part Plagiarism
- Lower time cost with the lossy filter
- Fewer false positives with the lossy filter

Conclusions
- We developed a new algorithm, GPLAG, for software plagiarism detection
  - It is more effective against “professional” plagiarists
- We developed a statistical lossy filter, which improves the efficiency of GPLAG
- We experimentally verified the effectiveness and efficiency of GPLAG
Q & A
Thank You!

References
[1] B. S. Baker. On finding duplication and near duplication in large software systems. In Proc. of 2nd Working Conf. on Reverse Engineering, 1995.
[2] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier. Clone detection using abstract syntax trees. In Proc. of Int. Conf. on Software Maintenance, 1998.
[3] K. Kontogiannis, M. Galler, and R. DeMori. Detecting code similarity using patterns. In Working Notes of 3rd Workshop on AI and Software Engineering, 1995.
[4] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7), 2002.
[5] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. J. of Universal Computer Science, 8(11), 2002.
[6] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In SIGMOD, 2003.
[7] V. B. Livshits and T. Zimmermann. DynaMine: Finding common error patterns by mining software revision histories. In Proc. of 13th Int. Symp. on the Foundations of Software Engineering, 2005.
[8] C. Liu, X. Yan, and J. Han. Mining control flow abnormality for logic error isolation. In Proc. 2006 SIAM Int. Conf. on Data Mining, 2006.
[9] C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining behavior graphs for “backtrace” of noncrashing bugs. In SDM, 2005.