Post on 19-Dec-2015
CodeSimian
CS491B – Andrew Weng
Motivation
• Academic integrity is a universal issue
• Plagiarism is still common today
  • Kaavya Viswanathan (Harvard student)
    • Book contains many plagiarized passages
  • Yoshihiko Wada (painter, Japan)
    • Artwork plagiarized from Alberto Sughi
  • Scott D. Miller (Wesley College president)
    • Plagiarized material found on his website
Is Plagiarism Harmful?
• Who does plagiarism really hurt?
  • The student
  • The class
  • The University
• Plagiarism is not only a matter of protecting intellectual property rights
Plagiarism Detection
Benefits of Utilizing Plagiarism Detection
• Prevention
• Enforcement
• Objective standpoint
Platform Overview
• Developed on Visual Studio .NET 2005
• Coded in Microsoft Visual C# .NET
• Windows Forms application
• Simple and familiar GUI (Windows)
• Intended focus is ease of use
Theoretical Overview
CodeSimian is based on two primary principles
• Kolmogorov Complexity
• Information Distance
Kolmogorov Complexity
• Simple definition: the length of the shortest program that, when run on a universal Turing machine, produces a specified output
• Purely theoretical
• Impossible to calculate exactly
Kolmogorov Complexity
Define x to be a desired output string
K(x) = the length of the shortest program that produces x
K(x|y) = the length of the shortest program that produces x given y as an input
K(xy) = the length of the shortest program that produces x concatenated with y
Kolmogorov Complexity
Compare two numbers with infinitely long decimal expansions, π and a randomly generated number n between 0 and 1:
π = 3.1415926535897932384626433832795…
n = 0.5234958723957329875320935293853…
K(π) is a small, finite number: the length of the code required to generate the value of π to an infinite number of digits
Kolmogorov Complexity
π = 3.1415926535897932384626433832795…
K(π) is a small, finite number: the length of the code required to generate the value of π to an infinite number of digits
Perhaps something as simple as the implementation of Leibniz’s formula:
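$$\pi = \sum_{n=0}^{\infty} \frac{4\,(-1)^n}{2n+1} = 4\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \frac{1}{11} + \cdots\right)$$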
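To illustrate how short such a program can be, here is a minimal Python sketch of Leibniz's series. The function name `leibniz_pi` and the `terms` parameter are illustrative choices, not part of CodeSimian. Floating-point output is of course limited in precision; the point is only that the program text stays the same size no matter how many terms are requested.

```python
def leibniz_pi(terms: int) -> float:
    """Approximate pi with the first `terms` terms of Leibniz's series."""
    total = 0.0
    for n in range(terms):
        # n-th term of the series: (-1)^n / (2n + 1)
        total += (-1) ** n / (2 * n + 1)
    # The series sums to pi/4, so scale by 4.
    return 4 * total


if __name__ == "__main__":
    print(leibniz_pi(100_000))
```

The series converges slowly (the error shrinks roughly in proportion to 1/terms), but the length of the program, which is what K(π) measures, does not grow with the number of digits produced.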
Kolmogorov Complexity
n = 0.5234958723957329875320935293853…
To generate the full output of a truly random number n, the program would have to be infinitely long.
The code would essentially be System.out.println("0.52349587…");
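The contrast between a structured and a random-looking string can be observed with any off-the-shelf compressor. The sketch below uses Python's `zlib` purely as a stand-in for "shortest description"; the helper `compressed_size` and the sample data are assumptions made for illustration. Note that the noise here comes from a seeded pseudo-random generator, so its true Kolmogorov complexity is actually small; the compressor simply cannot discover that.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


# 10,000 bytes with an obvious repeating pattern.
structured = b"0123456789" * 1000

# 10,000 bytes with no pattern that zlib can exploit.
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(10_000))

if __name__ == "__main__":
    print("structured:", compressed_size(structured))
    print("noise:     ", compressed_size(noise))
```

The structured input shrinks to a small fraction of its length, while the noise stays at roughly its original size.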
Kolmogorov Complexity
So how does this apply to plagiarism detection?
Define x = π and y = π/4
K(x|y) would be a very small value. Given y, one can compute π with a single multiplication (x = 4y).
Information Distance
The distance (or difference) between two objects
Formula used:
$$d(x, y) = 1 - \frac{K(x) - K(x|y)}{K(xy)}$$
Information Distance
• Similarity Factor
If we take the amount of information in x that is already contained in y, K(x) − K(x|y), and normalize it by the amount of information in x and y together, K(xy), we obtain a percentage of similarity
$$s(x, y) = \frac{K(x) - K(x|y)}{K(xy)}$$
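As a worked example of the arithmetic, the sketch below computes s(x, y) = (K(x) − K(x|y)) / K(xy) and d(x, y) = 1 − s(x, y) from three sizes. The function names and the numbers 100, 20 and 120 are hypothetical values chosen for illustration, not CodeSimian measurements.

```python
def similarity(k_x: float, k_x_given_y: float, k_xy: float) -> float:
    """s(x, y) = (K(x) - K(x|y)) / K(xy)."""
    return (k_x - k_x_given_y) / k_xy


def distance(k_x: float, k_x_given_y: float, k_xy: float) -> float:
    """d(x, y) = 1 - s(x, y)."""
    return 1.0 - similarity(k_x, k_x_given_y, k_xy)


if __name__ == "__main__":
    # Hypothetical sizes: x alone takes 100 units, x given y takes 20,
    # and x concatenated with y takes 120.
    print(similarity(100, 20, 120))  # (100 - 20) / 120
    print(distance(100, 20, 120))
```

With these made-up sizes, 80 of the 120 units needed to describe both files are shared, so the similarity is two thirds and the distance is one third.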
Implementation
What does CodeSimian do to obtain the similarity factors?
1. Parse and Tokenize the code
2. Compress the tokenized strings
3. Compare the compressed strings
Parsing the Code
• Utilized ANTLR to parse and tokenize the code
• ANTLR, ANother Tool for Language Recognition, (formerly PCCTS) is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing Java, C#, C++, or Python actions. (www.antlr.org)
Tokenizing the Code
• The tokenized output is a string of characters, each of which represents a token within the code
• For Example:
{ int c = 0; } contains 7 “letters”
Open Bracket
Integer type declaration
Variable name
Assignment operator
Integer Value
Statement end
Close Bracket
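The idea of mapping each token to one "letter" can be sketched with a toy regular-expression tokenizer. This is not ANTLR and not CodeSimian's grammar; the token classes, the letters chosen for them, and the function name `tokenize` are all assumptions for illustration, and the sketch only covers the handful of tokens in the example above.

```python
import re

# (letter, pattern) pairs; order matters: keywords before identifiers.
TOKEN_SPEC = [
    ("T", r"\b(?:int|long|double|boolean|char)\b"),  # type declaration
    ("N", r"\d+"),                                    # integer value
    ("V", r"[A-Za-z_]\w*"),                           # variable name
    ("=", r"="),                                      # assignment operator
    (";", r";"),                                      # statement end
    ("{", r"\{"),                                     # open bracket
    ("}", r"\}"),                                     # close bracket
]

# One capturing group per token class, so `lastindex` identifies the class.
PATTERN = re.compile("|".join(f"({pattern})" for _, pattern in TOKEN_SPEC))


def tokenize(source: str) -> str:
    """Return one letter per recognized token; whitespace is skipped."""
    letters = []
    for match in PATTERN.finditer(source):
        letters.append(TOKEN_SPEC[match.lastindex - 1][0])
    return "".join(letters)


if __name__ == "__main__":
    print(tokenize("{ int c = 0; }"))
    print(tokenize("{ int counter = 42; }"))
```

Because the variable name and the literal value are each reduced to a single letter, both inputs produce the same 7-letter string, which is why renaming variables does not hide copied code.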
Compressing the String
This string is then compressed using a Lempel-Ziv compression algorithm with unbounded buffers
• As the string is read, a library of previously seen substrings is built up
• When a repeat is detected, the compressor emits a pointer into the library instead of repeating the section
Compressing the String
• Normally, limits exist on the library size and on the length of the “words” stored
• Memory utilization and efficiency are not important here, so the buffers can be left unbounded
• Lempel-Ziv is suitable for this application
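The library-and-pointer idea can be sketched as follows. The slides do not say which Lempel-Ziv variant CodeSimian uses, so this LZ78-style version with an unbounded dictionary is an assumption, and the function names are illustrative.

```python
def lz78_compress(text):
    """Return a list of (library_index, next_char) pairs.

    The library grows without any size limit as the text is read.
    """
    library = {"": 0}  # phrase -> index; index 0 is the empty phrase
    output = []
    phrase = ""
    for ch in text:
        candidate = phrase + ch
        if candidate in library:
            # Repeat detected: keep extending the current phrase.
            phrase = candidate
        else:
            # Emit a pointer to the known prefix plus the new character.
            output.append((library[phrase], ch))
            library[candidate] = len(library)
            phrase = ""
    if phrase:
        # Flush a trailing phrase that was already in the library.
        output.append((library[phrase[:-1]], phrase[-1]))
    return output


def lz78_decompress(pairs):
    """Rebuild the text by replaying the library construction."""
    phrases = [""]
    pieces = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)


if __name__ == "__main__":
    tokens = "{TV=N;}" * 20
    pairs = lz78_compress(tokens)
    print(len(tokens), "letters ->", len(pairs), "pairs")
```

The number of emitted pairs is a rough size measure: highly repetitive token strings need far fewer pairs than they have letters.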
Comparing the Compressed String
• K(x) is approximated by the size of the compressed, tokenized code x.
• K(x|y) is approximated by the size of the compressed, tokenized code x, given y as a “free” library.
• K(xy) is approximated by the size of the compressed, tokenized code of x concatenated with y.
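These three sizes can be approximated with a standard compressor. The sketch below swaps in Python's `zlib` (DEFLATE with a bounded 32 KB window) for CodeSimian's unbounded Lempel-Ziv compressor, and uses zlib's preset dictionary (`zdict`) to play the role of the "free" library y. The function names, the 20-letter token alphabet, and the randomly generated token strings are all assumptions for illustration.

```python
import random
import zlib


def k(x: bytes) -> int:
    """Approximate K(x) by the zlib-compressed size of x."""
    return len(zlib.compress(x, 9))


def k_given(x: bytes, y: bytes) -> int:
    """Approximate K(x|y): compress x with y supplied as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=y)
    return len(comp.compress(x) + comp.flush())


def similarity(x: bytes, y: bytes) -> float:
    """s(x, y) = (K(x) - K(x|y)) / K(xy)."""
    return (k(x) - k_given(x, y)) / k(x + y)


ALPHABET = "TVN=;{}()+-*/<>[].,!"  # hypothetical one-letter token classes


def random_tokens(seed: int, length: int = 2000) -> bytes:
    """Generate a stand-in tokenized file from a seeded random generator."""
    rng = random.Random(seed)
    return "".join(rng.choice(ALPHABET) for _ in range(length)).encode("ascii")


original = random_tokens(1)
# A "copy" of the original with a little junk inserted in the middle.
copied = original[:1000] + b";;;;" + original[1000:]
unrelated = random_tokens(2)

if __name__ == "__main__":
    print("copied vs original:   ", round(similarity(copied, original), 2))
    print("unrelated vs original:", round(similarity(unrelated, original), 2))
```

The copied file scores close to 1 and the unrelated file close to 0. Because zlib's window is bounded, this stand-in only behaves well for inputs well under 32 KB, which is one reason an unbounded library is attractive for this application.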
Results
Using the test on trivial examples:
• LinkedList.java
• LinkedList2.java
• LinkedList3.java
• Changes were limited to renaming variables, reformatting, removing comments, rearranging variable declarations, and adding “junk” code such as random debugging text output.
• All files came out as >85% similar
Results
Using the test on a small real-world sample
Professor Kang’s CS201 HW1
• Relatively simple homework assignment
• 30-50% similarity average
• 95% similarity detected on one pair of submissions
• Confirmed by Professor Kang as correct
Results
Using the test on another small real-world sample
Professor Kang’s CS201 HW4
• More complex homework assignment involving 2-3 files; the Java files are broken down according to function
• Open question: might specialized function files present false positives?
• 30-70% similarity average
• 95+% similarity detected on pairs of submissions
• Confirmed by Professor Kang as correct
Results
• Things to note…
• The results showed a similarity of 80% on one pair of submissions, which is deemed significant by the application but is not necessarily conclusive
• Careful inspection by hand of the suspected files revealed one block of code that was apparently copied with variable name changes
Conclusions
• Successful test cases
• Simple and straightforward to use
• Based on an objective principle which works!
Future Work
• Enhancing the application to be able to compare internal “blocks” of code
• Improving the compression algorithm to better handle and adapt to “approximate matches”
• Improving the functionality with the GUI
• Adding the ability to print reports for whole directories