Plagiarism Monitoring and Detection -- Towards an Open Discussion


Post on 20-Jan-2016


Page 1: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Plagiarism Monitoring and Detection -- Towards an Open Discussion

Edward L. Jones

Computer Information Sciences

Florida A & M University

Tallahassee, Florida

Page 2: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Outline

What is Plagiarism, and Why Address It

Plagiarism Detection & Countermeasures

A Metrics-Based Detection Approach

Extending the Approach

Conclusions & Future Work

Page 3: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Why Tackle Plagiarism?

Plagiarism undermines educational objectives

Failure to address sends wrong message

A non-contrived ethical issue in computing

Plagiarism is hard to define

Plagiarism is costly to pursue/prosecute

An interesting problem for tinkering

Page 4: Plagiarism Monitoring and Detection -- Towards an Open Discussion

What is Plagiarism?

“use of another’s ideas, writings or inventions as one’s own” (Oxford American Dictionary, 1980)

Shades of Gray

– Theft of work

– Gift of work

– Collusion

– Collaboration

– Coincidence

Intent to Deceive

Page 5: Plagiarism Monitoring and Detection -- Towards an Open Discussion

How is it Detected?

By chance

– Anomalies

– Temporal proximity when grading

Automation methods

– Direct text comparison (Unix diff)

– Lexical pattern recognition

– Structural pattern recognition

– Numeric profiling
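
The simplest automated method listed above, direct text comparison, can be sketched in a few lines. This is only an illustration: it uses Python's standard `difflib` module in place of Unix diff, and the function name `text_similarity` and the two sample programs are hypothetical.

```python
import difflib

def text_similarity(a, b):
    """Line-by-line similarity in [0, 1]; 1.0 means the texts are identical."""
    matcher = difflib.SequenceMatcher(None, a.splitlines(), b.splitlines())
    return matcher.ratio()

# Hypothetical submissions: the second only renames one identifier.
original = "total = 0\nfor n in data:\n    total += n\n"
renamed = "s = 0\nfor n in data:\n    s += n\n"

same = text_similarity(original, original)
changed = text_similarity(original, renamed)
```

Renaming one identifier already changes two of the three lines, which suggests why direct text comparison is easy to defeat.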

Page 6: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Plagiarism Concealment Tactics

None

Change comments

Change formatting

Rename identifiers

Change data types

Reorder blocks

Reorder statements

Reorder expressions

Superfluous code

Alternative control structures

Page 7: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Prosecution -- DA in the House?

Course syllabus broaches the subject

– Concrete definition generally lacking

– Sense of “we’ll know it when we see it”

No Tolerance Policy

Investigation Stage

Prosecution Stage

Missed opportunity to teach?

Page 8: Plagiarism Monitoring and Detection -- Towards an Open Discussion

An Awareness Approach

Monitor closeness of student programs

– Objective measures

– Automated

Post anonymous closeness results in public

– Nonconfrontational awareness

– “A word to the wise …”

Benchmark student behavior

– Establishing thresholds

– Effects of course, language

Page 9: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Closeness Measures -- Physical

Program 1: (lines1, words1, characters1)

Program 2: (lines2, words2, characters2)

Euclidean Distance
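
As a rough sketch of the physical measure, the code below builds a (lines, words, characters) profile for each program and takes the Euclidean distance between the two. The function names and the sample programs are assumptions for illustration. The counts follow the conventions of the Unix wc command, except that characters are counted as Python characters rather than bytes.

```python
import math

def physical_profile(source):
    """Return (lines, words, characters) for a program's source text."""
    lines = source.count("\n")   # wc -l counts newline characters
    words = len(source.split())  # whitespace-separated words
    characters = len(source)     # characters, not bytes (assumption)
    return (lines, words, characters)

def euclidean_distance(p, q):
    """Straight-line distance between two profiles of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical submissions: the second adds one comment line.
program1 = "int main() {\n  return 0;\n}\n"
program2 = "int main() {\n  /* done */\n  return 0;\n}\n"

profile1 = physical_profile(program1)
profile2 = physical_profile(program2)
raw_distance = euclidean_distance(profile1, profile2)
```

The raw distance here is dominated by the character count, which is one reason a normalization step is useful before comparing profiles.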

Page 10: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Closeness Measures -- Halstead

Program 1: (length1, vocabulary1, volume1)

Program 2: (length2, vocabulary2, volume2)

Euclidean Distance
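
A Halstead profile can be sketched from the standard definitions: length N = N1 + N2 (total operators plus total operands), vocabulary n = n1 + n2 (distinct operators plus distinct operands), and volume V = N log2 n. The code assumes the program has already been split into operator and operand tokens. That tokenizer is language-specific and not shown, and the token lists below are hypothetical.

```python
import math

def halstead_profile(operators, operands):
    """Return (length, vocabulary, volume) from operator and operand token lists."""
    length = len(operators) + len(operands)                # N = N1 + N2
    vocabulary = len(set(operators)) + len(set(operands))  # n = n1 + n2
    volume = length * math.log2(vocabulary) if vocabulary else 0.0
    return (length, vocabulary, volume)

# Hypothetical tokens for the statement:  x = x + 1;
operators = ["=", "+", ";"]
operands = ["x", "x", "1"]
profile = halstead_profile(operators, operands)

# Renaming x to y leaves the counts, and so the profile, unchanged.
renamed_profile = halstead_profile(operators, ["y", "y", "1"])
```

Because only token counts enter the profile, comments, white space and identifier names have no effect on it.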

Page 11: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Comparison of Measures

Physical profile ==> weight test

– Simple/cheap to compute (Unix wc command)

– Sensitive to character variations

Halstead profile ==> content test

– More complex/expensive to compute

– Ignores comments and white space

– Sensitive only to changes in program content

Detection effectiveness vs. plagiarism tactic

Page 12: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Closeness Computation

Normalization

– Establish upper bound for comparison (1.414)

– Distance computed on normalized (unit) vectors

Normalization I -- Self normalization

– p = (a, b, c) ==> (a/L, b/L, c/L), where L = sqrt(a^2 + b^2 + c^2) is the length of p

– Largest component dominates

Normalization II -- Global scaling

– p = (a, b, c) ==> q = (a/aMAX, b/bMAX, c/cMAX)

– Self normalization applied to q
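
The two normalizations can be sketched as follows, taking L to be the Euclidean length of the profile vector. The function names and sample profiles are assumptions. For unit vectors with non-negative components the largest possible distance is sqrt(2), about 1.414, which is the upper bound quoted above.

```python
import math

def self_normalize(p):
    """Normalization I: scale p to unit length."""
    length = math.sqrt(sum(x * x for x in p))
    if length == 0:
        return tuple(0.0 for _ in p)
    return tuple(x / length for x in p)

def global_scale(profiles):
    """First step of Normalization II: divide each component by its maximum over all profiles."""
    maxima = [max(column) for column in zip(*profiles)]
    return [tuple(x / m if m else 0.0 for x, m in zip(p, maxima))
            for p in profiles]

def closeness(p, q):
    """Euclidean distance between the unit vectors of p and q."""
    u, v = self_normalize(p), self_normalize(q)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical raw profiles; the third component dominates before scaling.
raw = [(10, 2, 100), (5, 4, 50)]
scaled = global_scale(raw)
closeness_i = closeness(raw[0], raw[1])         # Normalization I only
closeness_ii = closeness(scaled[0], scaled[1])  # Normalization II
```

In this example the pair looks much closer under Normalization I than under Normalization II, because the large third component hides the difference in the second one until global scaling evens out the components.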

Page 13: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Distribution of Closeness Values

[Figure 2. Distribution of Halstead Closeness. x-axis: Student Program Pairs (0 to 500); y-axis: Closeness Measure (0.00000 to 0.04500)]

Page 14: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Comparison of Profiles

Page 15: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Closeness Distribution

Closeness values vary by assignment

Programming language may lead to clustering at the lower end of the spectrum

Reuse of modules leads to clustering at the lower end of the spectrum

No a priori threshold pin-pointing plagiarism

All measures exhibit these behaviors

Page 16: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Suspect Identification

Collaboration Suspects (5-th Percentile)

Rank  Closeness   student1  student2
1     0.00000000  alpha     alpha
2     0.00000652  alpha     beta
3     0.00026963  beta      gamma
4     0.00026981  alpha     gamma
5     0.00031262  gamma     epsilon
6     0.00048815  sigma     delta
7     0.00049825  alpha     epsilon
8     0.00050169  beta      epsilon
9     0.00066481  gamma     theta
10    0.00073158  beta      theta
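
A suspect list like this one could be produced by ranking every pair of students by closeness and keeping the closest few percent. The sketch below is an assumption about how the cut might be made. The slides do not say how the 5th percentile was computed, so the function names, the rounding rule and the sample profiles are all hypothetical.

```python
import math
from itertools import combinations

def distance(p, q):
    """Euclidean distance between two (already normalized) profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def rank_pairs(profiles):
    """Return (closeness, student1, student2) for every pair, closest first."""
    pairs = [
        (distance(p1, p2), s1, s2)
        for (s1, p1), (s2, p2) in combinations(sorted(profiles.items()), 2)
    ]
    return sorted(pairs)

def suspects(ranked, percent=5.0):
    """Keep the closest `percent` of pairs, and always at least one pair."""
    cutoff = max(1, math.ceil(len(ranked) * percent / 100.0))
    return ranked[:cutoff]

# Hypothetical normalized profiles for three students.
profiles = {"alpha": (1.0, 0.0), "beta": (1.0, 0.0), "gamma": (0.0, 1.0)}
ranked = rank_pairs(profiles)
flagged = suspects(ranked)
```

With three students there are three pairs, so the 5 percent cut keeps only the single closest pair, here alpha and beta.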

Page 17: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Independence Index

Student Independence Indices

Index  student
1      alpha
2      beta
3      gamma
5      epsilon
6      sigma
6      delta
9      theta

Index = position at which student debuts on Closeness List
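
Under the definition above, the index can be computed in one pass over the ranked closeness list. The sketch below uses the ten pairs from the Collaboration Suspects list. The function name is an assumption.

```python
def independence_indices(closeness_list):
    """Map each student to the rank at which they first appear on the list."""
    index = {}
    for rank, (student1, student2) in enumerate(closeness_list, start=1):
        for student in (student1, student2):
            index.setdefault(student, rank)  # keep only the first (lowest) rank
    return index

# Pairs in rank order, closest first, taken from the suspect list.
closeness_list = [
    ("alpha", "alpha"), ("alpha", "beta"), ("beta", "gamma"),
    ("alpha", "gamma"), ("gamma", "epsilon"), ("sigma", "delta"),
    ("alpha", "epsilon"), ("beta", "epsilon"), ("gamma", "theta"),
    ("beta", "theta"),
]
indices = independence_indices(closeness_list)
```

A low index means the student appears early on the closeness list. A student who never appears on the list has no entry in the result.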

Page 18: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Preponderance of Evidence

Historical Record of Student Behavior

– Collaboration/partnering

– Independence indices

Profile and analyze other artifacts

– Compilation logs

– Execution logs

Page 19: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Another Approach

Make student demonstrate familiarity with submitted program

– Seed errors into program

– Time limit for removing error and resubmitting

Holistic approach

– Intentional, not accidental

Page 20: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Conclusions

We can do something about plagiarism -- the first step is to develop eyes and ears

Simple metrics appear to be adequate

Tools are essential

Sophistication is not as necessary as automation

Students are curious to know how they compare with other students

Page 21: Plagiarism Monitoring and Detection -- Towards an Open Discussion

On-Going & Future Work

Complete the toolset

– Student Independence Index

Incorporate other Artifacts

– Compilation logs

– Execution logs

Integrate into Automated Grading

Disseminate Results

– Package tool as shareware

Page 22: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Questions?


Page 23: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Thank You

Page 24: Plagiarism Monitoring and Detection -- Towards an Open Discussion

Flow Chart

Student Programs ==> Profile ==> Compute Closeness ==> Suspicious Programs