Copy or Not
Dawei (David) Shi
Copy Or Not
Introduction
Algorithm
Framework
Future work
Demo
Introduction
A web-based document comparator
Calculates an accurate similarity score between two documents
Algorithm
Preprocessing
Vector space
Similarity calculation
Preprocessing
Lowercasing
Stop-word filtering
Stemming
Preprocessing
Stemming
› Porter Stemming Algorithm
› E.g.
  cat – cats
  meet – meeting
  agree – agreed
  correct – correctness
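The three preprocessing steps can be sketched as a single function. This is a minimal illustration, not the project's code: `STOP_WORDS` is a deliberately tiny hypothetical list, `preprocess` and `toy_stem` are made-up names, and `toy_stem` only strips a plural "s" as a stand-in. A real Porter stemmer implementation would be passed as the `stem` argument instead.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one is much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def preprocess(text, stop_words=STOP_WORDS, stem=lambda word: word):
    """Lowercase the text, split it into words, drop stop words, stem the rest."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(word) for word in words if word not in stop_words]


def toy_stem(word):
    """Crude plural stripping only; this is NOT the Porter algorithm."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


tokens = preprocess("The cats are on the mats", stem=toy_stem)
```

Keeping the stemmer as a parameter means the stand-in can be swapped for a full Porter implementation without touching the rest of the pipeline.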
Vector Space
Build dictionary 1
› word -> frequency
Sort the keys of dictionary 1
Build dictionary 2
› key -> (index, count)
Build binary vectors
› index -> occurrence
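One possible reading of these steps is sketched below. The slide does not say whether the dictionary is built per document or shared, so this sketch assumes a single dictionary over both documents so that the two vectors have the same length. The function name `build_binary_vectors` is hypothetical.

```python
from collections import Counter


def build_binary_vectors(tokens1, tokens2):
    """Return (dictionary 2, vector for document 1, vector for document 2)."""
    # Dictionary 1: word -> frequency (assumed to cover both documents).
    frequency = Counter(tokens1) + Counter(tokens2)

    # Dictionary 2: key -> (index, count), with indices taken from the sorted keys.
    index_of = {
        word: (i, frequency[word]) for i, word in enumerate(sorted(frequency))
    }

    def to_vector(tokens):
        # Binary vector: position index holds 1 if the word occurs, else 0.
        vector = [0] * len(index_of)
        for word in set(tokens):
            vector[index_of[word][0]] = 1
        return vector

    return index_of, to_vector(tokens1), to_vector(tokens2)


index_of, v1, v2 = build_binary_vectors(["cat", "meet", "cat"], ["agre", "cat"])
```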
Similarity Calculation
Vectors v1 and v2
Similarity = (v1 · v2) / (norm(v1) * norm(v2))
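The formula above is the cosine of the angle between the two vectors. A plain-Python sketch follows; the function name and the choice to return 0.0 for an all-zero vector are assumptions made for illustration, not something the slides specify.

```python
import math


def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: a document with no words is treated as sharing nothing.
        return 0.0
    return dot / (norm1 * norm2)


score = cosine_similarity([0, 1, 1], [1, 1, 0])
```

For binary vectors the dot product is simply the number of shared words, so the score ranges from 0 (no words in common) to 1 (identical word sets).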
Performance
Algorithms coded in Python
› Dynamic typing
› Not good at numerical operations
Solution: numpy
Numpy
A Python extension module
Written mostly in C
Defines numerical array and matrix types and basic operations on them
Numpy vs Python
Python code
› a = range(10000000)
› b = range(10000000)
› c = []
› for i in range(len(a)):
›     c.append(a[i] + b[i])
Takes up to 10 seconds on a several-GHz processor
Numpy vs Python
Numpy code
› import numpy as np
› a = np.arange(10000000)
› b = np.arange(10000000)
› c = a + b
Almost instant
Numpy Usage
Vector dot product
Vector normalization
Vector zero filling
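The three operations listed can be combined into one similarity routine, roughly as follows. This is a sketch under assumptions: "normalization" is read here as computing the Euclidean norm with `np.linalg.norm`, "zero filling" as allocating vectors with `np.zeros`, and the function name and index-list inputs are hypothetical.

```python
import numpy as np


def similarity(indices1, indices2, size):
    """Cosine similarity of two binary vectors given the indices of their words."""
    # Zero filling: start from all-zero vectors of the dictionary size.
    v1 = np.zeros(size)
    v2 = np.zeros(size)

    # Mark the words that occur in each document.
    v1[list(indices1)] = 1.0
    v2[list(indices2)] = 1.0

    # Norms and dot product are computed in compiled code, not a Python loop.
    norms = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norms == 0:
        return 0.0  # assumption: an empty document matches nothing
    return float(np.dot(v1, v2) / norms)


score = similarity([1, 2], [0, 1], size=3)
```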
Framework
Django
› The web framework for perfectionists with deadlines
Libraries
Python
› Numpy
› Porter Stemming
jQuery
Hosting
Alwaysdata
› Django 1.3
› Python 2.6
Future Work
Support file uploading and comparison
Add HTML5 features
Demo
http://imds.alwaysdata.net