Copy or Not Dawei (David) Shi

Post on 20-Dec-2015

Copy Or Not

Introduction Algorithm Framework Future work Demo

Introduction

A web-based document comparator
Calculates an accurate similarity score between two documents

Algorithm

Preprocessing
Vector space
Similarity calculation

Preprocessing

Lowercase
Stop words filtering
Stemming

Preprocessing

Stemming
› Porter Stemming Algorithm
› E.g.
  cat - cats
  meet - meeting
  agree - agreed
  correct - correctness
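The three preprocessing steps can be sketched roughly as below. This is a minimal illustration, not the project's code: the stop-word list is an assumed toy list, and `toy_stem` is a hypothetical stand-in that strips a few suffixes. It is not the Porter algorithm and only marks where a real Porter stemmer would be plugged in.

```python
import re

# Assumption: a tiny illustrative stop-word list; the slides do not give one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on"}


def toy_stem(word):
    """Hypothetical placeholder stemmer that strips a few common suffixes.

    This is NOT the Porter algorithm; swap in a real implementation here.
    """
    for suffix in ("ness", "ing", "ed", "s"):
        # Keep at least three characters of stem to avoid over-stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stem=toy_stem):
    """Lowercase the text, split it into words, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]
```

For example, `preprocess("The cats are meeting")` returns `["cat", "meet"]` with this toy stemmer.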

Vector Space

Build dictionary 1
› word -> frequency
Sort the keys of dictionary 1
Build dictionary 2
› key -> (index, count)
Build binary vectors
› index -> occurrence
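One possible reading of these steps is sketched below. It assumes that dictionary 1 is built over the words of both documents so the two vectors share one index space; the slides do not say this explicitly. The function names `build_index` and `binary_vector` are hypothetical.

```python
from collections import Counter


def build_index(tokens_a, tokens_b):
    """Return dictionary 2: word -> (index, count), indexed in sorted key order."""
    # Dictionary 1: word -> frequency (assumed here to cover both documents).
    freq = Counter(tokens_a) + Counter(tokens_b)
    return {word: (i, freq[word]) for i, word in enumerate(sorted(freq))}


def binary_vector(tokens, index):
    """Return a list where position i is 1 if the word with index i occurs."""
    vec = [0] * len(index)
    for word in set(tokens):
        if word in index:
            vec[index[word][0]] = 1
    return vec


doc_a = ["cat", "meet"]
doc_b = ["cat", "agree"]
index = build_index(doc_a, doc_b)
v1 = binary_vector(doc_a, index)  # [0, 1, 1]
v2 = binary_vector(doc_b, index)  # [1, 1, 0]
```

Sorting the keys makes the index assignment deterministic, so both documents map the same word to the same vector position.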

Similarity Calculation

Vectors v1 and v2
Similarity = (v1 · v2) / (norm(v1) * norm(v2)), where v1 · v2 is the dot product
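The formula is the cosine similarity of the two vectors. A plain-Python sketch follows; returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero, not something the slides specify.

```python
import math


def cosine_similarity(v1, v2):
    """Return (v1 . v2) / (norm(v1) * norm(v2)) for equal-length vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm1 * norm2)
```

For the binary vectors `[0, 1, 1]` and `[1, 1, 0]`, the dot product is 1 and each norm is the square root of 2, so the similarity is 0.5.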

Performance

Algorithms coded in Python
› Dynamic typing
› Not good at numerical operations
Solution: NumPy

Numpy

A Python extension module
Written mostly in C
Defines numerical array and matrix types and basic operations on them

Numpy vs Python

Python code
    a = range(10000000)
    b = range(10000000)
    c = []
    # Element-by-element addition in an interpreted loop
    for i in range(len(a)):
        c.append(a[i] + b[i])

Takes up to 10 seconds on a multi-GHz processor

Numpy vs Python

Numpy code
    import numpy as np
    a = np.arange(10000000)
    b = np.arange(10000000)
    c = a + b  # element-wise addition over the whole array

Almost instant

Numpy Usage

Vector dot product
Vector normalization
Vector zero filling
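A rough sketch of how these three operations might fit together in NumPy is shown below. The helper names `occurrence_vector` and `unit`, the vector size, and the example indices are all hypothetical. "Normalization" is read here as scaling a vector to unit length, which is one plausible interpretation of the slide.

```python
import numpy as np


def occurrence_vector(indices, size):
    """Zero filling: start from an all-zero vector, then mark occurrences."""
    vec = np.zeros(size)
    vec[list(indices)] = 1.0
    return vec


def unit(vec):
    """Normalization: scale to unit length (zero vectors are returned as-is)."""
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


v1 = occurrence_vector([1, 2], 3)
v2 = occurrence_vector([0, 1], 3)

# Dot product of the unit vectors equals the cosine similarity.
similarity = float(np.dot(unit(v1), unit(v2)))
```

Normalizing first means the similarity reduces to a single dot product, which NumPy evaluates in compiled code rather than in a Python loop.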

Framework

Django
› The web framework for perfectionists with deadlines

Libraries

Python
› Numpy
› Porter Stemming
jQuery

Hosting

Alwaysdata
› Django 1.3
› Python 2.6

Future Work

Support file uploading and comparison
Add HTML5 features

Demo

http://imds.alwaysdata.net

Thank you!