
Page 1: Anchor Points Algorithms for Hamming and Edit Distance

Foto Afrati, National Technical University of Athens
Anish Das Sarma, ClearList Inc.
Anand Rajaraman, Cambrian Ventures
Pokey Rule, Stanford University
Semih Salihoglu, Stanford University
Jeff Ullman, Stanford University

Page 2: Fuzzy Joins

Input: a set of records R = {rec1, rec2, …, recm}
Output: all pairs <reci, recj> such that dist(reci, recj) ≤ d, e.g. <rec1, rec5>, <rec7, rec9>, …, <rec3, reck>

Example applications: entity resolution, clustering, collaborative filtering
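As a baseline for the definition above, here is a minimal sketch of a fuzzy join that simply compares every pair of records. The function names `hamming` and `naive_fuzzy_join` and the sample records are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

def hamming(x, y):
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

def naive_fuzzy_join(records, dist, d):
    """Return every unordered pair of records whose distance is at most d."""
    return [(r, s) for r, s in combinations(records, 2) if dist(r, s) <= d]

# Hypothetical input; any distance function with the same signature works.
records = ["00000", "00001", "10011", "10010", "11111"]
pairs = naive_fuzzy_join(records, hamming, d=1)
print(pairs)
```

This compares all |R|(|R|-1)/2 pairs in one place, which is exactly what the MapReduce algorithms on the following slides try to avoid.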

Page 3: Two Specific Distance Measures

1. Hamming Distance
   Input: bit strings R of length n, e.g. 00000, 00001, …, 10011
   Output: e.g. <00000, 00001>, …, <10011, 10010>

2. Edit Distance
   Input: strings R of length n over alphabet A, e.g. abcd, eabc, …, dddd
   Output: e.g. <abcd, eabc>, …, <dddd, dadd>
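The two measures can be sketched as follows. The edit distance here counts insertions and deletions only; that is an assumption suggested by the insertion and deletion codes later in the deck, since the slides do not spell out the edit operations. Both function names are illustrative.

```python
def hamming_distance(x, y):
    """Number of differing positions; defined only for equal-length strings."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))

def indel_distance(x, y):
    """Edit distance with insertions and deletions only (assumed definition).

    Computed as len(x) + len(y) - 2 * LCS(x, y) with a row-by-row LCS table.
    """
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return len(x) + len(y) - 2 * prev[-1]

print(hamming_distance("10011", "10010"))
print(indel_distance("abcd", "eabc"))
print(indel_distance("dddd", "dadd"))
```

Under this definition a substitution costs 2 (one deletion plus one insertion), so both example pairs on the slide are at edit distance 2.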

Page 4: Fuzzy Joins In One-Round MapReduce

[Figure: the Map phase sends each input record rec1, rec2, rec3, …, recm-1, recm to one or more reducers. Each reducer (keys reducer1, reducer2, …, reducerp) receives a list of records as its values, e.g. "rec1, rec5, rec7", "rec2, rec7, recm", "rec2, recm". Two costs are annotated: per-reducer memory cost and communication cost.]
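A minimal single-machine sketch of this one-round scheme, under two assumptions that the slide does not state explicitly: communication cost is counted as the total number of (key, record) pairs emitted by the mappers, and per-reducer memory as the largest number of records any one reducer receives. `run_one_round` and the map function are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def run_one_round(records, map_fn, d):
    """Simulate one MapReduce round and report the output plus both costs."""
    reducers = defaultdict(list)
    for rec in records:                       # Map: route rec to its reducer keys
        for key in map_fn(rec):
            reducers[key].append(rec)
    output = set()
    for values in reducers.values():          # Reduce: compare records that met
        for r, s in combinations(values, 2):
            if hamming(r, s) <= d:
                output.add((min(r, s), max(r, s)))
    communication = sum(len(v) for v in reducers.values())
    max_reducer_load = max((len(v) for v in reducers.values()), default=0)
    return output, communication, max_reducer_load

records = ["00000", "00001", "10011", "10010", "11111"]
# Hypothetical map function: every record goes to a single reducer.
output, communication, load = run_one_round(records, lambda rec: ["all"], d=1)
print(output, communication, load)
```

Sending everything to one reducer minimizes communication but maximizes per-reducer memory; the algorithms compared on the next slide sit at different points of that trade-off.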

Page 5: Communication Cost vs Per-reducer Memory

[Figure: plot of communication cost against per-reducer memory, with axis labels referring to |R| = 2^n. Points are marked for the Grouping (naïve), Ball-Hashing, and Splitting algorithms, with Anchor Points introduced as a further option.]

Page 6: Outline

1. Anchor Points Algorithm
   • Covering Code
2. Explicit Construction of Hamming Distance Covering Codes
3. Explicit Construction of Edit Distance Covering Codes

Page 8: Covering Code

Given a set of strings R of length n, and a radius k.
Definition: an <n, k> covering code C is a set of strings such that for each s ∈ R, there is a c ∈ C with dist(c, s) ≤ k.

Notation used on the following slides:
n: length of strings
d: distance of pairs
k: radius of code
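The definition translates directly into a check. This is a small sketch using Hamming distance; `is_covering_code` is an illustrative name, and the example code {00000, 11111} is the one the deck itself uses for n = 5, k = 2.

```python
from itertools import product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def is_covering_code(code, strings, k, dist=hamming):
    """True if every string has some code word within distance k."""
    return all(any(dist(c, s) <= k for c in code) for s in strings)

all_5bit = ["".join(bits) for bits in product("01", repeat=5)]
covers = is_covering_code({"00000", "11111"}, all_5bit, k=2)
covers_single = is_covering_code({"00000"}, all_5bit, k=2)
print(covers, covers_single)  # True False
```

Every 5-bit string has either at most two 1s or at most two 0s, which is why two code words suffice at radius 2, while a single code word does not.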

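Page 9: Example Covering Code

Example: Hamming distance, n = 5, k = 2

[Figure: the set R of all bit strings of length 5, arranged in layers by number of 1s: 00000; 00001 … 01000 10000; 00011 00101 … 10001 11000; 00111 … 10011 … 11100; 01111 … 11101 11110; 11111. The strings 00000 and 11111 are singled out as code words.]

C = {00000, 11111} is a <5, 2> covering code: every string with at most two 1s is within distance 2 of 00000, and every string with at least three 1s is within distance 2 of 11111.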

Page 10: Anchor Points Algorithm (1)

Let C be an <n, k> covering code (e.g. n = 5, k = 2).
One reducer for each code word.
Map s to the code words at distance ≤ k + d/2 (e.g. d = 2 => 2 + 2/2 = 3).

[Figure: Map sends the input strings 00000, 01000, 01011, 01100, …, 11110, 11111 to the Reduce side, which has one reducer per code word: r00000 and r11111.]
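The three steps above can be sketched as a single-machine simulation. It assumes Hamming distance and an even d (so that d/2 is an integer); the function name `anchor_points_join` is illustrative, and a real mapper would enumerate nearby code words rather than scan the whole code.

```python
from collections import defaultdict
from itertools import combinations, product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def anchor_points_join(strings, code, k, d):
    """One-round fuzzy join; assumes d is even and code is an <n, k> covering code."""
    reach = k + d // 2
    reducers = defaultdict(list)
    for s in strings:                        # Map: s goes to every code word in reach
        for c in code:
            if hamming(s, c) <= reach:
                reducers[c].append(s)
    result = set()
    for members in reducers.values():        # Reduce: compare strings that met at c
        for u, w in combinations(members, 2):
            if hamming(u, w) <= d:
                result.add((min(u, w), max(u, w)))   # the set removes duplicates
    return result

strings = ["".join(b) for b in product("01", repeat=5)]
found = anchor_points_join(strings, ["00000", "11111"], k=2, d=2)
brute = {(min(u, w), max(u, w))
         for u, w in combinations(strings, 2) if hamming(u, w) <= 2}
print(found == brute, len(found))
```

A pair can meet at more than one reducer, so the sketch collects results in a set; a distributed version would need its own rule for emitting each pair once.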

Page 11: Anchor Points Algorithm (2)

[Figure: strings u and w with dist(u, w) ≤ d, a string v between them with dist(u, v) ≤ d/2 and dist(v, w) ≤ d/2, and a code word c with dist(c, v) ≤ k.]

Triangle inequality: dist(c, u) ≤ k + d/2 and dist(c, w) ≤ k + d/2, so u and w are both sent to the reducer for c and the pair is found there.

Page 12: Cost of Anchor Points Algorithm

B(n, r): size of the ball of radius r
Per-reducer memory: B(n, k + d/2)
Communication: |C| · B(n, k + d/2)

[Figure: the reducer for code word c receives every string within distance k + d/2 of c, e.g. s1, s4, s5, s6, s7, s9, s11, s17.]
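For Hamming distance the ball size is a sum of binomial coefficients, B(n, r) = C(n, 0) + C(n, 1) + … + C(n, r), so the two costs are easy to evaluate. A small sketch, assuming R contains all 2^n bit strings (so every ball is full) and d is even; the function names are illustrative.

```python
from math import comb

def ball_size(n, r):
    """Number of bit strings within Hamming distance r of a fixed n-bit string."""
    return sum(comb(n, i) for i in range(min(r, n) + 1))

def anchor_points_cost(n, k, d, code_size):
    """Return (per-reducer memory, communication) for an <n, k> code of the given size."""
    per_reducer = ball_size(n, k + d // 2)
    return per_reducer, code_size * per_reducer

cost = anchor_points_cost(n=5, k=2, d=2, code_size=2)
print(cost)  # (26, 52)
```

With n = 5, k = 2, d = 2 each of the two reducers receives B(5, 3) = 1 + 5 + 10 + 10 = 26 strings.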

Page 13: Communication Cost vs Per-reducer Memory

[Figure: the same plot of communication cost against per-reducer memory as before, with points for Grouping (naïve), Ball-Hashing, and Splitting, now joined by a series of Anchor Points configurations for k = 0, k = 1, k = 2, …, k = n.]

Page 14: Outline

1. Anchor Points Algorithm
   • Covering Code
2. Explicit Construction of Hamming Distance Codes
3. Explicit Construction of Edit Distance Codes

Page 15: Some Known Hamming Distance Codes

k    n              |C|
0    any            2^n
n    any            1
1    n = 2^r - 1    2^n/(n+1)    (Hamming codes)

Perfect <n, k> code (i.e., smallest possible): size 2^n/B(n, k)

For any k: codes of size n·2^n/B(n, k) are known to exist => not perfect.
Problem: no explicit construction.
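The k = 1 row can be checked concretely. This sketch builds the binary Hamming code of length n = 2^r - 1 with the standard syndrome construction (a word is a code word when the XOR of the 1-based positions holding a 1 is zero); that construction is textbook material rather than something shown on the slide, and `hamming_code` is an illustrative name.

```python
from itertools import product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def hamming_code(r):
    """Binary Hamming code of length n = 2**r - 1: words whose syndrome is zero."""
    n = 2 ** r - 1
    code = []
    for bits in product("01", repeat=n):
        syndrome = 0
        for position, bit in enumerate(bits, start=1):
            if bit == "1":
                syndrome ^= position          # XOR of the positions holding a 1
        if syndrome == 0:
            code.append("".join(bits))
    return code

code = hamming_code(3)                        # n = 7
all_7bit = ["".join(b) for b in product("01", repeat=7)]
is_radius_1 = all(any(hamming(c, s) <= 1 for c in code) for s in all_7bit)
print(len(code), 2 ** 7 // (7 + 1), is_radius_1)  # 16 16 True
```

The code has 2^7/(7+1) = 16 words and covers all 128 strings at radius 1, matching the table entry.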

Page 16: Cross Product Method (Explicit HD <n, k> Codes)

Start with an <n/t, k/t> code D.
Let C = D × D × … × D (t times).
Claim: C is an <n, k> covering code.
Proof: split s into t pieces of length n/t, s = s1 s2 s3 … st. Each piece si has a code word di ∈ D with dist(si, di) ≤ k/t. For c = d1 d2 d3 … dt, the distances add up piece by piece, so dist(s, c) ≤ t · (k/t) = k.
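The construction and its claim can be checked by brute force on a small case. This sketch uses D = {00000, 11111} with t = 2, giving n = 10 and k = 4; `cross_product_code` is an illustrative name.

```python
from itertools import product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def cross_product_code(base_code, t):
    """Concatenate every t-tuple of base code words."""
    return ["".join(parts) for parts in product(base_code, repeat=t)]

D = ["00000", "11111"]                         # a <5, 2> covering code
C = cross_product_code(D, t=2)                 # claimed to be a <10, 4> code
all_10bit = ["".join(b) for b in product("01", repeat=10)]
is_cover = all(min(hamming(c, s) for c in C) <= 4 for s in all_10bit)
print(sorted(C), is_cover)
```

The check enumerates all 1024 strings of length 10, so it is only practical for small n; the proof on the slide is what guarantees the claim in general.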

Page 17: Example of Cross Product Method

n = 10, k = 4, t = 2 => use a <5, 2> code D = {00000, 11111}

C = {00000-00000, 00000-11111, 11111-00000, 11111-11111}

1100011100 = 11000-11100: distance to 00000-11111 is ≤ 2 + 2 = 4
1110000001 = 11100-00001: distance to 11111-00000 is ≤ 2 + 1 = 3

Page 18: Size of Cross Product Codes

Assume D is perfect (e.g., Hamming code).

Size of the cross product code: |D|^t
Perfect <n, k> code: 2^n/B(n, k)

For large n and small t => same asymptotic size

Example: n, k = 2, t = 2 (cross product code vs perfect code)
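A worked version of this comparison, derived from the definitions on the earlier slides (C = D × … × D, and size 2^n/B(n, k) for a perfect code) rather than copied from this slide, so the exact form the authors showed may differ:

```latex
\begin{align*}
|D| &= \frac{2^{n/t}}{B(n/t,\,k/t)}, &
|C| &= |D|^{t} = \frac{2^{n}}{B(n/t,\,k/t)^{t}}, &
|C_{\text{perfect}}| &= \frac{2^{n}}{B(n,\,k)}. \\[6pt]
\intertext{For $k = 2$, $t = 2$, $D$ is an $\langle n/2, 1\rangle$ code and $B(m, 1) = m + 1$:}
|C| &= \frac{2^{n}}{(n/2 + 1)^{2}} = \frac{4 \cdot 2^{n}}{(n+2)^{2}}, &
|C_{\text{perfect}}| &= \frac{2^{n}}{1 + n + \binom{n}{2}} = \frac{2 \cdot 2^{n}}{n^{2} + n + 2}.
\end{align*}
```

Both sizes are Θ(2^n / n^2); the cross product code is larger by a factor that tends to 2 as n grows.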

Page 19: Outline

1. Anchor Points Algorithm
   • Covering Code
2. Explicit Construction of Hamming Distance Covering Codes
3. Explicit Construction of Edit Distance Covering Codes

Page 20: Edit Distance Fuzzy Joins

Input: strings of length n over alphabet A (i.e., |A|^n strings), e.g. abcd, eabc, cadb, …, dadd, dddd
Output: e.g. <abcd, eabc>, …, <dddd, dadd>

The covering codes algorithm works in the same way:
If C is an <n, k> edit distance code, send s to all code words at distance ≤ k + d/2.

Page 21: Differences with Hamming Distance

1. Length of code words might be different
   E.g. 1 insertion, |c| = n+1 => insertion-1 code
   E.g. 1 deletion, |c| = n-1 => deletion-1 code
2. Different code words might have different ball sizes
3. No known perfect codes or explicit construction

[Figure: two code words of length n+1 and the length-n strings each one covers. ababa…a covers aaba…a, abba…a, abaa…a, baba…a, …, while aaaaa…a covers only aaaa…a.]
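Point 2 is easy to see by enumerating what a code word of length n+1 covers, that is, the strings of length n obtained from it by one deletion. A minimal sketch with n = 5; `deletion_ball` is an illustrative name.

```python
def deletion_ball(word):
    """All distinct strings obtained by deleting exactly one character."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}

mixed = sorted(deletion_ball("ababaa"))
uniform = sorted(deletion_ball("aaaaaa"))
print(mixed)    # ['aabaa', 'abaaa', 'ababa', 'abbaa', 'babaa']
print(uniform)  # ['aaaaa']
```

Deleting any character of a run gives the same result, so a word with many runs covers many strings and a word with a single run covers exactly one.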

Page 22: Insertion-1 Codes

Let n = 5, |A| = a = 4; code words are of length 6.
Letters as integers from 0 to (a-1): e.g. 0230, 1124, …
Let si be the ith digit of s.
1. sum(s) = a position-weighted sum of the digits si (e.g. sum(23010) = 24, which matches weighting each digit by its position counted from the right)
2. score(s) = sum(s) % (n+1)(a-1) (e.g., modulus 6*3 = 18)
3. R = any a-1 consecutive residues: e.g. {0,1,2}, {12,13,14}, {16,17,0}

C = the strings of length n+1 whose score is in R, e.g. {003000, 303000, 003001, 003002, 200000, …}

|C|: a factor a worse than best possible
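The construction can be tried out by brute force. In this sketch the weighting (each digit times its position counted from the right, starting at 1) is an assumption inferred from the example sum(23010) = 24 and from the listed code words, which all score 12, 13, or 14 under it; the slide's own formula for sum(s) is not available. All function names are illustrative, and digits are single characters, so a ≤ 10.

```python
from itertools import product

def score(word, n, a):
    """Position-weighted digit sum (positions from the right), mod (n+1)(a-1)."""
    total = sum(int(digit) * pos for pos, digit in enumerate(reversed(word), start=1))
    return total % ((n + 1) * (a - 1))

def insertion1_code(n, a, residues):
    """All words of length n+1 over digits 0..a-1 whose score is in residues."""
    digits = "".join(str(x) for x in range(a))
    return {"".join(w) for w in product(digits, repeat=n + 1)
            if score(w, n, a) in residues}

def covered(s, code, a):
    """True if inserting one digit somewhere in s yields a code word."""
    return any(s[:i] + str(x) + s[i:] in code
               for i in range(len(s) + 1) for x in range(a))

n, a = 5, 4
code = insertion1_code(n, a, residues={12, 13, 14})
strings = ["".join(w) for w in product("0123", repeat=n)]
all_covered = all(covered(s, code, a) for s in strings)
listed = {"003000", "303000", "003001", "003002", "200000"} <= code
print(all_covered, listed, len(code))
```

Every one of the 4^5 strings of length 5 reaches some code word with a single insertion, which is the insertion-1 covering property.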

Page 23: Example: s = 23010, sum(s) = 24, score(s) = 6

[Figure: a number line of sums 23, 24, …, 43 with the matching scores 5, 6, …, 17, 0, 1, …, 7 (score = sum % 18), repeated once per insertion position. Each frame shows a pair of words labeled X and Y, formed by inserting 0 or 3 (the smallest and largest digit) at one position of s:]

X 023010   Y 323010
X 203010   Y 233010
X 230010   Y 233010
X 230010   Y 230310
X 230100   Y 230130
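The sums of the X and Y words can be recomputed with a few lines. As before, weighting each digit by its position counted from the right is an assumption chosen because it reproduces sum(23010) = 24; `weighted_sum` is an illustrative name.

```python
def weighted_sum(word):
    """Digit sum weighted by position, positions counted from the right from 1."""
    return sum(int(digit) * pos for pos, digit in enumerate(reversed(word), start=1))

s = "23010"
rows = []
for i in range(len(s)):                 # the five insertion positions on the slide
    x_word = s[:i] + "0" + s[i:]        # insert the smallest digit
    y_word = s[:i] + "3" + s[i:]        # insert the largest digit (a - 1 = 3)
    rows.append((x_word, weighted_sum(x_word), y_word, weighted_sum(y_word)))
    print(*rows[-1])
```

Under this weighting the X and Y sums for the five positions run from 24 to 42, 26 to 41, 29 to 41, 29 to 38, and 30 to 36, all inside the 23 to 43 range drawn on the number line.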

Page 24: Edit Distance Codes

Insertion/Deletion    Size    Explicit/Existence
Insertion-1                   explicit
Deletion-1                    explicit
Deletion-2                    explicit
Deletion-1                    existence

Page 25: Summary

1. Fuzzy joins for Hamming and edit distance in one-round MR
2. Anchor Points Algorithm
   • Covering codes
   • Flexible parallelism
   • Better communication cost than naive
3. Explicit construction of Hamming distance covering codes
4. Explicit construction of edit distance covering codes

Page 26: Open Questions

Fuzzy Joins in MR
• Minimum communication for a given per-reducer memory for 1-round MR algorithms? The answer is known only for Hamming distance 1.
• How about multi-round MR algorithms?

Covering Codes
• Are there smaller codes?
• Can we construct smaller codes explicitly?
• What is the size of the smallest codes?

Page 27: Related Work

Fuzzy Joins in MR
• Fuzzy Joins Using MapReduce, Afrati et al., ICDE 2012
• Document Similarity Self-Join with MapReduce, Baraglia et al., ICDM 2010
• Efficient Parallel Set-similarity Joins Using MapReduce, Vernica et al., SIGMOD 2010
• Efficient Similarity Joins for Near Duplicate Detection, Xiao et al., WWW 2008

Covering Codes
• Covering codes, Gary Cohen
• On Asymmetric Coverings and Covering Numbers, Applegate et al., Comb. Designs 2003
• Asymmetric Binary Covering Codes, Cooper et al., Comb. Theory 2002

Page 28: Questions?