efficient approximate search on string collections part i marios hadjieleftheriou chen li 1

61
Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Upload: amari-kidd

Post on 02-Apr-2015

221 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Efficient Approximate Search on String CollectionsPart I

Marios Hadjieleftheriou Chen Li

1

Page 2: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

DBLP Author Search

http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 2

Page 3: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Try their names (good luck!)

http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 3

Yannis Papakonstantinou Meral Ozsoyoglu Marios Hadjieleftheriou

UCSD Case Western AT&T--Research

Page 4: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

4

Page 5: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Better system?

5

http://dblp.ics.uci.edu/authors/

Page 6: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

People Search at UC Irvine

6

http://psearch.ics.uci.edu/

Page 7: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Web Search

http://www.google.com/jobs/britney.html

Errors in queries

Errors in data

Bring query and meaningful

results closer togetherActual queries gathered by Google

7

Page 8: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Data Cleaning

R

informix

microsoft

S

infromix

…mcrosoft

8

Page 9: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Problem Formulation

Find strings similar to a given string: dist(Q,D) <= δ

Example: find strings similar to “hadjeleftheriou”

Performance is important!-10 ms: 100 queries per second (QPS)- 5 ms: 200 QPS

9

Page 10: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Outline

Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion

10

Part I

Part II

Page 11: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Preliminaries

11

Next…

Page 12: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Similarity Functions Similar to:

a domain-specific function returns a similarity value between two strings

Examples: Edit distance Hamming distance Jaccard similarity Soundex TF/IDF, BM25, DICE See [KSS06] for an excellent survey

12

Page 13: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

13

A widely used metric to define string similarityEd(s1,s2) = minimum # of operations (insertion,

deletion, substitution) to change s1 to s2Example:

s1: Tom Hanks

s2: Ton Hank

ed(s1,s2) = 2

Edit Distance

13

Page 14: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

State-of-the-art: Oracle 10g and older versions Supported by Oracle Text CREATE TABLE engdict(word VARCHAR(20), len INT); Create preferences for text indexing:

begin ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH'); end; /

CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word ) INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');

Usage:

SELECT * FROM engdict

WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;

Limitation: cannot handle errors in the first letters:

Katherine versus Catherine

14

Page 15: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

15

Microsoft SQL Server [CGG+05]

Data cleaning tools available in SQL Server 2005 Part of Integration Services Supports fuzzy lookups Uses data flow pipeline of transformations Similarity function: tokens with TF/IDF scores

15

Page 16: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Lucene Using Levenshtein Distance (Edit Distance). Example: roam~0.8 Prefix pruning followed by a scan

(Efficiency?)

16

Page 17: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Trie-based approach [JLL+09]

17

Next…

Page 18: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Trie Indexing

e

x

a

m

p

l

$

$

e

m

p

l

a

r

$

t

$

s

a

m

p

l

e

$e

Strings

exam

example

exemplar

exempt

sample

18

Page 19: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Active nodes on Trie

e

x

a

m

p

l

$

$

e

m

p

l

a

r

$

t

$

s

a

m

p

l

e

$e

Prefix Distance

examp 2

exampl 1

example 0

exempl 2

exempla 2

sample 2

Query: “example”

Edit-distance threshold = 2

2

1

0

2

2

2

19

Page 20: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Initialization

e

x

a

m

p

l

$

$

e

m

p

l

a

r

$

t

$

s

a

m

p

l

e

$e

Q = ε 0

1 1

2 2

Prefix DistancePrefix Distance

0

e 1

ex 2

s 1

sa 2

Prefix Distance

ε 0

Initial active nodes: all nodes within depth δ

20

Page 21: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Incremental Algorithm

Return leaf nodes as answers.21

Page 22: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

e

Q = e x a m p l e

e

x

a

m

p

l

$

$

e

m

p

l

a

r

$

t

$

s

a

m

p

l

e

$

Prefix Distance

ε 0

e 1

ex 2

s 1

sa 2

Prefix # Op Base Op

ε 1 ε del e

s 1 ε sub e/s

e 0 ε mat e

ex 1 ε ins x

exa 2 ε Ins xa

exe 2 ε Ins xe

Prefix # Op Base OpPrefix # Op Base Op

ε 1 ε del e

Prefix # Op Base Op

ε 1 ε del e

s 1 ε sub e/s

Prefix # Op Base Op

ε 1 ε del e

s 1 ε sub e/s

e 0 ε mat e

1

10

1

2 2

e 2 e del e

ex 2 e sub e/xex 3 ex del e

exa 3 ex sub e/a

exe 2 ex mat e

s 2 s del e

sa 2 s sub e/asa 3 sa del e

Active nodes for Q

=

ε

Active nodes for Q

=

e

2

22

Page 23: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

23

Advantages: Trie size is small Can do search as the user types

DisadvantagesWorks for edit distance only

Good and bad

23

Page 24: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Gram-based algorithmsList-merging algorithms [LLL08]Variable-length grams (VGRAM)

[LWY07,YWL08]

24

Next…

Page 25: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

“q-grams” of strings

u n i v e r s a l

2-grams

25

Page 26: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Edit operation’s effect on grams

k operations could affect k * q grams

u n i v e r s a lFixed length: q

26

If ed(s1,s2) <= k, then their # of common grams >=

(|s1|- q + 1) – k * q

Page 27: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

q-gram inverted lists

id strings01234

richstickstichstuckstatic

4

2 30

1 4

2-grams

atchckicristtatituuc

201 30 1 2 4

41 2 433 27

Page 28: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

# of common grams >= 3

Searching using inverted lists Query: “shtick”, ED(shtick, ?)≤1

id strings01234

richstickstichstuckstatic

2-grams

atchckicristtatituuc

4

2 30

1 4

201 30 1 2 4

41 2 433

ti ic cksh ht ti ic ck

28

Page 29: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

T-occurrence Problem

Find elements whose occurrences ≥ T

Ascending

order

Ascending

order

MergeMerge

29

Page 30: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Example T = 4

Result: 13

1

3

5

10

13

10

13

15

5

7

13

13 15

30

Page 31: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

List-Merging Algorithms

HeapMerger MergeOpt

[SK04]

[LLL08, BK02]

ScanCount MergeSkip DivideSkip

31

Page 32: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Heap-based Algorithm

Min-heap

Count # of occurrences of each element using a heap

Push to heap ……

32

Page 33: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

MergeOpt Algorithm [SK04]

Long Lists: T-1 Short Lists

Binary

search

33

Page 34: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Example of MergeOpt

1

3

5

10

13

10

13

15

5

7

13

13 15

Count threshold T≥ 4

Long Lists: 3Short Lists: 2

34

Page 35: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

3535

ScanCount

1 2 3

1

3

5

10

13

10

13

15

5

7

13

13 15

Count threshold T≥ 4

# of occurrences# of occurrences

00

00

00

44

11

Increment by 1Increment by 111

String idsString ids

1313

1414

1515

00

22

00

00

Result!Result!

Page 36: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

List-Merging Algorithms

HeapMerger MergeOpt

ScanCount MergeSkip DivideSkip

[SK04]

[LLL08, BK02]

36

Page 37: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

MergeSkip algorithm [BK02, LLL08]

Min-heap ……Pop T-1

T-1

Jump Greater or

equals

Greater or

equals

37

Page 38: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Example of MergeSkip

1

3

5

10

10

15

5

7

13 15

Count threshold T≥ 4

minHeap10

13 15

15

JumpJump

15151515

13131313

17171717

38

Page 39: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

DivideSkip Algorithm [LLL08]

Long Lists Short Lists

Binary

searchMergeSkip

39

Page 40: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

How many lists are treated as long lists?

40

Page 41: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Length Filtering

Ed(s,t) ≤ 2

s: s:

t: t:

Length: 19Length: 19

Length: 10Length: 10

By length only!By length only!

41

Page 42: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Positional Filtering

a b

a b

Ed(s,t) ≤ 2

s

t

(ab,1)

(ab,12)

42

Page 43: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Normalized weights [HKS08]

Compute a weight for each string L0: length of the string

L1, L2: Depend on q-gram frequencies

Similar strings have similar weights A very strong pruning condition

43

Page 44: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Pruning using normalized weights

Sort inverted lists based on string weights Search within a small weight range Shown to be effective (> 90% candidates pruned)

44

Page 45: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Variable-length grams (VGRAM) [LWY07,YWL08]

45

Next…

Page 46: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

# of common grams >= 1

2-grams -> 3-grams? Query: “shtick”, ED(shtick, ?)≤1

id strings01234

richstickstichstuckstatic

3-grams

atiichickricstastistutattictucuck

tic icksht hti tic ick

4

24

1

2010

3413

42

3

id strings01234

richstickstichstuckstatic

id strings01234

richstickstichstuckstatic

46

Page 47: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Observation 1: dilemma of choosing “q”

Increasing “q” causing: Longer grams Shorter lists Smaller # of common grams of similar strings

id strings01234

richstickstichstuckstatic

4

2 30

1 4

2-grams

atchckicristtatituuc

201 30 1 2 4

41 2 433 47

Page 48: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Observation 2: skew distributions of gram frequencies DBLP: 276,699 article titles Popular 5-grams: ation (>114K times), tions, ystem, catio

48

Page 49: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

VGRAM: Main idea

Grams with variable lengths (between qmin and qmax) zebra

- ze(123) corrasion

- co(5213), cor(859), corr(171) Advantages

Reduce index size Reducing running time Adoptable by many algorithms

49

Page 50: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Challenges

Generating variable-length grams? Constructing a high-quality gram dictionary? Relationship between string similarity and their

gram-set similarity? Adopting VGRAM in existing algorithms?

50

Page 51: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Challenge 1: String Variable-length grams?

Fixed-length 2-grams

Variable-length grams

u n i v e r s a l

niivrsalunivers

[2,4]-gram dictionaryu n i v e r s a l

51

Page 52: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Representing gram dictionary as a trie

niivrsalunivers

52

Page 53: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

53

Step 2: Constructing a gram dictionary

qmin=2

qmax=4

Frequency-based [LYW07] Cost-based [YLW08]

Page 54: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Challenge 3: Edit operation’s effect on grams

k operations could affect k * q grams

u n i v e r s a lFixed length: q

54

Page 55: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Deletion affects variable-length grams

i-qmax+1 i+qmax- 1Deletion

Not affected Not affectedAffected

i

55

Page 56: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Main idea

For a string, for each position, compute the number of grams that could be destroyed by an operation at this position

Compute number of grams possibly destroyed by k operations Store these numbers (for all data strings) as part of the index

56

Vector of s = <2,4,6,8,9>

With 2 edit operations, at most 4 grams can be affected

Use this number to do count filtering

Page 57: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Summary of VGRAM index

57

Page 58: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Challenge 4: adopting VGRAM

Easily adoptable by many algorithms

Basic interfaces: String s grams String s1, s2 such that ed(s1,s2) <= k min

# of their common grams

58

Page 59: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Lower bound on # of common grams

If ed(s1,s2) <= k, then their # of common grams >=:

(|s1|- q + 1) – k * q

u n i v e r s a l

Fixed length (q)

Variable lengths: # of grams of s1 – NAG(s1,k)

59

Page 60: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

Example: algorithm using inverted lists

Query: “shtick”, ED(shtick, ?)≤1

1 2 4

1 210 4

3

…ckic…ti…

Lower bound = 3

Lower bound = 1

sh ht tick

2 4

1

4

0

11

2

3

…ckicich…tictick…

2-4 grams2-gramstick

id strings01234

richstickstichstuckstatic

id strings01234

richstickstichstuckstatic

id strings01234

richstickstichstuckstatic

60

Page 61: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1

End of part I

Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion

61

Part I

Part II