efficient approximate search on string collections part i marios hadjieleftheriou chen li 1

Efficient Approximate Search on String CollectionsPart I

Marios Hadjieleftheriou Chen Li

1

http://www.uci.edu/

DBLP Author Search

http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 2

Try their names (good luck!)

http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 3

Yannis Papakonstantinou Meral Ozsoyoglu Marios Hadjieleftheriou

UCSD Case Western AT&T--Research

Better system?

5

http://dblp.ics.uci.edu/authors/

People Search at UC Irvine

6

http://psearch.ics.uci.edu/

Web Search

http://www.google.com/jobs/britney.html

Errors in queries

Errors in data

Bring query and meaningful

results closer togetherActual queries gathered by Google

7

Data Cleaning

R

informix

microsoft

…

…

S

infromix

…mcrosoft

…

8

Problem Formulation

Find strings similar to a given string: dist(Q,D) <= δ

Example: find strings similar to “hadjeleftheriou”

Performance is important!-10 ms: 100 queries per second (QPS)- 5 ms: 200 QPS

9

Outline

Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion

10

Part I

Part II

Preliminaries

11

Next…

Similarity Functions Similar to:

a domain-specific function returns a similarity value between two strings

Examples: Edit distance Hamming distance Jaccard similarity Soundex TF/IDF, BM25, DICE See [KSS06] for an excellent survey

12

13

A widely used metric to define string similarityEd(s1,s2) = minimum # of operations (insertion,

deletion, substitution) to change s1 to s2Example:

s1: Tom Hanks

s2: Ton Hank

ed(s1,s2) = 2

Edit Distance

13

State-of-the-art: Oracle 10g and older versions Supported by Oracle Text CREATE TABLE engdict(word VARCHAR(20), len INT); Create preferences for text indexing:

begin ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH'); end; /

CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word ) INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');

Usage:

SELECT * FROM engdict

WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;

Limitation: cannot handle errors in the first letters:

Katherine versus Catherine

14

15

Microsoft SQL Server [CGG+05]

Data cleaning tools available in SQL Server 2005 Part of Integration Services Supports fuzzy lookups Uses data flow pipeline of transformations Similarity function: tokens with TF/IDF scores

15

Lucene Using Levenshtein Distance (Edit Distance). Example: roam~0.8 Prefix pruning followed by a scan

(Efficiency?)

16

Trie-based approach [JLL+09]

17

Next…

Trie Indexing

e

x

a

m

p

l

$

$

e

m

p

l

a

r

$

t

$

s

a

m

p

l

e

$e

Strings

exam

example

exemplar

exempt

sample

18

Active nodes on Trie

e

x

a

m

p

l

$

$

e

m

p

l

a

r

$

t

$

s

a

m

p

l

e

$e

Prefix Distance

examp 2

exampl 1

example 0

exempl 2

exempla 2

sample 2

Query: “example”

Edit-distance threshold = 2

2

1

0

2

2

2

19

Initialization

e

x

a

m

p

l

$

$

e

m

p

l

a

r

$

t

$

s

a

m

p

l

e

$e

Q = ε 0

1 1

2 2

Prefix DistancePrefix Distance

0

e 1

ex 2

s 1

sa 2

Prefix Distance

ε 0

Initial active nodes: all nodes within depth δ

20

Incremental Algorithm

Return leaf nodes as answers.21

e

Q = e x a m p l e

e

x

a

m

p

l

$

$

e

m

p

l

a

r

$

t

$

s

a

m

p

l

e

$

Prefix Distance

ε 0

e 1

ex 2

s 1

sa 2

Prefix # Op Base Op

ε 1 ε del e

s 1 ε sub e/s

e 0 ε mat e

ex 1 ε ins x

exa 2 ε Ins xa

exe 2 ε Ins xe

Prefix # Op Base OpPrefix # Op Base Op

ε 1 ε del e

Prefix # Op Base Op

ε 1 ε del e

s 1 ε sub e/s

Prefix # Op Base Op

ε 1 ε del e

s 1 ε sub e/s

e 0 ε mat e

1

10

1

2 2

e 2 e del e

ex 2 e sub e/xex 3 ex del e

exa 3 ex sub e/a

exe 2 ex mat e

s 2 s del e

sa 2 s sub e/asa 3 sa del e

Active nodes for Q

=

ε

Active nodes for Q

=

e

2

22

23

Advantages: Trie size is small Can do search as the user types

DisadvantagesWorks for edit distance only

Good and bad

23

Gram-based algorithmsList-merging algorithms [LLL08]Variable-length grams (VGRAM)

[LWY07,YWL08]

24

Next…

“q-grams” of strings

u n i v e r s a l

2-grams

25

Edit operation’s effect on grams

k operations could affect k * q grams

u n i v e r s a lFixed length: q

26

If ed(s1,s2) <= k, then their # of common grams >=

(|s1|- q + 1) – k * q

q-gram inverted lists

id strings01234

richstickstichstuckstatic

4

2 30

1 4

2-grams

atchckicristtatituuc

201 30 1 2 4

41 2 433 27

# of common grams >= 3

Searching using inverted lists Query: “shtick”, ED(shtick, ?)≤1

id strings01234


2-grams


4

2 30

1 4

201 30 1 2 4

41 2 433

ti ic cksh ht ti ic ck

28

T-occurrence Problem

Find elements whose occurrences ≥ T

Ascending

order

Ascending

order

MergeMerge

29

Example T = 4

Result: 13

1

3

5

10

13

10

13

15

5

7

13

13 15

30

List-Merging Algorithms

HeapMerger MergeOpt

[SK04]

[LLL08, BK02]

ScanCount MergeSkip DivideSkip

31

Heap-based Algorithm

Min-heap

Count # of occurrences of each element using a heap

Push to heap ……

32

MergeOpt Algorithm [SK04]

Long Lists: T-1 Short Lists

Binary

search

33

Example of MergeOpt

1

3

5

10

13

10

13

15

5

7

13

13 15

Count threshold T≥ 4

Long Lists: 3Short Lists: 2

34

3535

ScanCount

1 2 3

…

1

3

5

10

13

10

13

15

5

7

13

13 15


# of occurrences# of occurrences

00

00

00

44

11

Increment by 1Increment by 111

String idsString ids

1313

1414

1515

00

22

00

00

Result!Result!

List-Merging Algorithms

HeapMerger MergeOpt

ScanCount MergeSkip DivideSkip

[SK04]

[LLL08, BK02]

36

MergeSkip algorithm [BK02, LLL08]

Min-heap ……Pop T-1

T-1

Jump Greater or

equals

Greater or

equals

37

Example of MergeSkip

1

3

5

10

10

15

5

7

13 15


minHeap10

13 15

15

JumpJump

15151515

13131313

17171717

38

DivideSkip Algorithm [LLL08]

Long Lists Short Lists

Binary

searchMergeSkip

39

How many lists are treated as long lists?

40

Length Filtering

Ed(s,t) ≤ 2

s: s:

t: t:

Length: 19Length: 19

Length: 10Length: 10

By length only!By length only!

41

Positional Filtering

a b

a b

Ed(s,t) ≤ 2

s

t

(ab,1)

(ab,12)

42

Normalized weights [HKS08]

Compute a weight for each string L0: length of the string

L1, L2: Depend on q-gram frequencies

Similar strings have similar weights A very strong pruning condition

43

Pruning using normalized weights

Sort inverted lists based on string weights Search within a small weight range Shown to be effective (> 90% candidates pruned)

44

Variable-length grams (VGRAM) [LWY07,YWL08]

45

Next…

# of common grams >= 1

2-grams -> 3-grams? Query: “shtick”, ED(shtick, ?)≤1

id strings01234


3-grams

atiichickricstastistutattictucuck

tic icksht hti tic ick

4

24

1

2010

3413

42

3

id strings01234


id strings01234


46

Observation 1: dilemma of choosing “q”

Increasing “q” causing: Longer grams Shorter lists Smaller # of common grams of similar strings

id strings01234


4

2 30

1 4

2-grams


201 30 1 2 4

41 2 433 47

Observation 2: skew distributions of gram frequencies DBLP: 276,699 article titles Popular 5-grams: ation (>114K times), tions, ystem, catio

48

VGRAM: Main idea

Grams with variable lengths (between qmin and qmax) zebra

- ze(123) corrasion

- co(5213), cor(859), corr(171) Advantages

Reduce index size Reducing running time Adoptable by many algorithms

49

Challenges

Generating variable-length grams? Constructing a high-quality gram dictionary? Relationship between string similarity and their

gram-set similarity? Adopting VGRAM in existing algorithms?

50

Challenge 1: String Variable-length grams?

Fixed-length 2-grams

Variable-length grams

u n i v e r s a l

niivrsalunivers

[2,4]-gram dictionaryu n i v e r s a l

51

Representing gram dictionary as a trie

niivrsalunivers

52

53

Step 2: Constructing a gram dictionary

qmin=2

qmax=4

Frequency-based [LYW07] Cost-based [YLW08]

Challenge 3: Edit operation’s effect on grams

k operations could affect k * q grams

u n i v e r s a lFixed length: q

54

Deletion affects variable-length grams

i-qmax+1 i+qmax- 1Deletion

Not affected Not affectedAffected

i

55

Main idea

For a string, for each position, compute the number of grams that could be destroyed by an operation at this position

Compute number of grams possibly destroyed by k operations Store these numbers (for all data strings) as part of the index

56

Vector of s = <2,4,6,8,9>

With 2 edit operations, at most 4 grams can be affected

Use this number to do count filtering

Summary of VGRAM index

57

Challenge 4: adopting VGRAM

Easily adoptable by many algorithms

Basic interfaces: String s grams String s1, s2 such that ed(s1,s2) <= k min

# of their common grams

58

Lower bound on # of common grams

If ed(s1,s2) <= k, then their # of common grams >=:

(|s1|- q + 1) – k * q

u n i v e r s a l

Fixed length (q)

Variable lengths: # of grams of s1 – NAG(s1,k)

59

Example: algorithm using inverted lists

Query: “shtick”, ED(shtick, ?)≤1

1 2 4

1 210 4

3

…ckic…ti…

Lower bound = 3

Lower bound = 1

sh ht tick

2 4

1

4

0

11

2

3

…ckicich…tictick…

2-4 grams2-gramstick

id strings01234


id strings01234


id strings01234


60

End of part I

Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion

61

Part I

Part II

efficient approximate search on string collections part i marios hadjieleftheriou chen li 1

Documents

q slide

research slide

eduauthors slide

e active nodes

trie e x

initialization e x

e s1sub es prefix

trie indexing e x