efficient approximate search on string collections part i marios hadjieleftheriou chen li 1
TRANSCRIPT
Efficient Approximate Search on String CollectionsPart I
Marios Hadjieleftheriou Chen Li
1
DBLP Author Search
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 2
Try their names (good luck!)
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 3
Yannis Papakonstantinou Meral Ozsoyoglu Marios Hadjieleftheriou
UCSD Case Western AT&T--Research
4
Better system?
5
http://dblp.ics.uci.edu/authors/
People Search at UC Irvine
6
http://psearch.ics.uci.edu/
Web Search
http://www.google.com/jobs/britney.html
Errors in queries
Errors in data
Bring query and meaningful
results closer togetherActual queries gathered by Google
7
Data Cleaning
R
informix
microsoft
…
…
S
infromix
…mcrosoft
…
8
Problem Formulation
Find strings similar to a given string: dist(Q,D) <= δ
Example: find strings similar to “hadjeleftheriou”
Performance is important!-10 ms: 100 queries per second (QPS)- 5 ms: 200 QPS
9
Outline
Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion
10
Part I
Part II
Preliminaries
11
Next…
Similarity Functions Similar to:
a domain-specific function returns a similarity value between two strings
Examples: Edit distance Hamming distance Jaccard similarity Soundex TF/IDF, BM25, DICE See [KSS06] for an excellent survey
12
13
A widely used metric to define string similarityEd(s1,s2) = minimum # of operations (insertion,
deletion, substitution) to change s1 to s2Example:
s1: Tom Hanks
s2: Ton Hank
ed(s1,s2) = 2
Edit Distance
13
State-of-the-art: Oracle 10g and older versions Supported by Oracle Text CREATE TABLE engdict(word VARCHAR(20), len INT); Create preferences for text indexing:
begin ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH'); end; /
CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word ) INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');
Usage:
SELECT * FROM engdict
WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
Limitation: cannot handle errors in the first letters:
Katherine versus Catherine
14
15
Microsoft SQL Server [CGG+05]
Data cleaning tools available in SQL Server 2005 Part of Integration Services Supports fuzzy lookups Uses data flow pipeline of transformations Similarity function: tokens with TF/IDF scores
15
Lucene Using Levenshtein Distance (Edit Distance). Example: roam~0.8 Prefix pruning followed by a scan
(Efficiency?)
16
Trie-based approach [JLL+09]
17
Next…
Trie Indexing
e
x
a
m
p
l
$
$
e
m
p
l
a
r
$
t
$
s
a
m
p
l
e
$e
Strings
exam
example
exemplar
exempt
sample
18
Active nodes on Trie
e
x
a
m
p
l
$
$
e
m
p
l
a
r
$
t
$
s
a
m
p
l
e
$e
Prefix Distance
examp 2
exampl 1
example 0
exempl 2
exempla 2
sample 2
Query: “example”
Edit-distance threshold = 2
2
1
0
2
2
2
19
Initialization
e
x
a
m
p
l
$
$
e
m
p
l
a
r
$
t
$
s
a
m
p
l
e
$e
Q = ε 0
1 1
2 2
Prefix DistancePrefix Distance
0
e 1
ex 2
s 1
sa 2
Prefix Distance
ε 0
Initial active nodes: all nodes within depth δ
20
Incremental Algorithm
Return leaf nodes as answers.21
e
Q = e x a m p l e
e
x
a
m
p
l
$
$
e
m
p
l
a
r
$
t
$
s
a
m
p
l
e
$
Prefix Distance
ε 0
e 1
ex 2
s 1
sa 2
Prefix # Op Base Op
ε 1 ε del e
s 1 ε sub e/s
e 0 ε mat e
ex 1 ε ins x
exa 2 ε Ins xa
exe 2 ε Ins xe
Prefix # Op Base OpPrefix # Op Base Op
ε 1 ε del e
Prefix # Op Base Op
ε 1 ε del e
s 1 ε sub e/s
Prefix # Op Base Op
ε 1 ε del e
s 1 ε sub e/s
e 0 ε mat e
1
10
1
2 2
e 2 e del e
ex 2 e sub e/xex 3 ex del e
exa 3 ex sub e/a
exe 2 ex mat e
s 2 s del e
sa 2 s sub e/asa 3 sa del e
Active nodes for Q
=
ε
Active nodes for Q
=
e
2
22
23
Advantages: Trie size is small Can do search as the user types
DisadvantagesWorks for edit distance only
Good and bad
23
Gram-based algorithmsList-merging algorithms [LLL08]Variable-length grams (VGRAM)
[LWY07,YWL08]
24
Next…
“q-grams” of strings
u n i v e r s a l
2-grams
25
Edit operation’s effect on grams
k operations could affect k * q grams
u n i v e r s a lFixed length: q
26
If ed(s1,s2) <= k, then their # of common grams >=
(|s1|- q + 1) – k * q
q-gram inverted lists
id strings01234
richstickstichstuckstatic
4
2 30
1 4
2-grams
atchckicristtatituuc
201 30 1 2 4
41 2 433 27
# of common grams >= 3
Searching using inverted lists Query: “shtick”, ED(shtick, ?)≤1
id strings01234
richstickstichstuckstatic
2-grams
atchckicristtatituuc
4
2 30
1 4
201 30 1 2 4
41 2 433
ti ic cksh ht ti ic ck
28
T-occurrence Problem
Find elements whose occurrences ≥ T
Ascending
order
Ascending
order
MergeMerge
29
Example T = 4
Result: 13
1
3
5
10
13
10
13
15
5
7
13
13 15
30
List-Merging Algorithms
HeapMerger MergeOpt
[SK04]
[LLL08, BK02]
ScanCount MergeSkip DivideSkip
31
Heap-based Algorithm
Min-heap
Count # of occurrences of each element using a heap
Push to heap ……
32
MergeOpt Algorithm [SK04]
Long Lists: T-1 Short Lists
Binary
search
33
Example of MergeOpt
1
3
5
10
13
10
13
15
5
7
13
13 15
Count threshold T≥ 4
Long Lists: 3Short Lists: 2
34
3535
ScanCount
1 2 3
…
1
3
5
10
13
10
13
15
5
7
13
13 15
Count threshold T≥ 4
# of occurrences# of occurrences
00
00
00
44
11
Increment by 1Increment by 111
String idsString ids
1313
1414
1515
00
22
00
00
Result!Result!
List-Merging Algorithms
HeapMerger MergeOpt
ScanCount MergeSkip DivideSkip
[SK04]
[LLL08, BK02]
36
MergeSkip algorithm [BK02, LLL08]
Min-heap ……Pop T-1
T-1
Jump Greater or
equals
Greater or
equals
37
Example of MergeSkip
1
3
5
10
10
15
5
7
13 15
Count threshold T≥ 4
minHeap10
13 15
15
JumpJump
15151515
13131313
17171717
38
DivideSkip Algorithm [LLL08]
Long Lists Short Lists
Binary
searchMergeSkip
39
How many lists are treated as long lists?
40
Length Filtering
Ed(s,t) ≤ 2
s: s:
t: t:
Length: 19Length: 19
Length: 10Length: 10
By length only!By length only!
41
Positional Filtering
a b
a b
Ed(s,t) ≤ 2
s
t
(ab,1)
(ab,12)
42
Normalized weights [HKS08]
Compute a weight for each string L0: length of the string
L1, L2: Depend on q-gram frequencies
Similar strings have similar weights A very strong pruning condition
43
Pruning using normalized weights
Sort inverted lists based on string weights Search within a small weight range Shown to be effective (> 90% candidates pruned)
44
Variable-length grams (VGRAM) [LWY07,YWL08]
45
Next…
# of common grams >= 1
2-grams -> 3-grams? Query: “shtick”, ED(shtick, ?)≤1
id strings01234
richstickstichstuckstatic
3-grams
atiichickricstastistutattictucuck
tic icksht hti tic ick
4
24
1
2010
3413
42
3
id strings01234
richstickstichstuckstatic
id strings01234
richstickstichstuckstatic
46
Observation 1: dilemma of choosing “q”
Increasing “q” causing: Longer grams Shorter lists Smaller # of common grams of similar strings
id strings01234
richstickstichstuckstatic
4
2 30
1 4
2-grams
atchckicristtatituuc
201 30 1 2 4
41 2 433 47
Observation 2: skew distributions of gram frequencies DBLP: 276,699 article titles Popular 5-grams: ation (>114K times), tions, ystem, catio
48
VGRAM: Main idea
Grams with variable lengths (between qmin and qmax) zebra
- ze(123) corrasion
- co(5213), cor(859), corr(171) Advantages
Reduce index size Reducing running time Adoptable by many algorithms
49
Challenges
Generating variable-length grams? Constructing a high-quality gram dictionary? Relationship between string similarity and their
gram-set similarity? Adopting VGRAM in existing algorithms?
50
Challenge 1: String Variable-length grams?
Fixed-length 2-grams
Variable-length grams
u n i v e r s a l
niivrsalunivers
[2,4]-gram dictionaryu n i v e r s a l
51
Representing gram dictionary as a trie
niivrsalunivers
52
53
Step 2: Constructing a gram dictionary
qmin=2
qmax=4
Frequency-based [LYW07] Cost-based [YLW08]
Challenge 3: Edit operation’s effect on grams
k operations could affect k * q grams
u n i v e r s a lFixed length: q
54
Deletion affects variable-length grams
i-qmax+1 i+qmax- 1Deletion
Not affected Not affectedAffected
i
55
Main idea
For a string, for each position, compute the number of grams that could be destroyed by an operation at this position
Compute number of grams possibly destroyed by k operations Store these numbers (for all data strings) as part of the index
56
Vector of s = <2,4,6,8,9>
With 2 edit operations, at most 4 grams can be affected
Use this number to do count filtering
Summary of VGRAM index
57
Challenge 4: adopting VGRAM
Easily adoptable by many algorithms
Basic interfaces: String s grams String s1, s2 such that ed(s1,s2) <= k min
# of their common grams
58
Lower bound on # of common grams
If ed(s1,s2) <= k, then their # of common grams >=:
(|s1|- q + 1) – k * q
u n i v e r s a l
Fixed length (q)
Variable lengths: # of grams of s1 – NAG(s1,k)
59
Example: algorithm using inverted lists
Query: “shtick”, ED(shtick, ?)≤1
1 2 4
1 210 4
3
…ckic…ti…
Lower bound = 3
Lower bound = 1
sh ht tick
2 4
1
4
0
11
2
3
…ckicich…tictick…
2-4 grams2-gramstick
id strings01234
richstickstichstuckstatic
id strings01234
richstickstichstuckstatic
id strings01234
richstickstichstuckstatic
60
End of part I
Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion
61
Part I
Part II