Efficient Approximate Search on String Collections Part II Marios Hadjieleftheriou Chen Li

Post on 06-Jan-2018

Page 1: Efficient Approximate Search on String Collections, Part II

Marios Hadjieleftheriou, Chen Li

Page 2: Outline

- Motivation and preliminaries
- Inverted list based algorithms
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions

2/68

Page 3: N-Gram Signatures

- Use string signatures that upper bound similarity
- Use signatures as a filtering step
- Properties:
  - Signature has to have small size
  - Signature verification must be fast
  - False positives/False negatives
  - Signatures have to be “indexable”

3/68

Page 4: Known signatures

- Minhash
  - Jaccard, Edit distance
- Prefix filter (CGK06)
  - Jaccard, Edit distance
- PartEnum (AGK06)
  - Hamming, Jaccard, Edit distance
- LSH (GIM99)
  - Jaccard, Edit distance
- Mismatch filter (XWL08)
  - Edit distance

4/68

Page 5: Prefix Filter

- Bit vectors:
  [Figure: bit vectors of q and s over a universe of 14 n-grams]
- Mismatch vector:
  - s: matches 6, missing 2, extra 2
- If |s∩q| ≥ 6 then ∀ s’ ⊆ s s.t. |s’| ≥ 3, |s’∩q| ≥ 1
- For at least k matches, |s’| = l - k + 1

5/68

Page 6: Using Prefixes

- Take a random permutation of n-gram universe:
  [Figure: q and s laid out over the permuted universe of 14 n-grams]
- Take prefixes from both sets:
  - |s’|=|q’|=3, if |s∩q| ≥ 6 then s’∩q’ ≠ ∅

6/68
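The prefix test on the two slides above can be sketched as follows. This is a minimal illustration, assuming n-grams are represented as integer ids in plain Python sets; the names `prefix` and `may_share_k`, the seeded permutation, and the example sets are hypothetical and not taken from the slides.

```python
import random

def prefix(tokens, k, rank):
    """First len(tokens) - k + 1 tokens of a set under the global order `rank`."""
    ordered = sorted(tokens, key=lambda t: rank[t])
    return set(ordered[: max(len(ordered) - k + 1, 0)])

def may_share_k(s, q, k, rank):
    """Necessary condition for |s ∩ q| >= k: the two prefixes intersect."""
    return bool(prefix(s, k, rank) & prefix(q, k, rank))

# Random permutation of a 14-element n-gram universe (seed fixed for repeatability).
universe = list(range(1, 15))
order = universe[:]
random.Random(0).shuffle(order)
rank = {t: i for i, t in enumerate(order)}

q = {1, 2, 3, 4, 5, 6, 7, 8}
s = {1, 2, 3, 4, 5, 6, 9, 10}       # 6 matches, 2 missing, 2 extra
print(sorted(prefix(s, 6, rank)))    # a prefix of size 8 - 6 + 1 = 3
print(may_share_k(s, q, 6, rank))
```

The filter never rejects a pair that really shares k n-grams, but it can let through pairs that do not, so surviving candidates still need verification.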

Page 7: Prefix Filter for Weighted Sets

- For example:
  [Figure: sets s and q over n-grams t1, t2, t4, t6, t8, t11, t14]
- Order n-grams by weight (new coordinate space)
- Query: w(q∩s) = Σi∈q∩s wi ≥ τ
- Keep prefix s’ s.t. w(s’) ≥ w(s) - α
- Best case: w(q/q’ ∩ s/s’) = α
- Hence, we need w(q’∩s’) ≥ τ - α

[Figure: weight vectors of q and s over the ordered weights w1, w2, …, w14; s is split into a prefix s’ of weight w(s)-α and a suffix s/s’ of weight α]

7/68

Page 8: Prefix Filter Properties

- The larger we make α, the smaller the prefix
- The larger we make α, the smaller the range of thresholds we can support:
  - Because τ ≥ α, otherwise τ-α is negative.
  - We need to pre-specify minimum τ
- Can apply to Jaccard, Edit Distance, IDF

8/68
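A weighted prefix can be built by dropping the lightest n-grams while the dropped weight stays within α. The sketch below is only an illustration under stated assumptions: integer weights, a global order of "heaviest first, ties broken by token", and the hypothetical names `weighted_prefix` and `is_candidate`; the weights and sets are made up.

```python
def set_weight(tokens, weight):
    return sum(weight[t] for t in tokens)

def weighted_prefix(tokens, weight, alpha):
    """Keep a prefix s' with w(s') >= w(s) - alpha under a global weight order."""
    # Heaviest first; the tie-break on the token makes the order identical for all sets.
    ordered = sorted(tokens, key=lambda t: (-weight[t], t))
    dropped = 0
    while ordered and dropped + weight[ordered[-1]] <= alpha:
        dropped += weight[ordered.pop()]
    return set(ordered)

def is_candidate(s, q, weight, tau, alpha):
    """Necessary condition for w(q ∩ s) >= tau: w(q' ∩ s') >= tau - alpha."""
    shared = weighted_prefix(s, weight, alpha) & weighted_prefix(q, weight, alpha)
    return set_weight(shared, weight) >= tau - alpha

# Hypothetical weights for the n-grams named on the slide.
weight = {"t1": 9, "t2": 7, "t4": 5, "t6": 4, "t8": 3, "t11": 2, "t14": 1}
q = {"t1", "t2", "t6", "t11", "t14"}
s = {"t1", "t2", "t4", "t8", "t14"}
print(sorted(weighted_prefix(s, weight, alpha=3)))
print(is_candidate(s, q, weight, tau=15, alpha=3))
```

Because every set is cut along the same global order, the shared n-grams missing from both prefixes all lie in the suffix of one of the two sets, which is why at most α of the shared weight can be lost.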

Page 9: Other Signatures

- Minhash (still to come)
- PartEnum:
  - Upper bounds Hamming
  - Select multiple subsets instead of one prefix
  - Larger signature, but stronger guarantee
- LSH:
  - Probabilistic with guarantees
  - Based on hashing
- Mismatch filter:
  - Use positional mismatching n-grams within the prefix to attain lower bound of Edit Distance

9/68

Page 10: Signature Indexing

- Straightforward solution:
  - Create an inverted index on signature n-grams
  - Merge inverted lists to compute signature intersections
- For a given string q:
  - Access only lists in q’
  - Find strings s with w(q’ ∩ s’) ≥ τ - α

10/68

Page 11: The Inverted Signature Hashtable (CCGX08)

- Maintain a signature vector for every n-gram
- Consider prefix signatures for simplicity:
  - s’1={‘tt ’, ‘t L’}, s’2={‘t&t’, ‘t L’}, s’3=…
  - Co-occurrence lists:
    - ‘t L’: ‘tt ’ ‘t&t’ …
    - ‘&tt’: ‘t L’ …
- Hash all n-grams (h: n-gram → [0, m])
- Convert co-occurrence lists to bit-vectors of size m

11/68

Page 12: Example

- Signatures:
  - s’1 = {‘at&’, ‘la’}, s’2 = {‘t&t’, ‘at&’}, s’3 = {‘t L’, ‘at&’}, s’4 = {‘abo’, ‘t&t’}, s’5 = {‘t&t’, ‘la’}, …
- Hash:
  - ‘lab’ → 5, ‘at&’ → 4, ‘t&t’ → 5, ‘t L’ → 1, ‘la’ → 0, …
- Hashtable:
  - ‘at&’ → 100011
  - ‘t&t’ → 010101
  - ‘lab’, ‘t L’, ‘la’, … → …

12/68

Page 13: Using the Hashtable?

- Let list ‘at&’ correspond to bit-vector 100011
  - There exists string s s.t. ‘at&’ ∈ s’ and s’ also contains some n-grams that hash to 0, 1, or 5
- Given query q:
  - Construct query signature matrix:

    q’ \ q   at&   lab   t&t   res   …
    at&      1     1     1     0
    lab      1     1     0     1

  - Consider only solid sub-matrices P: r ⊆ q’, p ⊆ q
  - We need to look only at r ⊆ q’ such that w(r) ≥ τ-α and w(p) ≥ τ

13/68

Page 14: Verification

- How do we find which strings correspond to a given sub-matrix?
  - Create an inverted index on string n-grams
  - Examine only lists in r and strings with w(s) ≥ τ
    - Remember that r ⊆ q’
- Can be used with other signatures as well

14/68

Page 15: Outline

- Motivation and preliminaries
- Inverted list based algorithms
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions

15/68

Page 16: Length Normalized Measures

- What is normalization?
  - Normalize similarity scores by the length of the strings.
    - Can result in more meaningful matches.
  - Can use L0 (i.e., the length of the string), L1, L2, etc.
  - For example L2:
    - Let w2(s) = Σt∈s w(t)²
    - Weight can be IDF, unary, language model, etc.
    - ||s||2 = w2(s)^(1/2)

16/68

Page 17: The L2-Length Filter (HCKS08)

- Why L2?
  - For almost exact matches.
  - Two strings match only if:
    - They have very similar n-gram sets, and hence L2 lengths
    - The “extra” n-grams have truly insignificant weights in aggregate (hence, resulting in similar L2 lengths).

17/68

Page 18: Example

- “AT&T Labs – Research” L2=100
- “ATT Labs – Research” L2=95
- “AT&T Labs” L2=70
  - If “Research” happened to be very popular and had small weight?
- “The Dark Knight” L2=75
- “Dark Night” L2=72

18/68

Page 19: Why L2 (continued)

- Tight L2-based length filtering will result in very efficient pruning.
- L2 yields scores bounded within [0, 1]:
  - 1 means a truly perfect match.
  - Easier to interpret scores.
  - L0 and L1 do not have the same properties
    - Scores are bounded only by the largest string length in the database.
    - For L0 an exact match can have score smaller than a non-exact match!

19/68

Page 20: Example

- q={‘ATT’, ‘TT ’, ‘T L’, ‘LAB’, ‘ABS’} L0=5
- s1={‘ATT’} L0=1
- s2=q L0=5
- S(q, s1) = Σw(q∩s1)/(||q||0 ||s1||0) = 10/5 = 2
- S(q, s2) = Σw(q∩s2)/(||q||0 ||s2||0) = 40/25 < 2

20/68

Page 21: Problems

- L2 normalization poses challenges.
  - For example:
    - S(q, s) = w2(q∩s)/(||q||2 ||s||2)
    - Prefix filter cannot be applied.
    - Minimum prefix weight α?
      - Value depends both on ||s||2 and ||q||2.
      - But ||q||2 is unknown at index construction time

21/68

Page 22: Important L2 Properties

- Length filtering:
  - For S(q, s) ≥ τ: τ ||q||2 ≤ ||s||2 ≤ ||q||2 / τ
  - We are only looking for strings within these lengths.
  - Proof in paper
- Monotonicity …

22/68
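The L2 definitions and the length-filtering bound can be checked numerically. The sketch below is an illustration only: n-gram sets are Python sets, the weights are invented, and the function names `l2_length`, `l2_similarity` and `length_bounds` are hypothetical.

```python
import math

def l2_length(tokens, weight):
    """||s||2 = sqrt(sum of squared n-gram weights)."""
    return math.sqrt(sum(weight[t] ** 2 for t in tokens))

def l2_similarity(q, s, weight):
    """S(q, s) = w2(q ∩ s) / (||q||2 ||s||2), always within [0, 1]."""
    shared = sum(weight[t] ** 2 for t in q & s)
    return shared / (l2_length(q, weight) * l2_length(s, weight))

def length_bounds(q, weight, tau):
    """Only strings with tau*||q||2 <= ||s||2 <= ||q||2/tau can reach S(q, s) >= tau."""
    lq = l2_length(q, weight)
    return tau * lq, lq / tau

# Made-up weights for a handful of 3-grams.
weight = {"ATT": 2.0, "TT ": 1.5, "T L": 1.0, "LAB": 3.0, "ABS": 2.5,
          "AT&": 4.0, "T&T": 4.0, "&T ": 3.5}
q = {"ATT", "TT ", "T L", "LAB", "ABS"}
s = {"AT&", "T&T", "&T ", "T L", "LAB", "ABS"}
print(round(l2_similarity(q, s, weight), 3))
print(length_bounds(q, weight, tau=0.8))
```

The bound follows because w2(q∩s) can be no larger than either ||q||2² or ||s||2², so S(q, s) is at most both ||s||2/||q||2 and ||q||2/||s||2.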

Page 23: Monotonicity

- Let s={t1, t2, …, tm}.
- Let pw(s, t)=w(t) / ||s||2 (partial weight of s)
- Then: S(q, s) = Σt∈q∩s w(t)² / (||q||2 ||s||2) = Σt∈q∩s pw(s, t) pw(q, t)
- If pw(s, t) > pw(r, t):
  - w(t)/||s||2 > w(t)/||r||2 ⇒ ||s||2 < ||r||2
- Hence, for any t’ ≠ t:
  - w(t’)/||s||2 > w(t’)/||r||2 ⇒ pw(s, t’) > pw(r, t’)

23/68

Page 24: Indexing

- Use inverted lists sorted by pw():

  id   string
  0    rich
  1    stick
  2    stich
  3    stuck
  4    static

  2-grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc

  [Figure: one inverted list of string ids per 2-gram, each sorted by pw()]

- pw(0, ic) > pw(4, ic) > pw(1, ic) > pw(2, ic)
- ||0||2 < ||4||2 < ||1||2 < ||2||2

24/68

Page 25: L2 Length Filter

- Given q and τ, and using length filtering:

  [Figure: the inverted lists for 2-grams at, ch, ck, ic, ri, st, ta, ti, tu, uc]

- We examine only a small fraction of the lists

25/68
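Sorting each inverted list by L2 length (equivalently, by descending partial weight) lets the length filter become a binary search for a window in every list. The following sketch shows the idea on the five strings from the indexing slide; the IDF-style weight and the names `build_index` and `candidates` are assumptions made for this example, not the scheme from the paper.

```python
import bisect
import math
from collections import defaultdict

def bigrams(text):
    return {text[i:i + 2] for i in range(len(text) - 1)}

def build_index(strings):
    df = defaultdict(int)
    for text in strings:
        for g in bigrams(text):
            df[g] += 1
    n = len(strings)
    # Hypothetical IDF-style weight; any positive weighting would do here.
    weight = lambda g: math.log(1 + n / (1 + df.get(g, 0)))
    index = defaultdict(list)
    for sid, text in enumerate(strings):
        grams = bigrams(text)
        length = math.sqrt(sum(weight(g) ** 2 for g in grams))
        for g in grams:
            index[g].append((length, sid))
    for postings in index.values():
        postings.sort()  # ascending L2 length == descending pw(s, g)
    return index, weight

def candidates(query, index, weight, tau):
    """Ids found inside the length window [tau*||q||2, ||q||2/tau] of each query list."""
    grams = bigrams(query)
    lq = math.sqrt(sum(weight(g) ** 2 for g in grams))
    lo, hi = tau * lq, lq / tau
    found = set()
    for g in grams:
        postings = index.get(g, [])
        start = bisect.bisect_left(postings, (lo, -1))
        stop = bisect.bisect_right(postings, (hi, math.inf))
        found.update(sid for _, sid in postings[start:stop])
    return found

strings = ["rich", "stick", "stich", "stuck", "static"]
index, weight = build_index(strings)
print(sorted(candidates("stick", index, weight, tau=0.8)))
```

The candidates still have to be scored; the window only guarantees that no string with S(q, s) ≥ τ is skipped.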

Page 26: Monotonicity

[Figure: the inverted lists for 2-grams at, ch, ck, ic, ri, st, ta, ti, tu, uc]

- If I have seen 1 already, then 4 is not in the list:

26/68

Page 27: Other Improvements

- Use properties of weighting scheme
  - Scan high weight lists first
  - Prune according to string length and maximum potential score
  - Ignore low weight lists altogether

27/68

Page 28: Conclusion

- Concepts can be extended easily for:
  - BM25
  - Weighted Jaccard
  - DICE
  - IDF
- Take away message:
  - Properties of similarity/distance function can play big role in designing very fast indexes.
  - L2 super fast for almost exact matches

28/68

Page 29: Outline

- Motivation and preliminaries
- Inverted list based algorithms
- Gram signature algorithms
- Length-normalized measures
- Selectivity estimation
- Conclusion and future directions

29/68

Page 30: The Problem

- Estimate the number of strings with:
  - Edit distance smaller than k from query q
  - Cosine similarity higher than τ to query q
  - Jaccard, Hamming, etc…
- Issues:
  - Estimation accuracy
  - Size of estimator
  - Cost of estimation

30/68

Page 31: Motivation

- Query optimization:
  - Selectivity of query predicates
  - Need to support selectivity of approximate string predicates
- Visualization/Querying:
  - Expected result set size helps with visualization
  - Result set size important for remote query processing

31/68

Page 32: Flavors

- Edit distance:
  - Based on clustering (JL05)
  - Based on min-hash (MBKS07)
  - Based on wild-card n-grams (LNS07)
- Cosine similarity:
  - Based on sampling (HYKS08)

32/68

Page 33: Selectivity Estimation for Edit Distance

- Problem:
  - Given query string q
  - Estimate number of strings s ∈ D
  - Such that ed(q, s) ≤ δ

33/68

Page 34: Sepia - Clustering (JL05, JLV08)

- Partition strings using clustering:
  - Enables pruning of whole clusters
- Store per cluster histograms:
  - Number of strings within edit distance 0,1,…,δ from the cluster center
- Compute global dataset statistics:
  - Use a training query set to compute frequency of strings within edit distance 0,1,…,δ from each query

34/68

Page 35: Edit Vectors

- Edit distance is not discriminative:
  - Use Edit Vectors
  - 3D space vs 1D space

[Figure: cluster Ci with center pi, query q, and the strings Lucia, Luciano, Lucas and Lukas; edges are labeled with edit distances (2, 2, 3) and edit vectors (<1,1,1>, <2,0,0>, <1,1,0>)]

35/68

Page 36: Visually...

[Figure: clusters C1, C2, …, Cn with centers p1, p2, …, pn, each with its own frequency table]

F1:
  Edit Vector   #
  <0, 0, 0>     4
  <0, 0, 1>     12
  <1, 0, 2>     7

F2:
  Edit Vector   #
  <0, 0, 0>     3
  <0, 1, 0>     40
  <1, 0, 1>     6

Fn:
  Edit Vector   #
  <0, 0, 0>     2
  <1, 0, 2>     84
  <1, 1, 1>     1

Global Table:
  v(q,pi)     v(pi,s)     ed(q,s)   #    %
  <1, 0, 1>   <0, 0, 1>   1         1    14
  <1, 0, 1>   <0, 0, 1>   2         4    57
  <1, 0, 1>   <0, 0, 1>   3         7    100
  <1, 1, 0>   <1, 0, 2>   3         21   25
  <1, 1, 0>   <1, 0, 2>   4         63   75
  <1, 1, 0>   <1, 0, 2>   5         84   100
  …           …

36/68

Page 37: Selectivity Estimation

- Use triangle inequality:
  - Compute edit vector v(q,pi) for all clusters i
  - If |v(q,pi)| > ri+δ disregard cluster Ci

[Figure: query q with radius δ next to cluster center pi with radius ri]

37/68

Page 38: Selectivity Estimation

- Use triangle inequality:
  - Compute edit vector v(q,pi) for all clusters i
  - If |v(q,pi)| > ri+δ disregard cluster Ci
- For all entries in frequency table:
  - If |v(q,pi)| + |v(pi,s)| ≤ δ then ed(q,s) ≤ δ for all s
  - If ||v(q,pi)| - |v(pi,s)|| > δ ignore these strings
  - Else use global table:
    - Lookup entry <v(q,pi), v(pi,s), δ> in global table
    - Use the estimated fraction of strings

38/68
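The three triangle-inequality cases can be sketched in a few lines. This is a simplified illustration, not the Sepia estimator itself: it keeps a per-cluster histogram over plain edit distances rather than edit vectors, and instead of consulting a global table it only reports how many strings are certain matches and how many remain undecided. The cluster contents and the names `summarize` and `bounds` are hypothetical.

```python
from collections import Counter

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def summarize(center, members):
    """Per-cluster histogram: distance from the center -> number of strings."""
    hist = Counter(edit_distance(center, s) for s in members)
    return {"center": center, "radius": max(hist), "hist": hist}

def bounds(query, clusters, delta):
    """Lower and upper bound on the number of strings with ed(query, s) <= delta."""
    sure = maybe = 0
    for c in clusters:
        dq = edit_distance(query, c["center"])
        if dq > c["radius"] + delta:
            continue                      # whole cluster pruned
        for d, count in c["hist"].items():
            if dq + d <= delta:
                sure += count             # every such string qualifies
            elif abs(dq - d) > delta:
                pass                      # no such string can qualify
            else:
                maybe += count            # Sepia would consult the global table here
    return sure, sure + maybe

groups = {"Lucas": ["Lucas", "Lukas", "Lucia", "Luciano"],
          "Maria": ["Maria", "Mario", "Marios", "Marius"]}
clusters = [summarize(center, members) for center, members in groups.items()]
print(bounds("Lucus", clusters, delta=1))
```

The gap between the two bounds is exactly the set of strings for which the global table supplies an estimated fraction.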

Page 39: Example

- δ = 3
- v(q,p1) = <1,1,0>
- v(p1,s) = <1,0,2>
- Global lookup: [<1,1,0>, <1,0,2>, 3]
- Fraction is 25% x 7 = 1.75
- Iterate through F1, and add up contributions

F1:
  Edit Vector   #
  <0, 0, 0>     4
  <0, 0, 1>     12
  <1, 0, 2>     7

Global Table:
  v(q,pi)     v(pi,s)     ed(q,s)   #    %
  <1, 0, 1>   <0, 0, 1>   1         1    14
  <1, 0, 1>   <0, 0, 1>   2         4    57
  <1, 0, 1>   <0, 0, 1>   3         7    100
  <1, 1, 0>   <1, 0, 2>   3         21   25
  <1, 1, 0>   <1, 0, 2>   4         63   75
  <1, 1, 0>   <1, 0, 2>   5         84   100
  …           …

39/68

Page 40: Cons

- Hard to maintain if clusters start drifting
- Hard to find good number of clusters
  - Space/Time tradeoffs
- Needs training to construct good dataset statistics table

40/68

Page 41: VSol – minhash (MBKS07)

- Solution based on minhash
- minhash is used for:
  - Estimate the size of a set |s|
  - Estimate resemblance of two sets
    - I.e., estimating the size of J = |s1∩s2| / |s1∪s2|
  - Estimate the size of the union |s1∪s2|
  - Hence, estimating the size of the intersection
    - |s1∩s2| ≈ J~(s1, s2) · ∪~(s1, s2) (estimated resemblance times estimated union size)

41/68

Page 42: Minhash

- Given a set s = {t1, …, tm}
- Use independent hash functions h1, …, hk:
  - hi: n-gram → [0, 1]
- Hash elements of s, k times
- Keep the k elements that hashed to the smallest value each time
- We reduced set s, from m to k elements
- Denote minhash signature with s’

42/68

Page 43: How to use minhash

- Given two signatures q’, s’:
  - J(q, s) ≈ Σ1≤i≤k I{q’[i]=s’[i]} / k
  - |s| ≈ ( k / Σ1≤i≤k s’[i] ) - 1
  - (q∪s)’ = q’ ∪ s’ = min1≤i≤k(q’[i], s’[i])
  - Hence:
    - |q∪s| ≈ (k / Σ1≤i≤k (q∪s)’[i]) - 1

43/68
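The three uses of a minhash signature listed above can be tried out directly. In this sketch the k "independent" hash functions are simulated by salting SHA-1 with the function index, which is an assumption made for convenience; the function names and the integer test sets are likewise hypothetical.

```python
import hashlib

def h(i, token):
    """i-th hash function: maps a token to a float in [0, 1)."""
    digest = hashlib.sha1(f"{i}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def minhash(tokens, k):
    """Signature s': the smallest hash value under each of the k functions."""
    return [min(h(i, t) for t in tokens) for i in range(k)]

def jaccard_estimate(sig_q, sig_s):
    """Fraction of positions on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig_q, sig_s)) / len(sig_q)

def size_estimate(sig):
    """|s| ≈ k / (sum of minima) - 1, since the minimum of m uniforms averages 1/(m+1)."""
    return len(sig) / sum(sig) - 1

def union_signature(sig_q, sig_s):
    """Signature of q ∪ s: the position-wise minimum of the two signatures."""
    return [min(a, b) for a, b in zip(sig_q, sig_s)]

k = 256
q = set(range(0, 300))
s = set(range(150, 450))          # true Jaccard = 150 / 450
sig_q, sig_s = minhash(q, k), minhash(s, k)
j = jaccard_estimate(sig_q, sig_s)
union = size_estimate(union_signature(sig_q, sig_s))
print(round(j, 2), round(union), round(j * union))  # Jaccard, |q ∪ s|, |q ∩ s| estimates
```

All three numbers are estimates whose error shrinks roughly with the square root of k, so the printed values vary with the hash functions chosen.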

Page 44: VSol Estimator

- Construct one inverted list per n-gram in D
  - The lists are our sets
- Compute a minhash signature for each list

[Figure: inverted lists for n-grams t1, t2, …, t10, each with its minhash signature]

44/68

Page 45: Selectivity Estimation

- Use edit distance length filter:
  - If ed(q, s) ≤ δ, then q and s share at least L = |s| - 1 - n (δ-1) n-grams
- Given query q = {t1, …, tm}:
  - Answer is the size of the union of all non-empty L-intersections (binomial coefficient: m choose L)
  - We can estimate sizes of L-intersections using minhash signatures

45/68
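The bound L = |s| - 1 - n(δ-1) can be checked on small strings. The sketch below assumes padded n-grams counted with multiplicity (n-1 padding characters on each side, so a string of length l has l + n - 1 n-grams) and uses the longer of the two strings for the length; the padding characters and function names are choices made for this example.

```python
from collections import Counter

def padded_ngrams(text, n):
    """Multiset of n-grams of the string padded with n-1 characters on each side."""
    padded = "#" * (n - 1) + text + "$" * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def shared_ngrams(a, b, n):
    """Number of n-grams the two strings have in common, with multiplicity."""
    return sum((padded_ngrams(a, n) & padded_ngrams(b, n)).values())

def min_shared(length, n, delta):
    """L = |s| - 1 - n(δ - 1)."""
    return length - 1 - n * (delta - 1)

def passes_count_filter(q, s, n, delta):
    """Necessary condition for ed(q, s) <= delta."""
    return shared_ngrams(q, s, n) >= min_shared(max(len(q), len(s)), n, delta)

print(min_shared(10, n=3, delta=2))            # the setting of the next slide's example
print(shared_ngrams("stick", "stich", 2))      # one replacement apart
print(passes_count_filter("stick", "stich", n=2, delta=1))
```

Each edit operation can destroy at most n of the padded n-grams, which is where the n·δ term comes from; the filter therefore admits false positives but no false negatives.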

Page 46: Example

- δ = 2, n = 3 ⇒ L = 6
- Look at all 6-intersections of inverted lists
- A = |∪i1, ..., i6 ∈ [1,10] (ti1 ∩ ti2 ∩ … ∩ ti6)|
- There are (10 choose 6) such terms

[Figure: query q = t1, t2, …, t10 and the inverted list of each n-gram]

46/68

Page 47: The m-L Similarity

- Can be done efficiently using minhashes
- Answer:
  - ρ = Σ1≤j≤k I{∃ i1, …, iL: ti1’[j] = … = tiL’[j]}
  - A ≈ ρ |t1 ∪ … ∪ tm|
- Proof very similar to the proof for minhashes

47/68

Page 48: Cons

- Will overestimate results
  - Many L-intersections will share strings
  - Edit distance length filter is loose

48/68

Page 49: OptEQ – wild-card n-grams (LNS07)

- Use extended n-grams:
  - Introduce wild-card symbol ‘?’
  - E.g., “ab?” can be:
    - “aba”, “abb”, “abc”, …
- Build an extended n-gram table:
  - Extract all 1-grams, 2-grams, …, n-grams
  - Generalize to extended 2-grams, …, n-grams
  - Maintain an extended n-grams/frequency hashtable

49/68

Page 50: Example

[Figure: a dataset of strings (abc, def, ghi, …) and its n-gram frequency table, with entries for regular n-grams (ab, bc, de, ef, gh, hi, …, abc, def, …) and extended n-grams (?b, a?, ?c, …)]

50/68

Page 51: Query Expansion (Replacements only)

- Given query q=“abcd”, δ=2
- And replacements only:
  - Base strings:
    - “??cd”, “?b?d”, “?bc?”, “a??d”, “a?c?”, “ab??”
  - Query answer:
    - S1={s∈D: s ∈ “??cd”}, S2=…
    - A = |S1 ∪ S2 ∪ S3 ∪ S4 ∪ S5 ∪ S6| = Σ1≤n≤6 (-1)^(n-1) |S1 ∩ … ∩ Sn|

51/68
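The base strings and the inclusion-exclusion sum for the replacements-only case can be reproduced by brute force. This sketch scans a tiny made-up dataset instead of using an extended n-gram frequency table, so it illustrates the formula rather than the OptEQ estimator; `base_strings`, `matches` and `union_size` are hypothetical names.

```python
from itertools import combinations

def base_strings(query, delta):
    """All ways to replace exactly `delta` positions of `query` with '?'."""
    result = []
    for positions in combinations(range(len(query)), delta):
        chars = list(query)
        for p in positions:
            chars[p] = "?"
        result.append("".join(chars))
    return result

def matches(pattern, text):
    """True if `text` has the pattern's length and agrees on every non-'?' position."""
    return len(pattern) == len(text) and all(p in ("?", c) for p, c in zip(pattern, text))

def union_size(patterns, dataset):
    """|S1 ∪ … ∪ Sm| by inclusion-exclusion over every non-empty subset of patterns."""
    total = 0
    for n in range(1, len(patterns) + 1):
        for subset in combinations(patterns, n):
            size = sum(all(matches(p, s) for p in subset) for s in dataset)
            total += (-1) ** (n - 1) * size
    return total

patterns = base_strings("abcd", 2)
print(patterns)
dataset = ["abcd", "abce", "xbcd", "axyd", "wxyz", "abc", "aacd", "abxx"]
print(union_size(patterns, dataset))
```

The sum ranges over all 2^6 - 1 non-empty subsets of base strings, which is the exponential blow-up that the replacement lattice on the following slides is meant to tame.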

Page 52: Replacement Intersection Lattice

- A = Σ1≤n≤6 (-1)^(n-1) |S1 ∩ … ∩ Sn|
- Need to evaluate size of all 2-intersections, 3-intersections, …, 6-intersections
- Then, use n-gram table to compute sum A
- Exponential number of intersections
- But ... there is well-defined structure

52/68

Page 53: Replacement Lattice

- Build replacement lattice:
  - Many intersections are empty
  - Others produce the same results
    - we need to count everything only once

  2 ‘?’:   ??cd   ?b?d   ?bc?   a??d   a?c?   ab??
  1 ‘?’:   ?bcd   a?cd   ab?d   abc?
  0 ‘?’:   abcd

53/68

Page 54: General Formulas

- Similar reasoning for:
  - r replacements
  - d deletions
- Other combinations difficult:
  - Multiple insertions
  - Combinations of insertions/replacements
- But … we can generate the corresponding lattice algorithmically!
  - Expensive but possible

54/68

Page 55: BasicEQ

- Partition strings by length:
  - Query q with length l
  - Possible matching strings with lengths:
    - [l-δ, l+δ]
- For k = l-δ to l+δ
  - Find all combinations of i+d+r = δ and l+i-d=k
  - If (i,d,r) is a special case use formula
  - Else generate lattice incrementally:
    - Start from query base strings (easy to generate)
    - Begin with 2-intersections and build from there

55/68

Page 56: OptEq

- Details are cumbersome
  - Left for homework
- Various optimizations possible to reduce complexity

56/68

Page 57: Cons

- Fairly complicated implementation
- Expensive
- Works for small edit distance only

57/68

Page 58: Hashed Sampling (HYKS08)

- Used to estimate selectivity of TF/IDF, BM25, DICE (vector space model)
- Main idea:
  - Take a sample of the inverted index
  - But do it intelligently to improve variance

58/68

Page 59: Example

[Figure: the inverted lists for 2-grams at, ch, ck, ic, ri, st, ta, ti, tu, uc, with a sample of the ids in each list]

- Take a sample of the inverted index

59/68

Page 60: Example (Cont.)

[Figure: the inverted lists for 2-grams at, ch, ck, ic, ri, st, ta, ti, tu, uc, with the same ids sampled across all lists]

- But do it intelligently to improve variance

60/68

Page 61: Construction

- Draw samples deterministically:
  - Use a hash function h: N → [0, 100]
  - Keep ids that hash to values smaller than σ
- Invariant:
  - If a given id is sampled in one list, it will always be sampled in all other lists that contain it:
    - S(q, s) can be computed directly from the sample
    - No need to store complete sets in the sample
    - No need for extra I/O to compute scores

61/68
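The deterministic sampling rule and its invariant can be sketched as follows. The hash function (SHA-1 scaled to [0, 100)), the toy inverted lists, and the names `h100`, `sample_list` and `scale_up` are all assumptions made for illustration.

```python
import hashlib

def h100(string_id):
    """Deterministic hash of a string id into [0, 100)."""
    digest = hashlib.sha1(str(string_id).encode()).digest()
    return 100 * int.from_bytes(digest[:8], "big") / 2 ** 64

def sample_list(postings, sigma):
    """Keep the ids whose hash value is smaller than sigma."""
    return [i for i in postings if h100(i) < sigma]

def scale_up(answer_in_sample, distinct_union, distinct_sampled_union):
    """A = |A_sigma| * |t1 ∪ … ∪ tm| / |t1_sigma ∪ … ∪ tm_sigma|."""
    if distinct_sampled_union == 0:
        return 0.0
    return answer_in_sample * distinct_union / distinct_sampled_union

# Toy inverted lists over string ids 0..199.
lists = {"t1": list(range(0, 200, 2)),
         "t2": list(range(0, 200, 3)),
         "t3": list(range(0, 200, 5))}
sigma = 30
samples = {t: sample_list(ids, sigma) for t, ids in lists.items()}

union = set().union(*lists.values())
sampled_union = set().union(*samples.values())
print(len(union), len(sampled_union))
print(scale_up(answer_in_sample=4, distinct_union=len(union),
               distinct_sampled_union=len(sampled_union)))
```

Because the decision depends only on the id, a sampled string appears in the sample of every list that contains it, so its full score can be computed from the sampled lists alone.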

Page 62: Selectivity Estimation

- The union of arbitrary list samples is a σ% sample
- Given query q = {t1, …, tm}:
  - A = |Aσ| |t1 ∪ … ∪ tm| / |tσ1 ∪ … ∪ tσm|:
    - Aσ is the query answer size from the sample
    - The fraction is the actual scale-up factor
    - But there are duplicates in these unions!
  - We need to know:
    - The distinct number of ids in t1 ∪ … ∪ tm
    - The distinct number of ids in tσ1 ∪ … ∪ tσm

62/68

Page 63: Count Distinct

- Distinct |tσ1 ∪ … ∪ tσm| is easy:
  - Scan the sampled lists
- Distinct |t1 ∪ … ∪ tm| is hard:
  - Scanning the lists is the same as computing the exact answer to the query … naively
  - We are lucky:
    - Each list sample doubles up as a k-minimum value estimator by construction!
    - We can use the list samples to estimate the distinct |t1 ∪ … ∪ tm|

63/68

Page 64: The k-Minimum Value Synopsis

- It is used to estimate the distinct size of arbitrary set unions (the same as FM sketch):
  - Take hash function h: N → [0, 100]
  - Hash each element of the set
  - The r-th smallest hash value is an unbiased estimator of count distinct:

[Figure: the interval [0, 100] with the r-th smallest hash value hr marked; if hr corresponds to r elements, then 100 corresponds to ? elements]

64/68
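A k-minimum value synopsis takes only a few lines. In this sketch the hash range is normalized to [0, 1) rather than [0, 100], and the estimate uses the common (r - 1) / hr form of the proportion shown on the slide; the hash function and the names `kmv`, `merge` and `estimate` are assumptions made for this example.

```python
import hashlib

def h01(item):
    """Deterministic hash of an item into [0, 1)."""
    digest = hashlib.sha1(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def kmv(items, r):
    """Synopsis: the r smallest distinct hash values, in ascending order."""
    return sorted({h01(x) for x in items})[:r]

def merge(a, b, r):
    """Synopsis of a union, computed from the two synopses alone."""
    return sorted(set(a) | set(b))[:r]

def estimate(synopsis, r):
    """Distinct-count estimate from the r-th smallest hash value."""
    if len(synopsis) < r:
        return float(len(synopsis))   # fewer than r distinct items: the count is exact
    return (r - 1) / synopsis[r - 1]

r = 256
left = kmv(list(range(0, 3000)) * 2, r)     # duplicates do not matter
right = kmv(range(2000, 6000), r)
both = merge(left, right, r)
print(round(estimate(left, r)), round(estimate(both, r)))
```

Merging works because the r smallest hash values of a union are always among the r smallest values of its parts, which is what lets per-list samples answer questions about arbitrary unions of lists.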

Page 65: Outline

- Motivation and preliminaries
- Inverted list based algorithms
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions

65/68

Page 66: Future Directions

- Result ranking
  - In practice need to run multiple types of searches
  - Need to identify the “best” results
- Diversity of query results
  - Some queries have multiple meanings
  - E.g., “Jaguar”
- Updates
  - Incremental maintenance

66/68

Page 67: References

[AGK06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik: Efficient Exact Set-Similarity Joins. VLDB 2006
[BJL+09] Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009
[HCK+08] Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava: Fast Indexes and Algorithms for Set Similarity Selection Queries. ICDE 2008
[HYK+08] Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava: Hashed samples: selectivity estimators for set similarity selection queries. PVLDB 2008
[JL05] Liang Jin, Chen Li: Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. VLDB 2005
[KSS06] Nick Koudas, Sunita Sarawagi, Divesh Srivastava: Record linkage: Similarity measures and algorithms. SIGMOD 2006
[LLL08] Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008
[LNS07] Hongrae Lee, Raymond T. Ng, Kyuseok Shim: Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance. VLDB 2007
[LWY07] Chen Li, Bin Wang, Xiaochun Yang: VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. VLDB 2007
[MBK+07] Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava: Estimating the selectivity of approximate string queries. ACM TODS 2007
[XWL08] Chuan Xiao, Wei Wang, Xuemin Lin: Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 2008

67/68

Page 68: References

[XWL+08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient similarity joins for near duplicate detection. WWW 2008
[YWL08] Xiaochun Yang, Bin Wang, Chen Li: Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. SIGMOD 2008
[JLV08] L. Jin, C. Li, R. Vernica: SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. VLDBJ 2008
[CGK06] S. Chaudhuri, V. Ganti, R. Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006
[CCGX08] K. Chakrabarti, S. Chaudhuri, V. Ganti, D. Xin: An Efficient Filter for Approximate Membership Checking. SIGMOD 2008
[SK04] Sunita Sarawagi, Alok Kirpal: Efficient set joins on similarity predicates. SIGMOD Conference 2004: 743-754
[BK02] Jérémy Barbay, Claire Kenyon: Adaptive intersection and t-threshold problems. SODA 2002: 390-399
[CGG+05] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis: Data cleaning in microsoft SQL server 2005. SIGMOD Conference 2005: 918-920

68/68