
Fast Indexes and Algorithms for Set Similarity Selection Queries

M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava

Strings as sets

s1 = “Main St. Maine”:
• ‘Main’ ‘St.’ ‘Maine’
• ‘Mai’ ‘ain’ ‘in ’ ‘n S’ ‘ St’ ‘St.’ ‘t. ’ …

s2 = “Main St. Main”:
• ‘Main’ ‘St.’ ‘Main’
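The two decompositions above (words and overlapping 3-grams) can be sketched in Python as follows. The helper names `word_tokens` and `qgrams` are illustrative and not from the talk, and splitting on whitespace is an assumption about how the word tokens were produced.

```python
def word_tokens(s):
    """Split a string on whitespace into word tokens (duplicates are kept)."""
    return s.split()


def qgrams(s, q=3):
    """Return all overlapping character q-grams of s, in order of appearance."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


s1 = "Main St. Maine"
s2 = "Main St. Main"

words1 = word_tokens(s1)   # word-level view of s1
words2 = word_tokens(s2)   # 'Main' occurs twice in s2
grams1 = qgrams(s1)        # 3-gram view of s1, spaces included
```

Word tokens keep the strings short as sets, while q-grams tolerate typos inside a word at the cost of longer sets.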

How similar are s1 and s2?

TF/IDF weighted similarity

Inverse Document Frequency (idf):
• ‘Main’ is common
• ‘Maine’ is not
• $\mathrm{idf}(t) = \log_2\bigl(1 + N / \mathrm{df}(t)\bigr)$
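A small sketch of the idf formula above, assuming N is the number of sets in the collection and df(t) is the number of sets that contain token t. The toy collection and the name `idf_table` are made up for illustration.

```python
import math
from collections import Counter


def idf_table(sets):
    """Map each token t to log2(1 + N / df(t)) over a collection of token lists."""
    n = len(sets)
    # df counts each token once per set, regardless of repetitions inside the set
    df = Counter(t for s in sets for t in set(s))
    return {t: math.log2(1 + n / df[t]) for t in df}


# Hypothetical collection: 'Main' occurs in three sets, 'Maine' in one
collection = [
    ["Main", "St.", "Maine"],
    ["Main", "St.", "Main"],
    ["Main", "Ave."],
    ["Elm", "St."],
]
idf = idf_table(collection)
```

In this collection the common token 'Main' gets a lower idf than the rare token 'Maine', which is the behaviour the slide describes.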

Term Frequency (tf):
• ‘Main’ appears twice in s2

Similarity:
• Inner Product

Is TF important?

Information retrieval:
• Given a query string, retrieve relevant documents

Relational databases:
• Given a query string, retrieve relevant strings

In practice TF is small in many applications

IDF similarity

Query q = {t1, …, tn}

Set s = {r1, …, rm}

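Length $\mathrm{len}(s) = \bigl(\sum_{t \in s} \mathrm{idf}(t)^2\bigr)^{1/2}$

$I(q, s) = \sum_{t \in s \cap q} \mathrm{idf}(t)^2 \,/\, \bigl(\mathrm{len}(s)\,\mathrm{len}(q)\bigr)$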

IDF is as good as TF/IDF in practice!
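The IDF similarity defined on this slide can be computed directly, as in the sketch below. The idf values are hypothetical round numbers chosen to keep the arithmetic easy to follow, and the function names are illustrative.

```python
import math


def length(s, idf):
    """len(s): square root of the sum of squared idf over the distinct tokens of s."""
    return math.sqrt(sum(idf[t] ** 2 for t in set(s)))


def idf_similarity(q, s, idf):
    """I(q, s): squared idf mass of the shared tokens, normalised by both lengths."""
    common = set(q) & set(s)
    return sum(idf[t] ** 2 for t in common) / (length(q, idf) * length(s, idf))


# Hypothetical idf values, not taken from any real dataset
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0}
q = ["Main", "St.", "Maine"]
s = ["Main", "St."]

score_qs = idf_similarity(q, s, idf)   # 2 / (sqrt(6) * sqrt(2))
score_qq = idf_similarity(q, q, idf)   # a set compared with itself scores 1
```

Because tokens are treated as a set, repeated tokens (term frequency) do not affect the score.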

How can I build an index?

Let $w(t, s) = \mathrm{idf}(t) / \mathrm{len}(s)$

Then $I(q, s) = \sum_{t \in q \cap s} w(t, s)\, w(t, q)$

So:
• Decompose strings into tokens
• Compute the idf of each token
• Create one inverted list per token

Sort lists by string id: Do a merge join

Sort lists by w: Run TA/NRA
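The sort-by-id option can be sketched as below: one inverted list per token holding (set id, w(t, s)) pairs in id order, and a merge join over the query's lists that sums w(t, s) w(t, q) per set. This is a minimal illustration, assuming integer set ids, a precomputed idf table that covers every query token, and a selection threshold called `tau`; none of the names come from the talk.

```python
import heapq
import math
from itertools import groupby


def build_index(sets, idf):
    """Build token -> [(set_id, w(t, s))], each list sorted by set id."""
    lens = {sid: math.sqrt(sum(idf[t] ** 2 for t in set(s)))
            for sid, s in sets.items()}
    index = {}
    for sid in sorted(sets):                 # visiting ids in order keeps lists sorted
        for t in set(sets[sid]):
            index.setdefault(t, []).append((sid, idf[t] / lens[sid]))
    return index


def merge_join(index, q, idf, tau):
    """Return [(set_id, I(q, s))] for all sets with I(q, s) >= tau."""
    qlen = math.sqrt(sum(idf[t] ** 2 for t in set(q)))
    # One stream per query token, each entry carrying w(t, s) * w(t, q)
    streams = [[(sid, w * idf[t] / qlen) for sid, w in index.get(t, [])]
               for t in set(q)]
    results = []
    # heapq.merge interleaves the id-sorted streams; groupby sums per set id
    for sid, group in groupby(heapq.merge(*streams), key=lambda e: e[0]):
        score = sum(part for _, part in group)
        if score >= tau:
            results.append((sid, score))
    return results


# Hypothetical data
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0, "Elm": 2.0, "Ave.": 1.0}
sets = {1: ["Main", "St.", "Maine"], 2: ["Main", "St."], 3: ["Elm", "Ave."]}
index = build_index(sets, idf)
hits = merge_join(index, ["Main", "St.", "Maine"], idf, tau=0.5)
strict_hits = merge_join(index, ["Main", "St.", "Maine"], idf, tau=0.9)
```

The merge join always reads every query list in full, which is the cost the w-sorted alternatives try to avoid.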

Example: Sort by id

Example: Sort by w

NRA:
• Round robin list accesses
• Main memory hash table
• Computes lower and upper bounds per entry
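The three ingredients listed above can be put together roughly as follows. This is a simplified sketch of an NRA-style scan adapted to a threshold `tau` instead of top-k; the stopping rule and all names are assumptions for illustration, not the authors' implementation. Each input list holds (set id, contribution) pairs in decreasing contribution order, where the contribution stands for w(t, s) w(t, q).

```python
def nra_selection(lists, tau):
    """Return sorted ids of sets whose summed contribution is >= tau."""
    n = len(lists)
    pos = [0] * n
    # frontier[i]: upper bound on any contribution still unread in list i
    frontier = [lst[0][1] if lst else 0.0 for lst in lists]
    seen = {}  # hash table: set_id -> {list index: contribution}

    def upper(parts):
        # Lower bound plus the best the unread part of the other lists could add
        return sum(parts.values()) + sum(
            frontier[i] for i in range(n) if i not in parts)

    while True:
        progressed = False
        for i, lst in enumerate(lists):          # round-robin sorted access
            if pos[i] < len(lst):
                sid, c = lst[pos[i]]
                pos[i] += 1
                seen.setdefault(sid, {})[i] = c
                frontier[i] = c if pos[i] < len(lst) else 0.0
                progressed = True
        unseen_possible = sum(frontier) >= tau   # could an unseen set still qualify?
        undecided = any(sum(p.values()) < tau <= upper(p) for p in seen.values())
        if not progressed or not (unseen_possible or undecided):
            break
    return sorted(sid for sid, p in seen.items() if sum(p.values()) >= tau)


# Hypothetical contributions for a two-token query
example_lists = [
    [(2, 0.5), (1, 0.3), (3, 0.1)],
    [(2, 0.4), (3, 0.2)],
]
answers = nra_selection(example_lists, 0.6)
all_answers = nra_selection(example_lists, 0.25)
```

With threshold 0.6 the scan stops after two rounds and never reads the last entry of the first list, because no remaining set can reach the threshold.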

Semantic properties of IDF

Order Preservation:
• For all $t_1 \ne t_2$: if $w(t_1, s) < w(t_1, r)$, then $w(t_2, s) < w(t_2, r)$

Length Boundedness:
• Query q, set s, threshold $\tau$
– $I(q, s) \ge \tau \;\Rightarrow\; \tau\,\mathrm{len}(q) \le \mathrm{len}(s) \le \mathrm{len}(q)/\tau$
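Length Boundedness follows in two steps from the definition of I(q, s), writing $\tau$ for the threshold. The derivation below is a reconstruction from the definitions on the earlier slides rather than a quotation from the talk.

```latex
% The shared tokens are a subset of both s and q, so their squared idf mass
% is bounded by the smaller of the two squared lengths:
\sum_{t \in q \cap s} \mathrm{idf}(t)^2
  \le \min\bigl(\mathrm{len}(s)^2,\ \mathrm{len}(q)^2\bigr)

% Dividing by len(s) len(q) gives
I(q, s) \le \frac{\min\bigl(\mathrm{len}(s), \mathrm{len}(q)\bigr)}
                 {\max\bigl(\mathrm{len}(s), \mathrm{len}(q)\bigr)}

% Hence I(q, s) >= tau forces
%   len(s) <= len(q) / tau   when len(s) >= len(q), and
%   len(s) >= tau len(q)     when len(s) <= len(q).
\tau\,\mathrm{len}(q) \le \mathrm{len}(s) \le \mathrm{len}(q)/\tau
```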

Improved NRA (iNRA)

Order Preservation determines if a given set appears in a list or not
• ti: encounter s1, then s2
• tk: encounter s2 first, so s1 does not appear in tk

Length Boundedness restricts the search to a small portion of each list
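Since w(t, s) = idf(t) / len(s), a list sorted by decreasing w is also sorted by increasing len(s), so the portion that Length Boundedness allows is one contiguous window. A sketch of locating it by binary search, assuming the set lengths of one list are available as a sorted Python list and writing `tau` for the threshold (the function name is illustrative):

```python
from bisect import bisect_left, bisect_right


def length_window(lengths, qlen, tau):
    """Return (lo, hi) so that lengths[lo:hi] are the entries with
    tau * qlen <= len(s) <= qlen / tau; `lengths` must be ascending."""
    lo = bisect_left(lengths, tau * qlen)
    hi = bisect_right(lengths, qlen / tau)
    return lo, hi


# Hypothetical set lengths for one inverted list
lengths = [1.0, 2.0, 3.0, 5.0, 8.0, 9.0]
lo, hi = length_window(lengths, qlen=4.0, tau=0.5)   # window [2.0, 8.0]
window = lengths[lo:hi]
```

Entries outside the window can be skipped without being read, whatever algorithm consumes the list.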

Something surprising

Lemma: NRA reads arbitrarily more elements than iNRA

Lemma: NRA reads arbitrarily more elements than any algorithm that uses the Length Boundedness property

Any other strategies?

NRA style is breadth-first

Try depth-first:
• Sort query lists in decreasing idf order
– Let q = {t1, …, tn} and idf(t1) > idf(t2) > … > idf(tn)
• Let $\lambda_i$ be the maximum length a set s in ti can have s.t. $I(q, s) \ge \tau$, assuming that s exists in all tk, k > i
– $\lambda_i = \sum_{i \le k \le n} \mathrm{idf}(t_k)^2 \,/\, \bigl(\tau\,\mathrm{len}(q)\bigr)$
• $\lambda_i$ is a natural cutoff point
• $\lambda_1 > \lambda_2 > \dots > \lambda_n$
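The cutoff points can be computed with one backward pass over the query's idf values. The sketch below writes `tau` for the threshold and returns the cutoffs for lists t1 … tn, using the reading that the cutoff for list i is the squared idf mass of tokens i through n divided by tau · len(q); the function name is illustrative.

```python
import math


def cutoffs(query_idfs, tau):
    """query_idfs: idf of each query token, in decreasing order.
    Returns the list of per-list length cutoffs, first list first."""
    qlen = math.sqrt(sum(v * v for v in query_idfs))
    out = []
    suffix = 0.0
    for v in reversed(query_idfs):      # accumulate idf^2 from the last token back
        suffix += v * v
        out.append(suffix / (tau * qlen))
    out.reverse()
    return out


# Hypothetical query with idf values 4 and 3, so len(q) = 5
lam = cutoffs([4.0, 3.0], tau=0.5)      # cutoffs 10.0 and 3.6
```

The first cutoff equals len(q) / tau, the Length Boundedness upper limit, and later lists get strictly smaller cutoffs.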

Shortest-First

Sort q = {t1, …, tn} in decreasing idf order

Let C be the candidate set

For 1 <= i <= n
• Skip to first entry with $\mathrm{len}(s) \ge \tau\,\mathrm{len}(q)$
• Compute $\lambda_i$
• Let $\lambda_i = \min(\lambda_i, \mathrm{len}(q)/\tau)$
• Repeat
– s = pop next element from ti
– Maintain lower/upper bounds of entries in C
• Until $\mathrm{len}(s) > \max(\max_{r \in C} \mathrm{len}(r), \lambda_i)$
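The loop above can be sketched in Python as follows. This is an illustrative reading of the slide, not the authors' code: lists hold (len(s), set id) pairs in increasing length order, `tau` is the threshold, the "skip" step is a plain linear skip rather than an index lookup, and candidates are pruned once their upper bound drops below `tau`.

```python
import math


def shortest_first(lists, idfs, tau):
    """lists[i]: (len(s), set_id) pairs for query token i, ascending by len(s).
    idfs[i]: idf of token i, tokens given in decreasing idf order.
    Returns {set_id: I(q, s)} for all sets with I(q, s) >= tau."""
    n = len(idfs)
    qlen = math.sqrt(sum(v * v for v in idfs))
    # suffix[i] = sum of idf^2 over tokens i..n-1, with suffix[n] = 0
    suffix = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + idfs[i] ** 2
    cand = {}  # set_id -> [len(s), idf^2 mass matched so far]
    for i in range(n):
        lam = min(suffix[i] / (tau * qlen), qlen / tau)   # cutoff for list i
        max_len_c = max((c[0] for c in cand.values()), default=0.0)
        limit = max(max_len_c, lam)
        for slen, sid in lists[i]:
            if slen < tau * qlen:
                continue            # too short; a real index would skip directly
            if slen > limit:
                break               # no candidate or new qualifying set beyond here
            if sid in cand:
                cand[sid][1] += idfs[i] ** 2
            elif slen <= lam:       # first seen here: can still reach tau
                cand[sid] = [slen, idfs[i] ** 2]
        # Drop candidates whose upper bound (all later tokens matching) is < tau
        cand = {sid: c for sid, c in cand.items()
                if (c[1] + suffix[i + 1]) / (c[0] * qlen) >= tau}
    return {sid: c[1] / (c[0] * qlen) for sid, c in cand.items()
            if c[1] / (c[0] * qlen) >= tau}


# Hypothetical data: query {Maine, Main, St.} with idf 2, 1, 1
r2, r6, r15 = math.sqrt(2), math.sqrt(6), math.sqrt(15)
sf_lists = [
    [(r6, 1)],                       # list for 'Maine'
    [(r2, 2), (r6, 1), (r15, 4)],    # list for 'Main'
    [(r2, 2), (r6, 1)],              # list for 'St.'
]
sf_hits = shortest_first(sf_lists, [2.0, 1.0, 1.0], tau=0.5)
sf_strict = shortest_first(sf_lists, [2.0, 1.0, 1.0], tau=0.9)
```

Each list is scanned once, front to back, and the scan of a list ends as soon as the lengths exceed both the cutoff and the longest surviving candidate.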

Comparison with NRA

Lemma: Let q = {t1, …, tn} and let d be the maximum depth to which SF descends over all lists. In the worst case iNRA will read (d - 1)(n - 1) more elements than SF

But surprisingly

A hybrid strategy

Run iNRA normally

Use $\lambda_i$ and the maximum length in C to stop reading from a particular list
• This guarantees that iNRA stops with or before SF

Drawback of NRA variants:
• Very high bookkeeping cost compared to SF

Experiments

DBLP, IMDB and YellowPages datasets

Actors, movies, authors, businesses etc.

Vary threshold, query size, query strings and mistakes

Test wall-clock time, pruning power

Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, Improved SQL-based

Wall-clock time vs. Threshold

Wall-clock time vs. Query size

[Figures: series TA, NRA, Sort-by-id, iTA, SF]

Conclusion

Proposed a simplified TF/IDF measure

Identified strong monotonicity properties

Used the properties to design efficient algorithms

SF works best overall in practice
• Achieves sub-second answers in most practical cases

Pruning power vs. Threshold

Pruning power vs. Query size

[Figures: series NRA, TA, iTA]
