Fast Indexes and Algorithms for Set Similarity Selection Queries
![Page 1: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/1.jpg)
Fast Indexes and Algorithms For Set Similarity Selection Queries
M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava
![Page 2: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/2.jpg)
Strings as sets
s1 = “Main St. Maine”:
• Words: ‘Main’ ‘St.’ ‘Maine’
• 3-grams: ‘Mai’ ‘ain’ ‘in ’ ‘n S’ ‘ St’ ‘St.’ ‘t. ’ …
s2 = “Main St. Main”:
• Words: ‘Main’ ‘St.’ ‘Main’
How similar are s1 and s2?
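The two decompositions on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the function names `word_tokens` and `qgrams` are hypothetical, and splitting on whitespace is an assumption about how the word tokens were produced.

```python
def word_tokens(s):
    """Split on whitespace; duplicates are kept, so the result is a multiset."""
    return s.split()

def qgrams(s, q=3):
    """All overlapping substrings of length q, in order of appearance."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

s1 = "Main St. Maine"
s2 = "Main St. Main"

print(word_tokens(s1))  # ['Main', 'St.', 'Maine']
print(qgrams(s1)[:7])   # ['Mai', 'ain', 'in ', 'n S', ' St', 'St.', 't. ']
print(word_tokens(s2))  # ['Main', 'St.', 'Main']
```

Either token list can then be treated as the set (or multiset) representation of the string.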
![Page 3: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/3.jpg)
TF/IDF weighted similarity
Inverse Document Frequency (idf):
• ‘Main’ is common
• ‘Maine’ is not
• $\mathrm{idf}(t) = \log_2\left[1 + N / \mathrm{df}(t)\right]$
Term Frequency (tf):
• ‘Main’ appears twice in s2
Similarity:
• Inner Product
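As an illustration of the idf formula, the sketch below computes idf values over a three-string toy collection. The collection is made up for this example, and the sketch assumes N is the number of sets in the collection and df(t) is the number of sets that contain t.

```python
import math
from collections import Counter

def idf_table(collection):
    """idf(t) = log2(1 + N / df(t)), with df(t) = number of sets containing t."""
    n = len(collection)
    df = Counter(t for tokens in collection for t in set(tokens))
    return {t: math.log2(1 + n / count) for t, count in df.items()}

# Toy collection (hypothetical data, not from the experiments).
collection = [
    ["Main", "St.", "Maine"],
    ["Main", "St.", "Main"],
    ["Main", "Ave."],
]
idf = idf_table(collection)
tf = Counter(collection[1])  # term frequencies of s2

print(idf["Main"], idf["Maine"])  # the common token gets the smaller idf
print(tf["Main"])                 # 'Main' appears twice in s2
```

Here 'Main' occurs in every string and so receives the lowest idf, while 'Maine' occurs once and receives the highest.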
![Page 4: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/4.jpg)
Is TF important?
Information retrieval:
• Given a query string, retrieve relevant documents
Relational databases:
• Given a query string, retrieve relevant strings
In practice TF is small in many applications
![Page 5: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/5.jpg)
IDF similarity
Query q = {t1, …, tn}
Set s = {r1, …, rm}
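Length $\mathrm{len}(s) = \left(\sum_{t \in s} \mathrm{idf}(t)^2\right)^{1/2}$
$I(q, s) = \sum_{t \in s \cap q} \mathrm{idf}(t)^2 \,/\, \left(\mathrm{len}(s)\,\mathrm{len}(q)\right)$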
IDF is as good as TF/IDF in practice!
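The length and similarity definitions on this slide translate directly into code. The following is a sketch only: the idf values are invented for illustration and the function names are hypothetical.

```python
import math

def set_length(tokens, idf):
    """len(s) = sqrt(sum of idf(t)^2 over the distinct tokens of s)."""
    return math.sqrt(sum(idf[t] ** 2 for t in set(tokens)))

def idf_similarity(q, s, idf):
    """I(q, s) = sum of idf(t)^2 over shared tokens, divided by len(s) * len(q)."""
    shared = set(q) & set(s)
    return sum(idf[t] ** 2 for t in shared) / (set_length(q, idf) * set_length(s, idf))

# Invented idf values, chosen only to keep the arithmetic easy to follow.
idf = {"Main": 1.0, "St.": 1.5, "Maine": 2.0}
q = ["Main", "St.", "Maine"]
s = ["Main", "St."]

print(idf_similarity(q, q, idf))  # a set compared with itself scores (about) 1
print(idf_similarity(q, s, idf))  # missing the rare token 'Maine' costs a lot
```

Because term frequency is ignored, duplicate tokens are collapsed with `set(...)` before scoring.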
![Page 6: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/6.jpg)
How can I build an index?
Let w(t, s) = idf(t) / len(s)
Then $I(q, s) = \sum_{t \in q \cap s} w(t, s)\, w(t, q)$
So:
• Decompose strings into tokens
• Compute the idf of each token
• Create one inverted list per token
Sort lists by string id: do a merge join
Sort lists by w: run TA/NRA
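The sort-by-id option can be sketched as follows: build one inverted list per token holding (set id, w(t, s)) pairs, then merge the query's lists by id and sum the products w(t, s) w(t, q). All names (`build_index`, `merge_join`) and the idf values are assumptions made for this example, and the query tokens are assumed to be present in the idf table.

```python
import heapq
import math
from collections import defaultdict
from itertools import groupby

def set_length(tokens, idf):
    return math.sqrt(sum(idf[t] ** 2 for t in set(tokens)))

def build_index(collection, idf):
    """One inverted list per token; entries are (set_id, w(t, s)), sorted by set_id."""
    index = defaultdict(list)
    for sid, tokens in enumerate(collection):
        ln = set_length(tokens, idf)
        for t in sorted(set(tokens)):
            index[t].append((sid, idf[t] / ln))  # appended in increasing sid order
    return index

def merge_join(query, index, idf, tau):
    """Merge the query's lists by set id and keep sets with I(q, s) >= tau."""
    qlen = set_length(query, idf)
    streams = [[(sid, w * idf[t] / qlen) for sid, w in index.get(t, [])]
               for t in set(query)]
    hits = []
    for sid, group in groupby(heapq.merge(*streams), key=lambda e: e[0]):
        score = sum(part for _, part in group)
        if score >= tau:
            hits.append((sid, score))
    return hits

idf = {"Main": 1.0, "St.": 1.5, "Maine": 2.0, "Ave.": 2.0}  # invented values
collection = [["Main", "St.", "Maine"], ["Main", "St.", "Main"], ["Main", "Ave."]]
index = build_index(collection, idf)
hits = merge_join(["Main", "St.", "Maine"], index, idf, tau=0.6)
print(hits)
```

The merge join always scans every query list in full, which is the cost the weight-sorted alternatives try to avoid.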
![Page 7: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/7.jpg)
Example: Sort by id
![Page 8: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/8.jpg)
Example: Sort by w
NRA:
• Round robin list accesses
• Main memory hash table
• Computes lower and upper bounds per entry
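The three ingredients listed for NRA can be illustrated with a simplified threshold variant. This is a sketch under stated assumptions, not the implementation evaluated in the talk: each list already holds the per-token contribution w(t, s) w(t, q) sorted in decreasing order, the threshold is called `tau`, and the numbers are invented.

```python
def nra_select(lists, tau):
    """lists: {token: [(set_id, contribution), ...]}, each sorted by contribution
    in decreasing order. Returns (ids with score >= tau, number of entries read)."""
    tokens = list(lists)
    pos = {t: 0 for t in tokens}
    frontier = {t: float("inf") for t in tokens}  # last value read from each list
    seen = {}                                     # hash table: set_id -> {token: contribution}
    reads = 0
    while True:
        for t in tokens:                          # round robin: one sorted access per list
            if pos[t] < len(lists[t]):
                sid, c = lists[t][pos[t]]
                pos[t] += 1
                reads += 1
                frontier[t] = c
                seen.setdefault(sid, {})[t] = c
            if pos[t] == len(lists[t]):
                frontier[t] = 0.0                 # nothing left to discover in this list
        lower = {s: sum(parts.values()) for s, parts in seen.items()}
        upper = {s: lower[s] + sum(frontier[t] for t in tokens if t not in seen[s])
                 for s in seen}
        unseen_may_qualify = sum(frontier.values()) >= tau
        undecided = any(lower[s] < tau <= upper[s] for s in seen)
        exhausted = all(pos[t] == len(lists[t]) for t in tokens)
        if exhausted or not (unseen_may_qualify or undecided):
            return sorted(s for s in seen if lower[s] >= tau), reads

# Invented contributions for a three-token query.
lists = {
    "Maine": [(0, 0.55)],
    "St.":   [(1, 0.35), (0, 0.31)],
    "Main":  [(1, 0.15), (0, 0.14), (2, 0.08)],
}
print(nra_select(lists, tau=0.6))
```

In this toy run the scan stops after five of the six entries, because no unseen set can still reach the threshold and every candidate is decided.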
![Page 9: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/9.jpg)
Semantic properties of IDF
Order Preservation:
• For all $t_1 \neq t_2$: if $w(t_1, s) < w(t_1, r)$, then $w(t_2, s) < w(t_2, r)$
Length Boundedness:
• Query q, set s, threshold $\tau$
– $I(q, s) \ge \tau \;\Rightarrow\; \tau\,\mathrm{len}(q) \le \mathrm{len}(s) \le \mathrm{len}(q) / \tau$
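Both properties follow in a line or two from the definitions of w and I given earlier. The derivation below is a reconstruction offered as a reading aid; the symbol $\tau$ for the threshold is an assumption, since the transcript lost the original symbol.

```latex
% Order Preservation: w(t, s) = idf(t) / len(s), and idf(t) does not depend on s.
w(t_1, s) < w(t_1, r)
  \iff \mathrm{len}(s) > \mathrm{len}(r)
  \iff w(t_2, s) < w(t_2, r)

% Length Boundedness: the shared tokens are a subset of both q and s, so
\sum_{t \in q \cap s} \mathrm{idf}(t)^2 \le \min\left(\mathrm{len}(q)^2, \mathrm{len}(s)^2\right)

% and therefore
I(q, s) \le \min\left(\frac{\mathrm{len}(q)}{\mathrm{len}(s)},
                      \frac{\mathrm{len}(s)}{\mathrm{len}(q)}\right)

% so I(q, s) >= tau forces both ratios to be at least tau:
\tau\,\mathrm{len}(q) \le \mathrm{len}(s) \le \mathrm{len}(q) / \tau
```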
![Page 10: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/10.jpg)
Improved NRA
Order Preservation determines if a given set appears in a list or not
• ti: encounter s1, then s2
• tk: encounter s2 first, so s1 cannot appear in tk
Length Boundedness restricts the search to a small portion of each list
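Both pruning ideas can be sketched on a single inverted list. Sorting by w in decreasing order is the same as sorting by set length in increasing order, so the sketch works with lengths directly. The helper names and the threshold name `tau` are hypothetical.

```python
from bisect import bisect_left, bisect_right

def length_window(lengths, qlen, tau):
    """lengths: ascending set lengths of one inverted list.
    Returns slice bounds (lo, hi) covering tau*qlen <= len(s) <= qlen/tau."""
    lo = bisect_left(lengths, tau * qlen)
    hi = bisect_right(lengths, qlen / tau)
    return lo, hi

def cannot_contain(frontier_len, candidate_len):
    """Order Preservation: once a list has been read up to length frontier_len,
    a strictly shorter candidate that has not shown up is not in that list."""
    return candidate_len < frontier_len

lengths = [1.0, 2.0, 3.0, 4.0, 5.0, 8.0]      # invented list contents
print(length_window(lengths, qlen=4.0, tau=0.5))  # entries with length in [2.0, 8.0]
print(length_window(lengths, qlen=4.0, tau=0.8))  # entries with length in [3.2, 5.0]
print(cannot_contain(3.0, 2.0))
```

The second check lets an NRA-style algorithm drop a list from a candidate's upper bound without reading further.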
![Page 11: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/11.jpg)
Something surprising
Lemma: NRA can read arbitrarily more elements than iNRA
Lemma: NRA can read arbitrarily more elements than any algorithm that uses the Length Boundedness property
![Page 12: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/12.jpg)
Any other strategies?
NRA style is breadth-first
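Try depth-first:
• Sort query lists in decreasing idf order
– Let q = {t1, …, tn} and idf(t1) > idf(t2) > … > idf(tn)
• Let $\lambda_i$ be the maximum length that a set s first seen in list ti can have s.t. $I(q, s) \ge \tau$ for threshold $\tau$, assuming that s appears in every list tk with k ≥ i
– $\lambda_i = \sum_{i \le k \le n} \mathrm{idf}(t_k)^2 \,/\, \left(\tau\,\mathrm{len}(q)\right)$
• $\lambda_i$ is a natural cutoff point
• $\lambda_1 > \lambda_2 > \dots > \lambda_n$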
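The cutoff points can be computed with one backward pass over the query's idf values. In this sketch the names `cutoffs`, `tau` (threshold) and the returned list (the per-list cutoffs) are assumptions, since the transcript lost the original symbols, and the idf values are invented.

```python
import math

def cutoffs(query_idfs, tau):
    """For idf values sorted in decreasing order, return the per-list cutoffs:
    cutoff_i = (sum of idf(t_k)^2 for k >= i) / (tau * len(q))."""
    idfs = sorted(query_idfs, reverse=True)
    qlen = math.sqrt(sum(v * v for v in idfs))
    result = []
    suffix = 0.0
    for v in reversed(idfs):      # accumulate the suffix sums from the last list
        suffix += v * v
        result.append(suffix / (tau * qlen))
    result.reverse()
    return result

lams = cutoffs([1.0, 2.0, 1.5], tau=0.8)
print(lams)  # strictly decreasing; the first value equals len(q) / tau
```

The first cutoff coincides with the Length Boundedness upper bound len(q) / tau, and each later list admits only shorter new candidates.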
![Page 13: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/13.jpg)
Shortest-First
Sort q={t1, …, tn} in decreasing idf order
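Let C = ∅ be the candidate set
For 1 <= i <= n
• Skip to first entry with $\mathrm{len}(s) \ge \tau\,\mathrm{len}(q)$
• Compute $\lambda_i$
• Let $\lambda_i = \min(\lambda_i, \mathrm{len}(q) / \tau)$
• Repeat
– s = pop next element from ti
– Maintain lower/upper bounds of entries in C
• Until $\mathrm{len}(s) > \max(\max_{c \in C} \mathrm{len}(c), \lambda_i)$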
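The loop above can be turned into a small runnable sketch. It is a simplified reading of the slide rather than the authors' implementation: each inverted list is assumed to hold (length, set id) pairs in increasing length order, the skip step is a linear scan instead of an index seek, and the names `shortest_first`, `tau` and `lam` are hypothetical.

```python
import math

def shortest_first(query_idf, index, tau):
    """query_idf: {token: idf}. index: {token: [(length, set_id), ...]} sorted by
    increasing length. Returns the ids of all sets with I(q, s) >= tau."""
    order = sorted(query_idf, key=query_idf.get, reverse=True)  # decreasing idf
    sq = [query_idf[t] ** 2 for t in order]
    qlen = math.sqrt(sum(sq))
    cand = {}  # set_id -> [length, matched squared-idf mass so far]
    for i, t in enumerate(order):
        remaining = sum(sq[i + 1:])                      # mass later lists can still add
        lam = min(sum(sq[i:]) / (tau * qlen), qlen / tau)
        limit = max([lam] + [ln for ln, _ in cand.values()])
        for ln, sid in index.get(t, []):
            if ln < tau * qlen:
                continue          # a real index would seek here instead of scanning
            if ln > limit:
                break             # past the cutoff and past every candidate
            if sid in cand:
                cand[sid][1] += sq[i]
            elif ln <= lam:
                cand[sid] = [ln, sq[i]]
        # keep only candidates whose best possible score still reaches tau
        cand = {sid: v for sid, v in cand.items()
                if (v[1] + remaining) / (v[0] * qlen) >= tau}
    return sorted(cand)

# Invented idf values; lengths follow from len(s) = sqrt(sum of idf^2).
idf = {"Main": 1.0, "St.": 1.5, "Maine": 2.0}
l0, l1, l2 = math.sqrt(7.25), math.sqrt(3.25), math.sqrt(5.0)
index = {
    "Maine": [(l0, 0)],
    "St.":   [(l1, 1), (l0, 0)],
    "Main":  [(l1, 1), (l2, 2), (l0, 0)],
}
print(shortest_first(idf, index, tau=0.6))
print(shortest_first(idf, index, tau=0.9))
```

After the last list nothing remains to be added, so the final pruning step leaves exactly the sets whose score reaches the threshold. Lists are processed one at a time to completion, which is what makes the strategy depth-first.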
![Page 14: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/14.jpg)
Comparison with NRA
Lemma: Let q = {t1, …, tn} and let d be the maximum depth to which SF descends over all lists. In the worst case, iNRA will read (d – 1)(n – 1) more elements than SF
But surprisingly
![Page 15: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/15.jpg)
A hybrid strategy
Run iNRA normally
Use the cutoff $\lambda_i$ and the maximum length in C to stop reading from a particular list
• This guarantees that iNRA stops with or before SF
Drawback of NRA variants:
• Very high bookkeeping cost compared to SF
![Page 16: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/16.jpg)
Experiments
DBLP, IMDB and YellowPages datasets
Actors, movies, authors, businesses etc.
Vary threshold, query size, query strings and mistakes
Test wall-clock time, pruning power
Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, Improved SQL based
![Page 17: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/17.jpg)
Wall-clock time vs. Threshold
![Page 18: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/18.jpg)
Wall-clock time vs. Query size
Plot legend: TA, NRA, Sort-by-id, iTA, SF
![Page 19: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/19.jpg)
Space
![Page 20: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/20.jpg)
Conclusion
Proposed a simplified TF/IDF measure
Identified strong monotonicity properties
Used the properties to design efficient algorithms
SF works best overall in practice
• Achieves sub-second answers in most practical cases
![Page 21: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/21.jpg)
Q&A
![Page 22: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/22.jpg)
Pruning power vs. Threshold
![Page 23: Fast Indexes and Algorithms For Set Similarity Selection Queries](https://reader034.vdocuments.us/reader034/viewer/2022051402/568157f3550346895dc57133/html5/thumbnails/23.jpg)
Pruning power vs. Query size
Plot legend: NRA, TA, iTA