efficient top-k algorithms for fuzzy search in string collections

56
Efficient Top-k Algorithms for Fuzzy Search in String Collections Rares Vernica Chen Li Department of Computer Science University of California, Irvine First International Workshop on Keyword Search on Structured Data, 2009 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 1 / 17

Upload: rvernica

Post on 19-Jun-2015

1.702 views

Category:

Education


0 download

DESCRIPTION

An approximate search query on a collection of strings finds those strings in the collection that are similar to a given query string, where similarity is defined using a given similarity function such as Jaccard, cosine, and edit distance. Answering approximate queries efficiently is important in many applications such as search engines, data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. In this paper, we study the problem of efficiently computing the best answers to an approximate string query, where the quality of a string is based on both its importance and its similarity to the query string. We first develop a progressive algorithm that answers a ranking query by using the results of several approximate range queries, leveraging existing search techniques. We then develop efficient algorithms for answering ranking queries using indexing structures of gram-based inverted lists. We answer a ranking query by traversing the inverted lists, pruning and skipping irrelevant string ids, iteratively increasing the pruning and skipping power, and doing early termination. We have conducted extensive experiments on real datasets to evaluate the proposed algorithms and report our findings.

TRANSCRIPT

  • 1. Efcient Top-k Algorithms for Fuzzy Search in String Collections Rares Vernica Chen Li Department of Computer ScienceUniversity of California, IrvineFirst International Workshop on Keyword Search on StructuredData, 2009 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 1 / 17

2. Outline1 Motivation2 Efcient Top-k AlgorithmsProblem FormulationAlgorithms OverviewTop-k Single-Pass Search Algorithm3 Experimental EvaluationRares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 2 / 17 3. Need for Approximate String Queries ID FirstNameLastName # Movies 10 Al Swartzberg1 11 HannaWartenegg 1 12 RikSwartzwelder 30 13 Joey Swartzentruber1 14 Rene Swartenbroekx 4 15 Arnold Schwarzenegger 283 16 LucSwartenbroeckx1.. .. ... . .Figure: Actor names and popularitiesRares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17 4. Need for Approximate String Queries ID FirstNameLastName # Movies 10 Al Swartzberg1 11 HannaWartenegg 1 12 RikSwartzwelder 30 13 Joey Swartzentruber1 14 Rene Swartenbroekx 4 15 Arnold Schwarzenegger 283 16 LucSwartenbroeckx1.. .. ... . .Figure: Actor names and popularities SELECT * FROM Actors WHERE LastName = Shwartzenetrugger ORDER BY # Movies DESC LIMIT 1;Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17 5. Need for Approximate String Queries ID FirstNameLastName # Movies 10 Al Swartzberg1 11 HannaWartenegg 1 12 RikSwartzwelder 30 13 Joey Swartzentruber1 14 Rene Swartenbroekx 4 15 Arnold Schwarzenegger 283 16 LucSwartenbroeckx1.. .. ... . .Figure: Actor names and popularities SELECT * FROM Actors WHERE LastName = Shwartzenetrugger ORDER BY # Movies DESC LIMIT 1; 0 ResultsRares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17 6. Need for Ranking ID FirstNameLastName# Movies ED 10 Al Swartzberg 18 11 HannaWartenegg18 12 RikSwartzwelder308 13 Joey Swartzentruber 14 14 Rene Swartenbroekx49 15 Arnold Schwarzenegger283 5 16 LucSwartenbroeckx 19.. ... . . .. .. .Figure: Actor names, popularities, and edit distances to a query stringShwartzenetrugger. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17 7. Need for Ranking ID FirstNameLastName# Movies ED 10 Al Swartzberg 18 11 HannaWartenegg18 12 RikSwartzwelder308 13 Joey Swartzentruber 14 14 Rene Swartenbroekx49 15 Arnold Schwarzenegger283 5 16 LucSwartenbroeckx 19.. ... . . .. .. .Figure: Actor names, popularities, and edit distances to a query stringShwartzenetrugger. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17 8. Need for Ranking ID FirstNameLastName# Movies ED 10 Al Swartzberg 18 11 HannaWartenegg18 12 RikSwartzwelder308 13 Joey Swartzentruber 14 14 Rene Swartenbroekx49 15 Arnold Schwarzenegger283 5 16 LucSwartenbroeckx 19.. ... . . .. .. .Figure: Actor names, popularities, and edit distances to a query stringShwartzenetrugger. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17 9. Need for Ranking ID FirstNameLastName# Movies ED 10 Al Swartzberg 18 11 HannaWartenegg18 12 RikSwartzwelder308 13 Joey Swartzentruber 14 14 Rene Swartenbroekx49 15 Arnold Schwarzenegger283 5 16 LucSwartenbroeckx 19.. ... . . .. .. .Figure: Actor names, popularities, and edit distances to a query stringShwartzenetrugger. Which one result should the system return? Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17 10. Need for Ranking ID FirstNameLastName# Movies ED 10 Al Swartzberg 18 11 HannaWartenegg18 12 RikSwartzwelder308 13 Joey Swartzentruber 14 14 Rene Swartenbroekx49 15 Arnold Schwarzenegger283 5 16 LucSwartenbroeckx 19.. ... . . .. .. .Figure: Actor names, popularities, and edit distances to a query stringShwartzenetrugger. Which one result should the system return? Which value is more important, # Movies or similarity? Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17 11. Top-k Similar StringsGiven:Weighted string collectione.g., actors LastName and # MoviesQuery stringe.g., ShwartzenetruggerSimilarity functione.g, Edit DistanceScoring function (score of a string in terms of similarity and weight)e.g., linear combination of similarity and popularityInteger kReturn: k best strings in terms of overall score to the query string.Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 5 / 17 12. Top-k Similar StringsGiven:Weighted string collectione.g., actors LastName and # MoviesQuery stringe.g., ShwartzenetruggerSimilarity functione.g, Edit DistanceScoring function (score of a string in terms of similarity and weight)e.g., linear combination of similarity and popularityInteger kReturn: k best strings in terms of overall score to the query string.Advantages over Range Search:specify k instead of a similarity thresholdguaranteed k results; a range search might have 0 resultsRares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 5 / 17 13. Algorithms Overview Range SearchRangeSearchIterative RangeSingle-PassSearch Search Two-Phase SearchRares Vernica (UC Irvine) Top-k Fuzzy String SearchKEYS 2009 6 / 17 14. q-grams: Overlapping substrings of xed lengthFind similar strings: e.g., Vernica and Veronicaq-gram: substring of length q of a string: e.g., q = 2Rares Vernica (UC Irvine) Top-k Fuzzy String SearchKEYS 2009 7 / 17 15. q-grams: Overlapping substrings of xed lengthFind similar strings: e.g., Vernica and Veronicaq-gram: substring of length q of a string: e.g., q = 2VernicaRares Vernica (UC Irvine) Top-k Fuzzy String SearchKEYS 2009 7 / 17 16. q-grams: Overlapping substrings of xed lengthFind similar strings: e.g., Vernica and Veronicaq-gram: substring of length q of a string: e.g., q = 2Vernica {VeRares Vernica (UC Irvine) Top-k Fuzzy String SearchKEYS 2009 7 / 17 17. q-grams: Overlapping substrings of xed lengthFind similar strings: e.g., Vernica and Veronicaq-gram: substring of length q of a string: e.g., q = 2Vernica {Ve,erRares Vernica (UC Irvine) Top-k Fuzzy String SearchKEYS 2009 7 / 17 18. q-grams: Overlapping substrings of xed lengthFind similar strings: e.g., Vernica and Veronicaq-gram: substring of length q of a string: e.g., q = 2Vernica {Ve,er,rnRares Vernica (UC Irvine) Top-k Fuzzy String SearchKEYS 2009 7 / 17 19. q-grams: Overlapping substrings of xed lengthFind similar strings: e.g., Vernica and Veronicaq-gram: substring of length q of a string: e.g., q = 2Vernica {Ve,er,rn,niRares Vernica (UC Irvine) Top-k Fuzzy String SearchKEYS 2009 7 / 17 20. q-grams: Overlapping substrings of xed lengthFind similar strings: e.g., Vernica and Veronicaq-gram: substring of length q of a string: e.g., q = 2Vernica {Ve,er,rn,ni,icRares Vernica (UC Irvine) Top-k Fuzzy String SearchKEYS 2009 7 / 17 21. q-grams: Overlapping substrings of xed lengthFind similar strings: e.g., Vernica and Veronicaq-gram: substring of length q of a string: e.g., q = 2Vernica {Ve,er,rn,ni,ic,ca}Rares Vernica (UC Irvine) Top-k Fuzzy String SearchKEYS 2009 7 / 17 22. q-grams: Overlapping substrings of xed lengthFind similar strings: e.g., Vernica and Veronicaq-gram: substring of length q of a string: e.g., q = 2Vernica {Ve,er,rn,ni,ic,ca}Veronica {Ve,er,ro,on,ni,ic,ca}Rares Vernica (UC Irvine) Top-k Fuzzy String SearchKEYS 2009 7 / 17 23. q-grams: Overlapping substrings of xed lengthFind similar strings: e.g., Vernica and Veronicaq-gram: substring of length q of a string: e.g., q = 2Vernica {Ve,er,rn,ni,ic,ca}Veronica {Ve,er,ro,on,ni,ic,ca}Similar strings share a certain number of gramsRares Vernica (UC Irvine) Top-k Fuzzy String SearchKEYS 2009 7 / 17 24. q-gram Inverted List Indexq=2 IDString Weight1ab0.802ccd 0.703cd0.604abcd0.505bcc 0.40 Figure: DatasetRares Vernica (UC Irvine)Top-k Fuzzy String Search KEYS 2009 8 / 17 25. q-gram Inverted List Indexq=2 IDString Weight1ab0.80 abcccdbc23 ccd cd 0.70 0.60 1 425 2 3 4 54abcd0.5045bcc 0.40Figure: Gram inverted-list index Figure: DatasetRares Vernica (UC Irvine)Top-k Fuzzy String Search KEYS 2009 8 / 17 26. q-gram Inverted List Indexq=2 IDString Weight1ab0.80 abcccdbc2ccd 0.70122 43cd0.60453 54abcd0.5045bcc 0.40Figure: Gram inverted-list index Figure: DatasetQuery string: bcdRares Vernica (UC Irvine)Top-k Fuzzy String Search KEYS 2009 8 / 17 27. q-gram Inverted List Indexq=2 IDString Weight1ab0.80 abcccdbc2ccd 0.70122 43cd0.60453 54abcd0.5045bcc 0.40Figure: Gram inverted-list index Figure: DatasetQuery string: bcdRares Vernica (UC Irvine)Top-k Fuzzy String Search KEYS 2009 8 / 17 28. q-gram Inverted List Indexq=2 IDString Weight1ab0.80 abcccdbc23 ccd cd 0.70 0.60 1 425 2 3 4 54abcd0.5045bcc 0.40Figure: Gram inverted-list index Figure: DatasetQuery string: bcdIdentied strings are veried by computing the real similarity.Verication is usually an expensive process.Rares Vernica (UC Irvine)Top-k Fuzzy String Search KEYS 2009 8 / 17 29. Top-k Single-pass Search Algorithm IDStringWeight1ab 0.802ccd0.703cd 0.604abcd 0.505bcc0.40 Figure: Datasetab cc cdbc 1 2 24 4 5 35 4Figure: Gram inverted-listindexRares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17 30. Top-k Single-pass Search Algorithm IDStringWeight1ab 0.802ccd0.70Setup 3cd 0.60 Assign IDs s.t.4abcd 0.50 ascending order of IDs 5bcc0.40 descending order of weights Figure: Datasetab cc cdbc 122 4 453 5 4Figure: Gram inverted-listindexRares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17 31. Top-k Single-pass Search Algorithm IDStringWeight1ab 0.802ccd0.70Setup 3cd 0.60 Assign IDs s.t.4abcd 0.50 ascending order of IDs 5bcc0.40 descending order of weights Figure: Dataset Sort the IDs on each list in ascending orderab cc cdbc 122 4 453 5 4Figure: Gram inverted-listindexRares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17 32. Top-k Single-pass Search Algorithm IDStringWeight1ab 0.802ccd0.70Setup 3cd 0.60 Assign IDs s.t.4abcd 0.50 ascending order of IDs 5bcc0.40 descending order of weights Figure: Dataset Sort the IDs on each list in ascending order Scan the lists corresponding to theab cc cdbc grams in the query. 122 4 e.g., bcd 453 5 4Figure: Gram inverted-listindexRares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17 33. Top-k Single-pass Search AlgorithmIDStringWeight 1ab 0.80 2ccd0.70 3cd 0.60Nave approach: Round-Robin 4abcd 0.50Scan all the lists in the same time5bcc0.40Maintain a list of open IDs (mightFigure: Datasetstill appear on some of the lists)Store the best k closed IDs in atop-k buffer ab cc cdbcStop when the top-k buffer cannot 122 4improve 453 54 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17 34. Top-k Single-pass Search Algorithm ab bc cd122Heap-based 2034 Traverse the lists in a sorted2145 order using a heap on the top . .... . IDs of the lists...19 19 Advantages:20 20 1 No need to maintain the list.. of open IDs . ... 2 Skip elements Figure: Gram inverted-lists for query abcdRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 35. Top-k Single-pass Search Algorithm ab bc cd 122Heap-based 2034 Traverse the lists in a sorted2145 order using a heap on the top . .... . IDs of the lists...19 19 Advantages:120 20 1 No need to maintain the list.. of open IDs . ... 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- abcd heapRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 36. Top-k Single-pass Search Algorithm abbccd 12 2Heap-based 20 3 4 Traverse the lists in a sorted21 4 5 order using a heap on the top . . . . . . IDs of the lists. . .19 19 Advantages:120 20 1 No need to maintain the list.. 2 of open IDs . ... 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- abcd heapRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 37. Top-k Single-pass Search Algorithm abbc cd 12 2Heap-based 20 34 Traverse the lists in a sorted21 45 order using a heap on the top . . . ... IDs of the lists. ..19 19 Advantages:120 20 1 No need to maintain the list.. 2 of open IDs . ... 2 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- abcd heapRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 38. Top-k Single-pass Search Algorithm abbc cd 12 2Heap-based 20 34 Traverse the lists in a sorted21 45 order using a heap on the top . . . ... IDs of the lists. ..19 19 Advantages:120 20 1 No need to maintain the list.. 2 of open IDs . ... 2 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- abcd heapRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 39. Top-k Single-pass Search Algorithm1 abbc cdFigure: 12 2Top-kHeap-based 20 34buffer, Traverse the lists in a sorted21 45k =1 order using a heap on the top . . . ... IDs of the lists. ..19 19 Advantages:220 20 1 No need to maintain the .. 2 list of open IDs. ... 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- abcd heapRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 40. Top-k Single-pass Search Algorithm1ab bc cdFigure: 1 2 2Top-kHeap-based 2034buffer, Traverse the lists in a sorted 2145k =1 order using a heap on the top... ... IDs of the lists ...19 19 Advantages: 220 20 1 No need to maintain the list..2 of open IDs . ... 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- abcd heapRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 41. Top-k Single-pass Search Algorithm1ab bc cdFigure: 1 2 2Top-kHeap-based 2034buffer, Traverse the lists in a sorted 2145k =1 order using a heap on the top... ... IDs of the lists ...19 19 Advantages: 220 20 1 No need to maintain the list..2 of open IDs . ... 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- abcd heapRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 42. Top-k Single-pass Search Algorithm2ab bc cdFigure: 1 2 2Top-kHeap-based 2034buffer, Traverse the lists in a sorted 2145k =1 order using a heap on the top... ... IDs of the lists ...19 19 Advantages:2020 20 1 No need to maintain the list.. of open IDs . ... 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- abcd heapRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 43. Top-k Single-pass Search Algorithm2ab bc cdFigure: 122Top-kHeap-based 20 3 4buffer, Traverse the lists in a sorted 2145k =1 order using a heap on the top... ... IDs of the lists ...19 19 Advantages: 320 20 1 No need to maintain the list..4 of open IDs . ... 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- abcd heapRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 44. Top-k Single-pass Search Algorithm2ab bc cdFigure: 122Top-kHeap-based 20 3 4buffer, Traverse the lists in a sorted 2145k =1 order using a heap on the top... ... IDs of the lists ...19 19 Advantages: 320 20 1 No need to maintain the list..4 of open IDs . ... 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- abcd heapRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 45. Top-k Single-pass Search Algorithm2ab bc cdFigure: 122Top-kHeap-based 20 3 4buffer, Traverse the lists in a sorted 2145k =1 order using a heap on the top... ... IDs of the lists ...19 19 Advantages: 420 20 1 No need to maintain the list.. 20 of open IDs . ... 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- abcd heapRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 46. Top-k Single-pass Search Algorithm2ab bc cdFigure: 122Top-kHeap-based 20 3 4buffer, Traverse the lists in a sorted 2145k =1 order using a heap on the top... ... IDs of the lists ...19 19 Advantages: 420 20 1 No need to maintain the list.. 20 of open IDs . ... 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- abcd heapRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 47. Top-k Single-pass Search Algorithm2ab bc cdFigure: 122Top-kHeap-based 20 3 4buffer, Traverse the lists in a sorted 2145k =1 order using a heap on the top... ... IDs of the lists ...19 19 Advantages:2020 20 1 No need to maintain the list.. of open IDs . ... 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- abcd heapRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 48. Top-k Single-pass Search Algorithm2abbc cdFigure: 1 22 Top-kHeap-based 20 34 buffer, Traverse the lists in a sorted 21 45 k =1 order using a heap on the top.. ... . IDs of the lists . ..19 19 Advantages:20 2020 1 No need to maintain the list.. of open IDs . ... 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- abcd heapRares Vernica (UC Irvine)Top-k Fuzzy String SearchKEYS 2009 10 / 17 49. Algorithms Overview Range SearchRangeSearchIterative RangeSingle-PassSearch Search Two-Phase SearchRares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 11 / 17 50. Experimental Setting Datasets:IMDB Actor Names1actor names and the number of movies they played in1.2 million actors, average name length 15weight is the number of movies (log normalized)WEB Corpus Word Grams2sequences of English words and their frequency on the Web2.4 million sequences, average sequence length 20weight is the frequency (log normalized) Jaccard similarity and normalized edit similarity, q = 3 Index and data are stored in main memory at all times Implemented in C++ (GNU compiler) on Ubuntu Linux OS Intel 2.40GHz PC, 2GB RAM1http://www.imdb.com/interfaces2http://www.ldc.upenn.edu/Catalog LDC2006T13Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 12 / 17 51. Benets of Skipping Elements 50SPS 40 SPS*Average running time for Time (ms) 30 top-10 queries. IMDBdataset with Jaccardsimilarity. Single-Pass 20Search (SPS) algorithmand SPS without 10 skipping (SPS*). 00.2 0.4 0.6 0.81.0 1.2Dataset Size (millions)Rares Vernica (UC Irvine) Top-k Fuzzy String SearchKEYS 2009 13 / 17 52. Potential of the Two-Phase Algorithm 40 30 Running time for 3 Time (ms)top-10 queries. WEBCorpus dataset with 20 normalized editsimilarity. Two-Phasealgorithm with different 10initial thresholds. 0Q1 Q2Q3QueriesRares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 14 / 17 53. Optimum Initial Threshold for the Two-Phase Algorithm 100 Average running time for top-10 queries. Web80 Corpus dataset with normalized edit Time (ms)60 similarity. Single-Pass Search (SPS) algorithm, Two-Phase (2PH)40 algorithm, 2PH with the SPS optimum initial threshold20 2PH (2PH Opt). 2PH Opt The Iterative Range Search0algorithm was to expensive to be plotted. The average running time 0.40.8 1.2 1.6 2.02.4 was around 5s. Dataset Size (millions)Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 200915 / 17 54. Summary Approximate ranking queries in string collections Useful when mismatch between query and data Propose three approaches to solve the problem:1Use existing approximate range search algorithms as a black box Proves to be the most expensive2Use particularities of the top-k problem Proves to be very efcient3Combine (1) and (2) sequentially Proves to be slightly more efcient Rares Vernica (UC Irvine) Top-k Fuzzy String SearchKEYS 2009 16 / 17 55. The Flamingo ProjectThis work is part of The Flamingo Project at UC Irvinehttp://flamingo.ics.uci.eduRares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 17 / 17 56. A Quick Note on Related WorkFagin et. al [1] Our Setting similarity on multiple similarity on one string numerical attributes attribute traverse list of IDs traverse list of IDs one list per attribute one list per q-gram lists sorted on similarity tolists sorted on global weight that attribute lists have different orders of lists have the same order of IDsIDs all IDs appear on all the listsa subset of IDs appear oneach list[1] R.Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware.In PODS, 2001 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 17 / 17