efï¬cient top-k algorithms for approximate substring matching

12
Efficient Top-k Algorithms for Approximate Substring Matching Younghoon Kim Seoul National University Seoul, Korea [email protected] Kyuseok Shim Seoul National University Seoul, Korea [email protected] ABSTRACT There is a wide range of applications that require to query a large database of texts to search for similar strings or substrings. Traditional approximate substring matching re- quests a user to specify a similarity threshold. Without top- k approximate substring matching, users have to try repeat- edly different maximum distance threshold values when the proper threshold is unknown in advance. In our paper, we first propose the efficient algorithms for finding the top-k approximate substring matches with a given query string in a set of data strings. To reduce the number of expensive distance computations, the proposed algorithms utilize our novel filtering techniques which take advantages of q-grams and inverted q-gram indexes avail- able. We conduct extensive experiments with real-life data sets. Our experimental results confirm the effectiveness and scalability of our proposed algorithms. Categories and Subject Descriptors H.2 [Database Management]: Systems—query process- ing, textual databases Keywords Top-k approximate substring matching; edit distance; in- verted q-gram index 1. INTRODUCTION There is a wide range of applications that require to query a large database of texts to search for similar strings or sub- strings. A list of possible applications includes substring matching or short snippet suggestions for web search [23], named entity recognition [27], bio-informatics for finding DNA subsequences [20], music data retrieval [18] and spell checking [26]. These applications generally require high per- formance for real-time query processing to find similar sub- strings from text databases. While these applications are not all new, as there is an increasing trend of applications Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD’13, June 22–27, 2013, New York, New York, USA. Copyright 2013 ACM 978-1-4503-2037-5/13/06 ...$15.00. to deal with vast amounts of data, retrieving similar strings becomes a more challenging problem today. To handle approximate substring matching, various string (dis)similarity measures, such as edit distance, hamming dis- tance, Jaccard coefficient, cosine similarity and Jaro-Winkler distance have been considered [2, 10, 11, 19]. Among them, the edit distance is one of the most widely accepted distance measures for database applications where domain specific knowledge is not really available [10, 12, 20, 27]. Based on the edit distance, the approximate string match- ing problem is to find every string from a string database whose edit distance to the query string is not larger than a given maximum distance threshold. There are diverse appli- cations requiring approximate string matching. For exam- ple, it is useful to check whether the names or addresses pro- vided by users as input are correct by querying databases to search for the strings within a maximum distance threshold [29]. However, for databases with long strings, the contain- ment search of a query string called approximate substring matching is more reasonable. Id String Id String s1 Jackson Pollock s4 Jacksomville s2 Jakob Pollack s5 Jakson Pollack s3 Jason Polock s6 Mackson Polock Figure 1: An example of a string database D Two different problem definitions have been introduced for approximate substring matching. In the first definition, the problem is to discover all substrings of each string in a database, whose edit distances to the query string are within the given maximum threshold [4, 17, 27]. Consider a query string ‘Jackson’ and a database shown in Figure 1. If the maximum edit distance threshold is 2, with ‘Jackson Pollock’ alone, the substrings of ‘Jacks’, ‘Jackso’, ‘Jackson’, ‘ackson’ and ‘ckson’ are all included in the query result. Another popular definition of the approximate substring matching problem is to retrieve every string s whose sub- string edit distance to the query string σ is at most the given maximum threshold, where the substring edit distance be- tween σ and s is the smallest edit distance among the ones between σ and all substrings in s [13, 25]. For example, let us reconsider the string database in Figure 1 and the query string ‘Jackson’. If the maximum threshold is 2, the ap- proximate substring matches are {‘Jackson Pollock’, ‘Jason Polock’, ‘Jacksomville’, ‘Jakson Pollack’, ‘Mackson Polock’}. All previous studies of approximate string or substring matching mentioned so far require each user to provide a maximum distance threshold. However, in many applica- tions, it is very difficult for each user to know a proper 385

Upload: others

Post on 03-Feb-2022

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Efï¬cient Top-k Algorithms for Approximate Substring Matching

Efficient Top-k Algorithms for Approximate SubstringMatching

Younghoon KimSeoul National University

Seoul, [email protected]

Kyuseok ShimSeoul National University

Seoul, [email protected]

ABSTRACT

There is a wide range of applications that require to querya large database of texts to search for similar strings orsubstrings. Traditional approximate substring matching re-quests a user to specify a similarity threshold. Without top-k approximate substring matching, users have to try repeat-edly different maximum distance threshold values when theproper threshold is unknown in advance.In our paper, we first propose the efficient algorithms

for finding the top-k approximate substring matches witha given query string in a set of data strings. To reduce thenumber of expensive distance computations, the proposedalgorithms utilize our novel filtering techniques which takeadvantages of q-grams and inverted q-gram indexes avail-able. We conduct extensive experiments with real-life datasets. Our experimental results confirm the effectiveness andscalability of our proposed algorithms.

Categories and Subject Descriptors

H.2 [Database Management]: Systems—query process-ing, textual databases

Keywords

Top-k approximate substring matching; edit distance; in-verted q-gram index

1. INTRODUCTIONThere is a wide range of applications that require to query

a large database of texts to search for similar strings or sub-strings. A list of possible applications includes substringmatching or short snippet suggestions for web search [23],named entity recognition [27], bio-informatics for findingDNA subsequences [20], music data retrieval [18] and spellchecking [26]. These applications generally require high per-formance for real-time query processing to find similar sub-strings from text databases. While these applications arenot all new, as there is an increasing trend of applications

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.SIGMOD’13, June 22–27, 2013, New York, New York, USA.Copyright 2013 ACM 978-1-4503-2037-5/13/06 ...$15.00.

to deal with vast amounts of data, retrieving similar stringsbecomes a more challenging problem today.

To handle approximate substring matching, various string(dis)similarity measures, such as edit distance, hamming dis-tance, Jaccard coefficient, cosine similarity and Jaro-Winklerdistance have been considered [2, 10, 11, 19]. Among them,the edit distance is one of the most widely accepted distancemeasures for database applications where domain specificknowledge is not really available [10, 12, 20, 27].

Based on the edit distance, the approximate string match-ing problem is to find every string from a string databasewhose edit distance to the query string is not larger than agiven maximum distance threshold. There are diverse appli-cations requiring approximate string matching. For exam-ple, it is useful to check whether the names or addresses pro-vided by users as input are correct by querying databases tosearch for the strings within a maximum distance threshold[29]. However, for databases with long strings, the contain-ment search of a query string called approximate substringmatching is more reasonable.

Id String Id Strings1 Jackson Pollock s4 Jacksomvilles2 Jakob Pollack s5 Jakson Pollacks3 Jason Polock s6 Mackson Polock

Figure 1: An example of a string database D

Two different problem definitions have been introducedfor approximate substring matching. In the first definition,the problem is to discover all substrings of each string in adatabase, whose edit distances to the query string are withinthe given maximum threshold [4, 17, 27]. Consider a querystring ‘Jackson’ and a database shown in Figure 1. If themaximum edit distance threshold is 2, with ‘Jackson Pollock’alone, the substrings of ‘Jacks’, ‘Jackso’, ‘Jackson’, ‘ackson’and ‘ckson’ are all included in the query result.

Another popular definition of the approximate substringmatching problem is to retrieve every string s whose sub-string edit distance to the query string σ is at most the givenmaximum threshold, where the substring edit distance be-tween σ and s is the smallest edit distance among the onesbetween σ and all substrings in s [13, 25]. For example, letus reconsider the string database in Figure 1 and the querystring ‘Jackson’. If the maximum threshold is 2, the ap-proximate substring matches are {‘Jackson Pollock’, ‘JasonPolock’, ‘Jacksomville’, ‘Jakson Pollack’, ‘Mackson Polock’}.

All previous studies of approximate string or substringmatching mentioned so far require each user to provide amaximum distance threshold. However, in many applica-tions, it is very difficult for each user to know a proper

385

Page 2: Efï¬cient Top-k Algorithms for Approximate Substring Matching

threshold in advance since it depends on the applicationscenarios. An appealing alternative method is to computethe most similar k strings or substrings without the need tospecify a maximum distance threshold. Without such top-kapproximate matching, users have to try different distancethreshold values, which may lead to empty results (if thethreshold chosen is too low) or too many results with a longrunning time (if the threshold is too high). It also supportsinteractive applications, where users are presented with top-k most similar strings or substrings progressively.In our paper, we propose the efficient algorithms for the

top-k approximate substring matching problem which findthe top-k strings whose substring edit distances to the querystring are the smallest among all strings in a given database.Contrast to traditional approximate substring matching, top-k approximate substring matching (1) does not require eachuser to provide a maximum distance threshold, and (2) findsthe top-k strings in data whose ‘substring edit distances’ toa given query string are the k smallest values. To the best ofour knowledge, no existing work has addressed the top-k ap-proximate substring matching problem and our algorithmspresented in this paper are the first work for the problem.

Example 1.1.: Consider the string data in Figure 1. Sup-pose that the query string σ is ‘Jackson’ and k = 3. Since s1has the substring which is exactly σ, its substring edit dis-tance to σ is 0. For s4 and s5, the substring edit distances toσ are 1 since s4 and s5 have ‘Jacksom’ and ‘Jakson’ respec-tively. For other strings, the substring edit distances to σ areat least 1. Thus, the top-3 approximate substring matchesare {‘Jackson Pollock’, ‘Jacksomville’, ‘Jakson Pollack’}.

Given a set of strings D and a query string σ, a naivealgorithm of top-k approximate substring matching exam-ines every string s in D and compute the substring editdistance dsub(s, σ) one by one. We refer to the brute-forcealgorithm as TopK-NAIVE. Since computation of substringedit distance dsub(s, σ) is very expensive, we develop moreefficient algorithms by utilizing q-grams in the query stringsand available inverted q-gram indexes adopted widely to findapproximate string matches [9]. This paper makes the fol-lowing contributions:

• We first propose the algorithm TopK-LB, which reducesthe number of substring edit distance computations inTopK-NAIVE, by checking each string s with a lowerbound of dsub(s, σ) first based on dynamic programming.

• We next present the algorithm TopK-SPLIT which allowsnot to calculate the lower bounds of dsub(s, σ) for somestrings by dividing data D into partitions based on q-grams and utilizing our proposed lower bound of dsub(s, σ)between σ and all strings in each partition.

• We also develop the algorithm TopK-INDEX that canspeed up TopK-SPLIT by utilizing the posting lists ofsome q-grams in the query string, which can be obtainedfrom the existing inverted positional q-gram index, when-ever it is available by DBMSs.

• We conducted an extensive performance study of our pro-posed algorithms with real-life data sets. When there isno inverted q-gram index, we found that TopK-SPLITis the best performer and faster than TopK-NAIVE byan order of magnitude. Whenever the inverted indexesare available, TopK-INDEX is the best and much fasterthan TopK-SPLIT. Experimental results also confirm thatour algorithms outperform the extended traditional algo-rithms significantly and scale up well with large data sizes.

2. PRELIMINARIESIn this section, we provide definitions used in our paper

and present a precise statement of the problem for the top-kapproximate substring matching.

2.1 Notations and our Problem DefinitionLet Σ be a finite alphabet of size |Σ|. For a string s of

Σ∗, we denote the length of s with |s|. We use s[i, j] with1≤i≤j≤|s| to denote the substring of s which starts fromposition i and ends at position j. Similarly, s[i] with 1≤i≤|s|denotes the character at the i-th position of s.

Edit distance: For two strings si and sj , si can be trans-formed to sj by applying repeatedly the three operations:insertion, deletion and substitution. The edit distance be-tween si and sj , which is denoted by d(si, sj), is defined asthe minimum number of operations needed to transform sito sj (or sj to si). The dynamic programming algorithmto compute the edit distance between two strings si and sjtakes O(|σ| · |s|) time [14].

Substring edit distance: The substring edit distance froma string s to another string σ, which is represented by dsub(s, σ),is the minimum among the edit distance between σ and ev-ery substring of s. The O(|σ|·|s|)-time algorithm based ondynamic programming to compute the substring edit dis-tance is presented in [24]. (For more details, read [24, 25].)

We next present the definition of our top-k approximatesubstring matching problem.

Definition 2.1.: (Top-k Approximate Substring Match-ing) Given a set of strings D and a query string σ, the top-kapproximate substring matching problem is to find the k ap-proximate strings TOPk(σ) = 〈s1, s2, ..., sk〉 in D which isan ordered sequence of k strings satisfying the following con-ditions:

• dsub(s1, σ) ≤ dsub(s2, σ) ≤ · · · ≤ dsub(sk, σ) holds.• For every sj∈D such that sj /∈ TOPk(σ), we have dsub(sk, σ)

≤ dsub(sj , σ).

Example 2.2.: Consider the strings shown in Figure 1.Suppose that the query string σ is ‘Jackson’ and we are inter-ested in the top-2 approximate substring matches TOP2(σ).Since s1 includes σ as a substring, the substring edit dis-tance dsub(s1, σ) is 0. For the string s4, dsub(s4, σ) is 1because s4 has a substring ‘Jacksom’. For other strings, thesubstring edit distances to σ are at least 1. Thus, we getTOP2(σ)={s1,s4}.

2.2 The q-grams and Inverted q-gram IndexesWe define q-grams and positional q-grams as follows.

Definition 2.3.: The q-grams of a string s are s[i, i +q − 1]s with all 1≤i≤(|s|−q+1). The positional q-grams ofa string s are all pairs of each q-gram in s and its position(i.e., (s[i, i+ q − 1], i) with 1≤i≤(|s|−q+1)).

Example 2.4.: For the string s=‘Jackson’ and q=3, (‘Jac’,1), (‘ack’,2), (‘cks’,3), (‘kso’,4) and (‘son’,5) are the posi-tional 3-grams of the string s.

Given a q-gram and a string containing the q-gram, a pairof its string id and the position where the q-gram occurs inthe string is called a posting. With a set of strings D, theposting list of a q-gram is the list of postings for every oc-currence of the q-gram in all strings in D. Note that everyposting list is primarily sorted in increasing order of string

386

Page 3: Efï¬cient Top-k Algorithms for Approximate Substring Matching

Jac (s1,1), (s4,1)

ack (s1,2), (s4,2), (s6,2)

cks (s1,3), (s4,3), (s6,3)

kso (s1,4), (s4,4), (s5,3),(s6,4)

son (s1,5), (s3,3), (s4,5), (s5,4), (s6,5)

on_ (s1,6), (s3,4), (s5,5), (s6,6)

n_P (s1,7), (s3,5), (s5,6), (s6,7)

_Po (s1,8), (s2,6), (s3,6), (s5,7), (s6,8)

Pol (s1,9), (s2,7), (s3,7), (s5,8), (s6,9)

oll (s1,10), (s2,8), (s5,9)

llo (s1,11)

loc (s1,12), (s3,10), (s6,11)ock (s1,13), (s3,11), (s6,12)Jak (s2,1), (s5,1)

ako (s2,2)

kob (s2,3)

ob_ (s2,4)

b_P (s2,5)

lla (s2,9), (s5,10)

lac (s2,10), (s5,11)

ack (s2,11), (s5,12)Jas (s3,1)

aso (s3,2)omv (s4,6)mvi (s4,7)

vil (s4,8)

ill (s4,9)

lle (s4,10)

aks (s5,2)

kso (s5,3)

q-gram posting list q-gram posting list q-gram posting list

olo (s3,8), (s6,10) Mac (s6,1) som (s4,6)

Figure 2: An inverted index for the database D

ids and secondarily sorted in ascending order of positions.With an inverted q-gram index of D, we can access the post-ing lists with q-grams as keys. In Figure 2, we show theinverted 3-gram index of the strings in Figure 1.

3. THE LOWER BOUNDS OF SUBSTRING

EDIT DISTANCESIn this section, to identify the strings in D which do not

need to compute actual substring edit distances, we presenthow to compute the lower bounds of substring edit dis-tances by utilizing q-gram properties. For correctness andefficiency, we develop our techniques to guarantee no falsedismissals and few false positives, respectively. To achievethese goals, we devise several filtering methods.

3.1 A Lower Bound of dsub(s, σ)We denote the lower bound of the edit distance d(s, σ)

between a string s and a query string σ by ℓo(d(s, σ)). Thefollowing lemma borrowed from [9] allows us to computeℓo(d(s, σ)).

Lemma 3.1.: [9] Given a query string σ and a string s,if s has at most c common q-grams with σ, the edit distancebetween d(s, σ) satisfies the following inequality.

d(s, σ) ≥ ⌈(max(|σ|, |s|)− q + 1− c)/q⌉

In other words, ℓo(d(s, σ)) = ⌈(max(|σ|, |s|)− q+ 1− c)/q⌉.

We now present the lemma below which allows us to com-pute a lower bound of the substring edit distance dsub(s, σ)between a string s and a query string σ. Remember thats[i, j] denotes the substring of s which starts from positioni and ends at position j.

Lemma 3.2.: Consider a query string σ and a string s. Letc be the number of common q-grams between σ and s. Fur-thermore, let ci be the number of common q-grams betweenσ and s[i, i+|σ|-1]. Then, the lower bound of dsub(s, σ), rep-resented by ℓo(dsub(s, σ)), is

ℓo(dsub(s, σ)) ={

⌈(|σ| − q + 1− c)/q⌉, if |s| < |σ|,

min1≤i≤|s|−|σ|+1⌈(|σ| − q + 1− ci)/q⌉, otherwise.(1)

Proof: If |s| < |σ|, σ cannot be a substring of s. Thus, thelower bound of dsub(s, σ) is the same as the lower bound ofd(s, σ) which is actually ⌈(|σ|−q+1−c)/q⌉ by Lemma 3.1.When |s| ≥ |σ|, in order to compute the lower bound

of dsub(s, σ), we can enumerate every substring s′ of s andcompute the lower bound of edit distance between σ andevery substring s′.Consider a substring s[i, i+ℓ−1] of length ℓ. If ℓ is larger

than |σ|, the lower bound of d(s[i,i+ℓ−1], σ), denoted byℓo(d(s[i,i+ℓ−1), σ)), is at least ℓo(d(s[i,i+|σ|−1], σ)). It isbecause the lower bound derived by Lemma 3.1 monoton-ically grows with increasing the length of string s. Since

computing a substring edit distance requires to find the sub-string with the minimum edit distance to σ, we can ignorethe substrings of s whose lengths are larger than |σ|. Thus,ℓo(dsub(s, σ)) is the minimum one among ℓo(d(s[i,i+|σ|−1],σ))s with 1≤i≤|s|−|σ|+1.

To compute ℓo(dsub(s, σ)) based on the above Lemma, ifwe count the common q-grams between σ and every sub-string s[i, i+|σ|−1] with 1≤i≤|s|−|σ|+1, it takes O(|σ| · |s|)time. We now present the algorithm called DYN-LB whichcomputes a tighter lower bound ℓo(dsub(s, σ)) based on dy-namic programming by utilizing the positions of matchingq-grams with O(ℓ2) time, where ℓ represents the numberof matching q-gram pairs between s and σ. Note that wegenerally have ℓ ≪ min(|s|, |σ|).

Applying dynamic programming: We use the followingnotations for a query string σ and a string s in our dynamicprogramming formulation.• Let Xσ=〈(x1, p1),...,(x|Xσ |, p|Xσ|)〉 be the sequence of po-

sitional q-grams from σ, each q-gram of which appears inthe string s, satisfying pi < pi+1 for i=1,...,|Xσ|−1. Sim-ilarly, let Ys=〈(y1, r1),...,(y|Ys|, r|Ys|)〉 be the sequence ofpositional q-grams from s, each q-gram of which occurs inσ, satisfying ri < ri+1 for i=1,. . . ,|Ys|−1.

• For a positional q-gram pair 〈(xi, pi), (yj , rj)〉 with (xi, pi)∈ Xσ, (yj , rj) ∈ Ys and xi=yj , let m[i, j] represent thelower bound of all edit distances d(σ[1, pi+q−1],s′) wheres′ is every substring of s ending at position rj+q−1.

• For two positional q-gram pairs {(xu, pu), (xi, pi)} ⊆ Xσ

and {(yv, rv), (yj , rj)} ⊆ Ys satisfying xu=yv, xi=yj , pu<piand rv<rj , let δ(pu, pi, rv, rj) denote the lower bound ofthe edit distance d(σ[pu, pi+q−1],s[rv, rj+q−1]).

Computing m[i, j]: For a positional q-gram pair 〈(xi, pi),(yj , rj)〉 such that (xi, pi) ∈ Xσ, (yj , rj) ∈ Ys and xi=yj ,depending on whether xi (=yj) is the only common q-grambetween σ[1, pi+q−1] and every substring of s ending at po-sition rj+q−1 or not, we split into two cases.

(1) When xi (=yj) is the only common q-gram betweenσ[1, pi+q−1] and every substring of s ending at positionrj+q−1, the edit distance between them is at least ⌈(pi−1)/q⌉due to to Lemma 3.2.

(2) Otherwise, suppose that 〈(xu, pu), (yv, rv)〉 is the lastpositional q-gram pair satisfying pu<pi, rv<rj and xu=yv,which minimizes m[i, j]. Then, for every such a matching q-gram pair 〈(xu, pu), (yv, rv)〉, m[i, j] (i.e., the lower bound ofthe edit distance between σ[1, pi+q−1] and every substringof s ending at position rj+q−1) is the sum of m[u, v] andδ(pu, pi, rv, rj) (i.e., the lower bound of the edit distanced(σ[pu, pi+q−1],s[rv, rj+q−1])). Since showing the optimalsubstructure of this problem is easy, we omit the proof.

Since we do not know the positional q-gram pair 〈(xu, pu),(yv, rv)〉 satisfying pu<pi, rv<rj and xu=yv between σ[1, pi+q−1] and every substring of s ending at position rj+q−1, weconsider every 〈(xu, pu), (yv, rv)〉 ∈ X×Y as the last match-ing pair followed by 〈(xi, pi), (yj , rj)〉 to compute m[i, j].Thus, the recursive solution to compute m[i, j] is

m[i, j] = min{⌈(pi − 1)/q⌉,

min∀〈(xu,pu),(yv,rv)〉∈X×Y :

(xu=yv)∧(pu<pi)∧(rv<rj)

{m[u, v] + δ(pu, pi, rv, rj)}}. (2)

For every pair 〈(x1, p1), (yj , rj)〉 with x1=yj , there is onlya single matching q-gram between σ[1, p1+q−1] and everysubstring of s ending at position rj+q−1, and thus we setm[1, j] to ⌈(p1−1)/q⌉ according to Lemma 3.2.

387

Page 4: Efï¬cient Top-k Algorithms for Approximate Substring Matching

σ e

s ?

v

?

J c s e

J c s

s

s no

?

σ J a c k e

s ...? ? ?

v

? ? ?

s no

??? ?

1 2 3 4 5 6 7 8 9 10 11 12

J a c k e

? ? s

s

i? ?

(a) σ and s with four common q-grams

i l lJ c k s o n e

J c

v

? ?

σ

s

i l l

? ?? ?? ?? ? ?

(b) Transforming `Jacksonvill' to the substrings of s ending with `ill'

σ J c k e

s J c ?

v

? ? ?

s no

?

J

? c

σ l e

s ?

v

?

s no

?c

s o n ev

? ?

σ

s

i l l

? ?? ? ? ?

J

J ...

J a c k s o n ev

? ?

σ

s

i l l

? ?? ?? ? ?? ??

(c) Computing the lower bound of dsub(s,σ) using m[i,j]s

(m[1,1]=0) + 3 = 3 (m[2,2]=0) + 3 = 3 (m[3,4]=2) + 2 = 4 (m[4,3]=2) + 1 = 3lo(dsub(s,σ))= 4

...

...

Figure 3: Computing the lower bound of dsub(s, σ) with DYN-LB

Computing ℓo(dsub(s, σ)): Now we show how to computeℓo(dsub(s, σ)) using m[i, j]. When every q-gram in σ is mod-ified, the lower bound of dsub(s, σ) is ⌈(|σ|−q+1)/q⌉ due toLemma 3.2. Otherwise, suppose that the last matching q-gram pair is 〈(xi, pi), (yj , rj)〉 with (xi, pi) ∈ Xσ, (yj , rj)∈ Ys and xi=yj . The lower bound of dsub(s, σ) is the sumof m[i, j] and the minimum number of edit operations tomodify every remaining q-gram in σ[pi+1, |σ|] so that thereis no matching q-gram between σ[pi+1, |σ|] and s[rj+1, |s|],which is ⌈(|σ|−pi−q+1)/q⌉ according to Lemma 3.2. Sincewe do not know the actual last matching positional q-grampair, we consider every 〈(xi, pi), (yj , rj)〉 ∈ X × Y such thatxi=yj as the last matching pair to compute ℓo(dsub(s, σ)).Thus, the lower bound of dsub(s, σ) is

min{⌈(|σ| − q + 1)/q⌉,

min1≤i≤|Xσ|,1≤j≤|Ys|

{m[i, j] + ⌈(|σ| − pi − q + 1)/q⌉}. (3)

Computing δ(pu,pi, rv, rj): Given two positional q-grampairs {(xu, pu), (xi, pi)} ⊆ Xσ and {(yv, rv), (yj , rj)} ⊆ Ys

satisfying xu=yv, xi=yj , pu<pi and rv<rj , the transforma-tion from σ[pu, pi+q−1] to s[rv, rj+q−1] by performing atleast an edit operation on every q-gram in σ[pu, pi+q−1]except the two q-grams (xu, pu) and (xi, pi) is the same asconverting σ[pu+1, pi+q−2] to s[rv+1, rj+q−2] by modify-ing every q-gram in σ[pu+1, pi+q−2] without exception.The lower bound of the edit distance for the transforma-

tion performing at least an edit operation on every q-gram inσ[pu+1, pi+q−2] is ⌈(pi−pu−1)/q⌉ (=⌈(((pi+q−2)−(pu+1)+1)−q+1−0)/q⌉) due to Lemma 3.2. Furthermore, the dif-ference between the lengths of two strings is a definite lowerbound of the edit distance [9]. Thus, δ(pu, pi, rv, rj) becomesmax{⌈(pi − pu − 1)/q⌉, |(pi − pu)− (rj − rv)|}.

Time complexity: Let ℓ be the number of matching q-gram pairs 〈(xi, pi), (yj , rj)〉 ∈ X×Y such that xi=yj . Sincethe number of entries m[i, j] is ℓ and it takes O(ℓ) time tocompute each m[i, j] in Equation (2), the time complexityof DYN-LB is O(ℓ2).

Example 3.3.: Suppose we have σ=‘Jacksonville’ of length12 and s=‘Jack Willson’. The matching positional q-gramsets in σ and s are Xσ={(Jac,1), (ack,2), (son,5), (ill,9)}and Ys={(Jac,1), (ack,2), (ill,7), (son,10)} respectively asillustrated in Figure 3(a). With the first matching q-gram‘Jac’ in both Xσ and Ys, m[1, 1] becomes 0. With the sec-ond matching positional q-gram pair (x2, p2)=(ack,2) in Xσ

and (y2, p2)=(ack,2) in Ys, m[2, 2] becomes 0. With the nextmatching q-grams (x4, p4)=(ill,9) in Xσ and (y3, p3)=(ill,7)in Ys, we consider the following transformations, as shownin Figure 3(b):

• The transformation that modifies every q-gram except thelast q-gram (ill,9) in σ, which takes at least 3 edit opera-tions (=⌈(9− 1)/3⌉).

• Modifying σ[1, 4]=‘Jack’ to a substring of s ending with‘ack’ followed by changing ‘acksonvill’ to ‘ack Will’, whichrequires at least 2 edit operations (=m[2, 2]+max( ⌈(9−2−1)/3⌉, |10−8|)).

• Transforming σ[1, 3]=‘Jac’ to s[1, 3]=‘Jac’ and converting‘Jacksonvill’ to ‘Jack Will’, which needs at least 3 editoperations (=m[1, 1]+max(⌈(9−1−1)/3⌉, |11− 9|)).

Now, m[4, 3] becomes 2. Similarly, we obtain m[3, 4]=2. Fi-nally, to compute the lower bound of dsub(s, σ) by Equa-tion (3), we examine the lower bounds by the five transfor-mations in Figure 3(c), such that every matching q-gram inXσ is modified or each q-gram in X is the right most q-gramwhich is not changed. Then, we obtain 3 as the lower boundof dsub(s, σ).

3.2 A Lower Bound of Substring Edit Distancesbetween a Query and a Set of Strings

Previously, we could compute the lower bound of dsub(s, σ)using the common q-grams between a query string σ and thestring s. However, we next show how to compute the lowerbound of dsub(s, σ) using the mismatching q-grams from σto s (i.e., the q-grams occurring in σ but not in s).

Lemma 3.4.: Consider a query string σ and a string s. Ifthe number of mismatching q-grams from σ to s is at leastm, we have dsub(s, σ) ≥ ⌈m/q⌉.

Proof: When a string s2 is obtained by applying a singleedit operation on a string s1, we have at most q mismatch-ing q-grams from s1 to s2. Since at least m q-grams in σ donot appear in s, we need at least ⌈m/q⌉ edit operations toconvert σ to s (i.e., ℓo(d(s, σ)) = ⌈m/q⌉). Furthermore, forevery substring s′ of the string s, since the number of mis-matching q-grams from σ to s′ is at least that of mismatch-ing q-grams from σ to s, we can claim that the number ofmismatching q-grams from σ to s′ is at least m. Thus, wehave d(s′, σ) ≥ ⌈m/q⌉ for every substring s′ of s and we canconclude dsub(s, σ) ≥ ⌈m/q⌉.

Consider a subset G′ ⊆ G where G is the set of all possibleq-grams from σ. Assume that we partition D into D+

G′ and

D−G′ so that D+

G′ is the set of strings in D which share at

least a q-gram in G′ and D−G′ is the rest of strings in D

(i.e., D−G′ = D−D+

G′). Let dsub(D, σ) represent the smallestsubstring edit distance between σ and every string s in D.We denote the lower bound of dsub(D, σ) by ℓo(dsub(D, σ)).

The lower bounds of dsub(D−G′ , σ): We can compute a

lower bound of the substring edit distances between σ andevery string in D−

G′ by the following lemma.

Lemma 3.5.: Assume that we have a set of strings D anda query string σ. For each subset G′ ⊆ G where G is theq-gram set of σ, we have dsub(D

−G′ , σ) ≥ ⌈|G′|/q⌉.

388

Page 5: Efï¬cient Top-k Algorithms for Approximate Substring Matching

Proof: Due to Lemma 3.4, for every string s in D−G′ which

does not include any q-gram in G′, dsub(s, σ) ≥ ⌈|G′|/q⌉holds. Thus, we have dsub(D

−G′ , σ) ≥ ⌈|G′|/q⌉.

While the lower bound of obtained by Lemma 3.5 is ef-fective, it does not consider the positions of the q-grams inσ. In other words, we assumed that every q-gram in G′ mayoverlap with another q-gram in G′ in at least a position ofσ. However, if we choose the q-grams for G′ such that everyq-gram in G′ does not overlap with another q-gram in G′,we can obtain a tighter lower bound of ℓo(dsub(D

−G′ , σ)).

Lemma 3.6.: Let G be the set of all q-grams in a querystring σ. If we select a subset G′ ⊆ G in which every q-gram does not overlap with another q-gram in G′, we havedsub(D

−G′ , σ) ≥ |G′|.

Proof: For each string s in D−G′ , suppose that we perform

edit operations on σ to convert it to a substring of s. Sinceevery q-gram in G′ does not appear in s, every q-gram in G′

should be modified by at least an edit operation. However,every q-gram in G′ does not overlap with another q-gram inG′ and thus a single edit operation cannot affect multipleq-grams. Thus, for every string s in D−

G′ , we need at least|G′| edit operations to transform σ to any substring of s.

Example 3.7.: Suppose we have a query string σ=‘Jackson’and G′={‘Jac’,‘kso’} in which every 3-gram does not over-lap with another q-gram in G′. The strings in D−

G′ are thestrings which do not have even one of the 3-grams in G′.Since each 3-gram in G′ does not overlap with another one,to transform σ to every substring of each string in D−

G′ , weneed at least 2 edit operations (an edit operation on each of‘Jac’ and ‘kso’) in σ. Note that we have dsub(D

−G′ , σ) ≥ 2

by Lemma 3.6.

If we fix the size of G′ ⊆ G, we obtain the tightest lowerbound ℓo(dsub(D

−G′ , σ)) when G′ is selected without any pair

of overlapping q-grams.

Corollary 3.8.: Consider a query string σ and let G bethe set of q-grams in σ. We obtain the tightest lower boundℓo(dsub(D

−G′ , σ)) among every possible G′ ⊆ G when the size

of G′ is ⌊|σ|/q⌋.

Proof: This is because we cannot select a set of q-gramswith a larger size than ⌊|σ|/q⌋ in which every q-gram doesnot overlap with another q-gram in the set.

We now introduce the following lemma to use Lemma 3.6effectively for top-k approximate substring matching.

Lemma 3.9.: Consider a set of strings D, a query stringσ, the q-gram set G of σ and a subset G′ ⊆ G. Assume thatwe partitioned D into D+

G′ and D−G′ . While examining every

string in D+G′ , if there are at least k strings whose substring

edit distances to σ are at most ℓo(dsub(D−G′ , σ)), we can find

the top-k approximate substring matches in D by examiningthe remaining strings in D+

G′ only by ignoring the remaining

strings in D−G′ .

Proof: While examining every string in D+G′ one by one, let

sk denote the string with the k-th smallest substring edit dis-tance dsub(sk, σ) so far. Since dsub(sk, σ)≤ ℓo(dsub(D

−G′ , σ)),

for every string s′ ∈ D−G′ , dsub(s

′, σ) cannot get smaller thandsub(sk, σ) and we can not have any string of the top-k ap-proximate substring matches in D−

G′ .

Function TopK-LB (D, k, σ)begin

1. HTopK = an empty max-heap storing 〈dist, str〉;2. G = {σ[i,i+q−1]|1≤i≤|σ|−q+1};3. for each string s in D do

4. R = {(s[i,i+q−1], i) |∀i s.t.1≤i≤|s|−q+1, s[i,i+q−1]∈G};5. LB = DYN-LB(R, |σ|);6. if |HTopK |<k or LB<HTopK .getMax().dist then

7. d = dsub(s, σ);8. if |HTopK | < k then HTopK .insert(〈d,s〉);9. else if HTopK .getMax().dist > d then

10. HTopK .insert(〈d,s〉);11. HTopK .deleteMax();12. end for

13. return HTopK ;end

Figure 4: The TopK-LB algorithm

4. TOP-K ALGORITHMS FOR APPROXI-

MATE SUBSTRING MATCHINGWe first introduce the brute-force algorithmTopK-NAIVE

which blindly examines every string s in a set of strings Dand computes substring edit distances dsub(s, σ) one by one.Since distance computations are very expensive, to reducethe number of substring edit distances to compute, we pro-pose the algorithm TopK-LB which computes dsub(s, σ) onlywhen the lower bound ℓo(dsub(s, σ)) is smaller than the cur-rent k-th smallest substring edit distance. The lower boundℓo(dsub(s, σ)) is computed by our novel dynamic program-ming algorithm in TopK-LB.

Since TopK-LB utilizes the algorithm based on dynamicprogramming to compute the lower bound ℓo(dsub(s, σ)) withevery string s in D, we next propose the algorithm TopK-SPLIT which can even skip the expensive computation ofℓo(dsub(s, σ)) for some strings s in D. TopK-SPLIT dividesD into partitions for a given q-gram set G′ and utilizes ournew lower bound of dsub(s, σ). We also develop a novel dy-namic programming algorithm to find the best q-gram set G′

to be used at the beginning of TopK-SPLIT. Furthermore,whenever the k-th substring edit distance becomes smaller,TopK-SPLIT adaptively adjusts G′ so that we can skip evenmore strings for computing the lower bounds of dsub(s, σ).

TopK-SPLIT still examines every string in D and thus itsrunning time increases proportionally to the size of D. Tospeed up TopK-SPLIT further, we develop TopK-INDEXthat does not need to check every string in D by utilizingthe inverted q-gram indexes, whenever they are available.

We next present the above algorithms for finding the top-kapproximate substring matches in more details.

TopK-NAIVE: The naive algorithm examines every strings in D and computes the substring edit distance dsub(s, σ).To find the top-k approximate substring matches in D, wemaintain a max-heap HTopK storing the k strings s′ withthe smallest dsub(s

′, σ)s which are used as the keys in themax-heap HTopK . If the size of HTopK is less than k, we justinsert the string s to HTopK . Otherwise, we check whetherdsub(s, σ) is smaller than dsub(sR, σ) of the string sR at theroot of HTopK (i.e., whether dsub(s, σ) is smaller than thek-th smallest substring edit distance so far). If it is satisfied,we delete the string at the root sR of HTopK and insert thestring s to HTopK . If not, we move to the next string andrepeat the above step until we encounter the last string inD. We refer to the naive algorithm as TopK-NAIVE.

TopK-LB: Similar to TopK-NAIVE, we scan all strings inD. However, for each string s in D, we generate the list of

389

Page 6: Efï¬cient Top-k Algorithms for Approximate Substring Matching

common positional q-grams R = 〈(g1, p1), . . . (gn, pn)〉 be-tween σ and s in increasing order of positions where the po-sition pi of each (gi, pi) in R is the position in s at which theq-gram gi appears. Then, we can compute the lower boundℓo(dsub(s, σ)) by invoking DYN-LB presented in Section 3.1.If ℓo(dsub(s, σ)) is not smaller than the k-th smallest sub-string edit distance so far, we do not need to calculate theactual value of dsub(s, σ) since s cannot be a string of thetop-k approximate substring matches. We call the algorithmTopK-LB and present its pseudocode in Figure 4.

Example 4.1.: Consider the set of strings D in Fig-ure 1. Suppose the query string σ=‘Jacksen’ of length 7and we are interested in the top-2 approximate matches.The first two strings s1 and s2, whose substring edit dis-tances to σ are 1 and 4 respectively, are inserted into themax-heap HTopK . For the next string s3=‘Jason Polock’,the lower bound ℓo(dsub(s3, σ)) is 2 by DYN-LB. Since thesecond smallest substring edit distance in HTopK is 4 andℓo(dsub(s3, σ)) is smaller than 4, we have to compute dsub(s3, σ),which turns out to be 3, and insert (s3, 3) into HTopK . Nowwe have {(s1, 1), (s3, 3)} in HTopK and the second smallestsubstring edit distance so far becomes 3.With s4=‘Jacksomville’, ℓo(dsub(s4, σ)) becomes 1. Since

ℓo(dsub(s4, σ)) is smaller than the second smallest substringedit distance in HTopK , we have to compute dsub(s4, σ) whichis 2 and (s4, 2) is added to HTopK . Then, the second small-est substring edit distance so far becomes 2. For s5=‘JaksonPollack’, ℓo(dsub(s5, σ)) is 2 and thus we can skip computingexpensive dsub(s5, σ). Finally, with s6=‘Mackson Polock’,ℓo(dsub(s5, σ)) is 1 and we should compute dsub(s5, σ). Insummary, we calculated the substring edit distances with 5strings out of 6 strings.

TopK-SPLIT: Let G be the set of all q-grams in σ and letG′ be a subset of G such that every q-gram in G′ does notoverlap with another q-gram in the set. We conceptuallydivide the strings in D into D+

G′ and D−G′ where D+

G′ is theset of strings in D each of which has at least a q-gram in G′

and D−G′ is (D−D+

G′). While scanning each string s, we can

check easily whether the string s belongs to D+G′ or D−

G′ bychecking the q-grams in s.Similar to TopK-LB, when examining every string s in D

one by one, TopK-SPLIT generates the list of common posi-tional q-gramsR and skips computing dsub(s, σ) if ℓo(dsub(s, σ))is at least the k-th smallest substring edit distance so far.However, unlike TopK-LB, TopK-SPLIT can even skip com-puting ℓo(dsub(s, σ)) by using Lemma 3.9 for each string asfollows.Let sR denote the string whose substring edit distance

to σ (i.e., dsub(sR, σ)) is the k-th smallest value among thestrings examined so far. Among the unseen strings, we canskip distance computations for the strings whose substringedit distances to σ are at least dsub(sR, σ). Since we splitD into D+

G′ and D−G′ , after dsub(sR, σ) becomes at most

ℓo(dsub(D−G′ , σ)), we can skip every unseen string s belong-

ing to D−G′ . In this case, we have

dsub(sR, σ) ≤ |G′| ≤ dsub(D−G′ , σ) (4)

since the lower bound of dsub(D−G′ , σ) is |G

′| due to Lemma 3.6regardless of the q-grams in G′. However, for the unseenstrings s ∈ D+

G′ , we still have to compute dsub(s, σ).Intuitively, the larger |G′| is, the earlier the condition of

dsub(sR, σ) ≤ |G′| in Inequality (4) can be satisfied. Thus,

Function TopK-SPLIT (D, k, σ)begin

1. HTopK = an empty max-heap storing 〈dist, str〉;2. G = {σ[i,i+q−1]|1≤i≤|σ|−q+1};3. τ = ⌊|σ|/q⌋;4. cond = false;5. for each string s in D do

6. R = {(s[i,i+q−1], i) |∀i s.t.1≤i≤|s|−q+1, s[i,i+q−1]∈G};7. if cond=true and ∃(gi, pi) ∈ R, gi ∈ G′ then part = ‘+’;8. else part = ‘−’;9. if cond=false or part = ‘+’ then10. LB = DYN-LB(R, |σ|);11. if |HTopK |<k or LB<HTopK .getMax().dist then

12. d = dsub(s, σ);13. if |HTopK |<k then HTopK .insert(〈d,s〉);14. else if HTopK .getMax().dist>d then

15. HTopK .insert(〈d,s〉);16. HTopK .deleteMax();17. if HTopK .getMax().dist≤τ then

18. cond = true;19. τ = HTopK .getMax().dist;20. G′ = BEST-G′(τ , G);21. if cond=true and HTopK .getMax().dist<τ then

22. τ = HTopK .getMax().dist;23. G′ = BEST-G′(τ , G);24. end for

25. return HTopK ;end

Figure 5: The TopK-SPLIT algorithm

when |G′| is large, we have a smaller number of strings in Dto examine before dsub(sR, σ) ≤ |G′| holds. However, oncethe inequality is satisfied, with a large |G′|, the number ofthe unseen strings belonging to D+

G′ also becomes generallylarge.

To satisfy dsub(sR, σ) ≤|G′| as soon as possible, we ini-tially set |G′| with ⌊|σ|/q⌋. Note that the lower bound ofdsub(D

−G′ , σ) is |G′| and cannot be larger than ⌊|σ|/q⌋ due

to Corollary 3.8. Furthermore, note that we need G′ to de-termine D+

G′ to prune unseen strings only when dsub(sR, σ)≤|G′| is satisfied, since the inequality holds in the same lo-cation in D regardless of G′ with the same size of G′. Thus,once dsub(sR, σ)≤⌊|σ|/q⌋ holds, while scanning strings in D,we compute G′ of size dsub(sR, σ) and next examine the un-seen strings in D+

G′ only. For a given size ρ of G′, we willdiscuss how to select such a subset G′ minimizing the num-ber of strings in D+

G′ in Section 5. Thus, for the time being,we will assume that such a G′ is provided by the functionBEST-G′(ρ).

Let τ be the k-th substring edit distance when dsub(sR, σ)≤⌊|σ|/q⌋ is first satisfied. While we examine the strings inD, the k-the smallest substring edit distance monotonicallydecreases. When the k-the smallest substring edit distanceis changed from τ to τ ′′, we can still test the unseen stringsin D+

G′ only. However, since any G′′ satisfying |G′′|<|G′|

and dsub(sR, σ) = τ ′′ ≤ |G′′| allows us not to examine D−G′′

and we have |D−G′′ |≤|D−

G′ | generally, we can select such a G′′

and check the unseen strings in D+G′′ only. We can perform

the above step whenever the k-th substring edit distancedecreases.

The pseudocode of TopK-SPLIT is provided in Figure 5.The lower bound of dsub(D

−G′ , σ) (i.e., τ) is set to ⌊|σ|/q⌋

initially which is the largest one according to Corollary 3.8.The variable cond has false initially representing that thek-th smallest substring edit distance so far exceeds τ . Untilwe set cond as true, while scanning each string s in D, ifLB = ℓo(dsub(s, σ)) computed by DYN-LB is smaller thanthe k-th smallest substring edit distance so far, we com-pute the actual value of dsub(s, σ) similar to TopK-LB. How-ever, once the k-th smallest substring edit distance (i.e.,

390

Page 7: Efï¬cient Top-k Algorithms for Approximate Substring Matching

Function TopK-INDEX (D, I, k, σ)begin

1. HTopK = an empty max-heap storing 〈dist, str〉;2. G = {σ[i,i+q−1]|1≤i≤|σ|−q+1};3. τ = |G′|;4. cond = false;5. while D.end() = false do

6. sidD = D.getCurrentId();7. sidI = I.getFrontierId(G);8. s = null;9. if cond = false and sidI > sidD then

10. s = D.getString(sidD);11. R = {(s[i,i+q−1], i) |∀i s.t.1≤i≤|s|−q+1, s[i,i+q−1]∈G};12. if ∃(gi, pi) ∈ R, gi ∈ G′ then part = ‘+’;13. else part = ‘−’;14. else

15. (R, part) = I.getPosQgrams(sidI);16. if cond = false or part = ‘+’ then17. LB = DYN-LB(R, |σ|);18. if |HTopK |<k or LB<HTopK .getMax().dist then

19. if s = null then s = D.getString(sidI);20. d = dsub(s, σ);21. if |HTopK | < k then HTopK .insert(〈d,s〉);22. else if HTopK .getMax().dist > d then

23. HTopK .insert(〈d,s〉);24. HTopK .deleteMax();25. if HTopK .getMax().dist≤τ then

26. cond = true;27. τ = HTopK .getMax().dist;28. G′ = BEST-G′(τ , G);29. if cond=true and HTopK .getMax().dist<τ then

30. τ = HTopK .getMax().dist;31. G′ = BEST-G′(τ , G);32. end while

33. return HTopK ;end

Figure 6: The TopK-INDEX algorithm

HTopK .getMax().dist) becomes at most τ , we set cond totrue and select G′ by calling BEST-G′ in lines 17–21 sothat we can skip the computation of dsub(s, σ) for the un-seen strings s ∈ D−

G′ by Lemma 3.9. Furthermore, wheneverthe k-th smallest substring edit distance decreases, we selectG′ with size of HTopK .getMax().dist again in lines 23–26.

Example 4.2.: Consider the strings D in Figure 1. Sup-pose we want to find the top-2 approximate substring matchesthe query string σ = ’Jacksen’ using 3-grams. The lowerbound of dsub(D

−G′ , σ) (i.e., τ) τ is initially set to 2(=⌊7/3⌋).

First, the strings s1 and s2 are inserted into HTopK whosesubstring edit distances to σ are 1 and 4 respectively. Withs3 and s4, we compute dsub(s3, σ)(=3) and dsub(s4, σ)(=2),and then the second smallest substring edit distance in HTopK

becomes 2.We now have 2 strings whose substring edit distances are

at most τ . After selecting the q-grams set G′ ⊆ G by BEST-G′, we can skip computing dsub(s, σ) with the strings s ∈D−

G′ due to Lemma 3.9. Suppose that the selected 3-gram setG′ is {‘Jac’, ‘kse’}. Then, for the strings s5 and s6, we canignore them because they do not have any common 3-gramwith G′. In summary, we could find the top-2 approximatesubstring matches by calculating the substring edit distanceswith 4 strings only.

TopK-INDEX: When the k-th smallest substring edit dis-tance so far becomes at most |G′| in TopK-SPLIT, we caneven skip computing ℓo(dsub(s, σ)) with the strings s ∈ D−

G′ .However, we still have to read unseen string s in D to de-termine whether s belongs to D−

G′ by checking if s has atleast a q-gram in G′. We will next improve the inefficiencyby retrieving the unseen strings belonging to D+

G′ only withutilizing inverted q-gram indexes when they are available.

While scanning each string inD, our new algorithm TopK-INDEX also scans the posting lists of the q-grams in G to-gether. We assume that the strings in D are sorted withincreasing order of string ids and we are also able to retrievethe strings in D with string ids as keys by random I/O ac-cess. We also assume that the posting list of every q-gramin G is stored with the increasing order of string id first andits position next in the inverted q-gram index. By readingthe posting lists of the q-grams in G only, we can obtain thecommon q-gram list R for the strings which share at leasta q-gram with σ. Whenever we need to compute the actualsubstring distance dsub(s, σ), we actually access the string sstored in D. Since the size of each string in D is generallymuch larger than that of the query string σ, using the post-ing lists of the q-grams in G only speeds up the generationof R and computation of ℓo(dsub(s, σ)) significantly.

After the condition to apply Lemma 3.9 is satisfied (i.e.,dsub(sR, σ)≤ ℓo(dsub(D

−G′ , σ))), instead of sequentially read-

ing the unseen strings in D, we access the remaining stringss ∈ D+

G′ only by random I/O access with utilizing the in-

verted q-gram index to skip the remaining strings s ∈ D−G′ .

Similar to TopK-SPLIT, whenever dsub(sR, σ) decreases, weselect a smaller G′ with |G′| = dsub(sR, σ) to reduce the sizeof D−

G′ .The pseudocode of TopK-INDEX is presented in Figure 6.

The difference from TopK-SPLIT is the lines 6–16 in whichthe common positional q-gram list R is produced by readingthe posting lists of q-grams in G sequentially in increasingorder of string ids.

Note that every posting list in the inverted index is sortedby string ids and positions. To obtain the list R in increasingorder of string ids, we initially set the frontier of each postinglist of a q-gram in G as the location of the first posting in theposting list. Then, we read the smallest id among the post-ings in all frontiers and move the frontier to the next postingin its posting list. Let I denote the given inverted index. Thecall of I.getFrontierId(G) returns the smallest string id sidIamong the ids in all frontiers, and I.getPosQgrams(sidI)returns the common positional q-gram list R between thestring of sidI and the query string σ by reading all postings(gi, pi) from the posting lists of the q-grams in G such thatgi=sidI . The invocation of I.getPosQgrams also informswhether the string belongs to D+

G′ or D−G′ .

If we just follow the posting lists of the q-grams in G onlyto generate the list R, there may exist a string s in D whichmay not have any common q-gram with the query string σand we may miss all of such strings. Thus, while we movea cursor on D together in the increasing order of string ids,if the smallest frontier id sidI is larger than the id sidD ofD which the current cursor indicates, we read the string ofsidD from D, generate R from the actual string in D, andthen perform the computations of lower bound and actualdistance if necessary.

After cond becomes true, we always go to line 15 to readthe posting lists to obtain the common q-gram list R with-out reading strings from D. The reason is that, after wefind at least k strings whose substring edit distances are atmost ℓo(dsub(D

−G′ , σ)), if the query string σ and a string

s in D do not share at least a q-gram, the string s be-longs to D−

G′ and thus we do not need to check such stringsin D by Lemma 3.9. Furthermore, if the string s of sidIshares at least a common q-gram with G′ (i.e., part is ‘+′),we compute ℓo(dsub(s, σ)) in line 17. If the lower bound

391

Page 8: Efï¬cient Top-k Algorithms for Approximate Substring Matching

σi=15

Figure 7: Computing u[i, t] and θ[i, t]

ℓo(dsub(s, σ)) is smaller than the k-th smallest substring editdistance so far, we retrieve the string s from D by randomI/O access to compute dsub(s, σ) as in line 20.

Example 4.3.: Consider the strings D in Figure 1. Sup-pose that the query string σ is ‘Jacksen’ and we want to findthe top-2 approximate substring matches from D using 3-grams. The 3-gram set G of σ is {‘Jac’, ‘ack’, ‘cks’, ‘kse’,‘sen’}. Initially, the lower bound τ of dsub(D

−G′ , σ) is set to

2 (=⌊7/3⌋).At the beginning, the minimum string id among the fron-

tiers of the posting lists of q-rams in G is s1. The currentcursor of D is also at s1. Since we can obtain the commonq-gram list R of s1 from the posting lists, in each postinglist, we read the frontier posting and move the frontier tothe next. We also move the cursor of D to the next stringwithout examining s1. Since the current HTopK is empty,we compute dsub(s1, σ)=1 and insert (s1, 1) into HTopK .Now, the minimum string id among the frontiers is s4.

However, since the current cursor of D indicates s2, we readthe s2 and compute dsub(s2, σ)=4. Similarly, we read s3from D to calculate dsub(s3, σ)=3 and update HTopK . Nowthe second smallest substring edit distance so far becomes 3.We next read the frontier postings to obtain the common

3-gram list R between s4 and σ which is {(‘Jac’,1),(‘ack’,2),(‘cks’,3)}. The current cursor of D moves to the next strings5. After we compute dsub(s4, σ)=2, the second smallest sub-string edit distance so far becomes 2. From now, since wealready have 2 strings whose substring edit distances are atmost τ=2, with G′ of size 2 obtained by calling BEST-G′, wecan find the top-2 approximate substring matches by consid-ering the strings sharing at least a 3-gram in G′ only due toLemma 3.9. Let us assume that BEST-G′ selects G′={‘Jac’,‘kse’}. Then, for the strings s5 and s6, we can ignore thembecause they do not appear in any posting list of ‘Jac’ or‘kse’. In summary, we can find the top-2 approximate sub-string matches by calculating the actual substring edit dis-tances with 4 strings in D only.

5. SELECTING THE BEST G′ OF SIZE ρ

In the previous section, we assumed that G′ ⊆ G of sizeρ is provided by the function BEST-G’ for TopK-SPLITand TopK-INDEX. Now, we present the algorithm BEST-G’ which finds G′ of size ρ such that the number of stringswith at least a q-gram in G′ in D (i.e., the size of D+

G′) isthe smallest. We use the following notations to describe ourdynamic programming formulation to select such a G′.

• Let Lgi be the set of the ids of strings containing the q-gram gi∈G.

• For a q-gram set Q ⊆ G, let L(Q) =⋃

gi∈QLgi , which is

the union of Lgis for all q-grams gi ∈ Q, and we will use|L(Q)| to represent the number of elements in L(Q).

• Let Q(i, t) be the set of all possible subsets Q of q-gramsappearing in σ[1, i] such that (1) |Q|=t, (2) the q-gram

σ[i−q+1, i] (i.e., the q-gram ending at the position i) al-ways appears in Q and (3) all q-grams in Q does not over-lap to each other.

• Let θ[i, t] denote the q-gram set Q in Q(i, t) such that|L(Q)| is the minimum among all q-gram sets in Q(i, t).

• Let u[i, t] represent |L(θ[i, t])|.

Given the size ρ of G′, we select Q with the minimum|L(Q)| as G′ among all subsets Q ⊆ G consisting of ρ non-overlapping q-grams in σ. In every possible such Q, theending position of the rightmost q-gram σ[i−q+1, i] can belocated in the positions i of σ with ρ·q≤i≤|σ|. Thus, forevery substring σ[1, i] with ρ·q≤i≤|σ|, we will enumerate thebest ρ non-overlapping q-gram set Q which not only containsσ[i−q+1, i] but also has the minimum |L(Q)|. The reasonwhy we do not consider every i which is smaller than (ρ·q) isbecause the substring σ[1, i] with i < ρ·q cannot have ρ non-overlapping q-grams. Since we use θ[i, ρ] to store the bestQ with ρ non-overlapping q-grams which not only containsσ[i−q+1, i] but also has the minimum |L(Q)|, we can findG′ by selecting the θ[i, ρ] with the smallest |L(θ[i, ρ])| amongθ[i, ρ]s with ρ·q≤i≤|σ|. In other words, if we compute θ[i, t]sfor every 1 ≤ t ≤ ⌊|σ|/q⌋ and t · q ≤ i ≤ |σ|, we can selectthe q-gram set θ[i∗, ρ] as G′ where i∗ is chosen as follows:

i∗ = arg mint·q≤i≤|σ|

{u[i, ρ]} = arg mint·q≤i≤|σ|

{|L(θ[i, ρ])|} . (5)

We next present a dynamic programming algorithm tocompute θ[i, t] for t = 1, . . ., ⌊|σ|/q⌋ and i = t · q, . . ., |σ|.

Dynamic programming formulation: Since θ[i, t] con-tains t number of non-overlapping q-grams in the substringσ[1, i] including σ[i−q+1, i], we consider θ[j, t−1]∪{σ[i−q+1, i]}with 1≤j≤(i − q) to compute θ[i, t]. However, θ[j, t−1]does not exist for j=1,. . .,(t−1)·q − 1 because it is impossi-ble to have (t−1) non-overlapping q-grams in the substringσ[1, j]. Thus, we enumerate θ[j, t−1] ∪ {σ[i−q+1, i]} for(t−1) · q≤j≤(i−q) only and select θ[j, t−1] ∪ {σ[i−q+1, i]}with the smallest |L(θ[j, t−1]) ∪ Lσ[i−q+1,i]| as θ[i, t]. Ourrecursive definition for θ[i, t] is

θ[i, t] = θ[j∗, t− 1] ∪ {σ[i− q + 1, i]} (6)

where

j∗ = arg min(t−1)·q≤j≤i−q

{∣

∣Lσ[i−q+1,i] ∪ L(θ[j, t− 1])∣

}

. (7)

In Figure 7, we show an example to illustrate how we com-pute θ[15, 4]. We enumerate |Lσ[13,15] ∪ θ[j, 3]| with 9≤j≤12and select {σ[13, 15]} ∪ θ[11, 15] as θ[15, 4] which minimizes|L(θ[15, 4])|.

If we compute θ[i, t] for every t from 1 to ⌊|σ|/q⌋ andevery i from t·q to |σ| with the above dynamic programmingalgorithm, we can choose G′ with a size ρ by Equation (5).We refer to this algorithm which finds G′ with a given sizeρ as BEST-G′(ρ).

Note that BEST-G′(ρ) cannot find the optimal q-gramset G′ because the optimal substructure property of our dy-namic programming formulation is not satisfied. Suppose{σ[i−q+1, i]} ∪ Q is set to θ[i, t] in Equation (6) where Qis the optimal set with (t − 1) non-overlapping q-grams forσ[1, j∗]. Let Q′ be a non-optimal q-gram set with (t−1) non-overlapping q-grams in σ[1, j∗] including the q-gram endingat the position j∗. Even though Q′ is not an optimal q-gram set for σ[1, j∗], if Lσ[i−q+1,i] and L(Q′) share manycommon string ids so that |Lσ[i−q+1,i] ∪ L(Q′)| becomes

392

Page 9: Efï¬cient Top-k Algorithms for Approximate Substring Matching

- - -

- -

i θ[i,1] u[i,1] θ[i,2] u[i,2]

3 {Jac} 100

4 {ack} 32

5 {cks} 16

6 {kso} 10

7 {son} 120

8 {onv} 40

t=1 t=2

- -

{Jac,kso} 110

{ack,son} 125

{ack,onv} 43

- -Jac 100

ack 32

cks 16

kso 10

son 120

onv 40

g1

g2

Jac ack cks

- - -- - -

110 - -

130 125 -

120 43 50

(a) |Lg1 U Lg2| (=|L({g1,g2})|) (b) θ[i,t] and u[i,t]

-

Figure 8: Computations in BEST-G′(ρ)

smaller than |Lσ[i−q+1,i] ∪ L(Q)|, we have |L(Q′)| > |L(Q)|and thus the computed θ[i, t] is not an optimal q-gram setfor σ[1, i]. Thus, BEST-G′(ρ) finds an approximate q-gramset for G′. However, our performance study confirms thatBEST-G′(ρ) obtains a good G′ close to the optimal sets.When BEST-G′(ρ) is invoked in TopK-SPLIT or TopK-

INDEX, we need to know the actual size of L(θ[i, t]) whichis the union of the sets of string ids whose strings contain aq-gram in θ[i, t]. We will use the MinHash technique [5, 7]to estimate the sizes of unions of sets in BEST-G′(ρ).Assume that for every q-gram gi appearing in D, there is

a MinHash signature of the set of string ids in which the q-gram gi occurs. In BEST-G′(ρ), we maintain the MinHashsignature of L(θ[i, t]) to compute Lσ[i−q+1,i] ∪ L(θ[j, t− 1])in Equation (7) with constant time.

Time complexity: In BEST-G′(ρ), we have to computeθ[i, t] and u[i, t] for every 1≤t≤ ⌊|σ|/q⌋ and every t·q≤i≤|σ|.To compute each θ[i, t], it takes O(|σ|) time. For each u[i, t],it takes constant time since u[i, t] is simply |L(θ[i, t])|. Thus,the time complexity of BEST-G′(ρ) is O(|σ|3/q).

Example 5.1.: Consider the query string σ=‘Jacksonv’with |σ| = 8. Suppose that we should select the non-overlapping3-gram set G′ with size 2. For all possible non-overlapping3-gram sets Q whose sizes are 1 or 2, |L(Q)|s are shown inFigure 8(a). We show all θ[i, t]s and u[i, t]s computed byBEST-G′(ρ) in Figure 8(b).Let us assume that we want to compute θ[8, 2] that always

includes the 3-gram σ[6, 8]=‘onv’. We enumerate the fol-lowing three cases of |Lonv ∪Lcks|(=50), |Lonv ∪Lack|(=43)and |Lonv ∪ LJac|(=120) by Equation (7), and select θ[8, 2]= {‘ack’,‘onv’} with u[8, 2] = 43. Finally, to select the bestG′ with size 2 by Equation (5), we choose θ[8, 2] = {‘ack’,‘onv’} for G′ since u[8, 2] is the minimum among u[6, 2],u[7, 2] and u[8, 2].

6. EXPERIMENTSWe empirically compared the performance of our proposed

algorithms. All experiments reported in this section wereperformed on the machines with Intel Core2 Duo 2.66GHz ofprocessor and 2GB of main memory running Linux operatingsystems. All algorithms were implemented using C++ andcompiled with GCC Compiler of version 4.1.3.

6.1 Implemented AlgorithmsWe implemented the following algorithms for our perfor-

mance study.

• TopK-NAIVE: This represents the brute-force algorithmwhich computes the substring edit distance with everystring inD to find the top-k approximate substring matches.

• TopK-LB: It is the implementation of TopK-LB whichcomputes dsub(s, σ) only for the strings s in D whoselower bound ℓo(dsub(s, σ)), obtained by calling DYN-LB,is smaller than the k-th smallest substring edit distancefound so far.

• TopK-SPLIT: This is the algorithm which improves TopK-LB further by skipping even the computation of ℓo(dsub(s, σ))using the lower bound of substring edit distance betweena query and a set of strings in Lemma 3.9.

• TopK-INDEX: This is the implementation of TopK-INDEX that utilizes the inverted q-gram indexes availableto speed up TopK-SPLIT.

• TopK-NGPP: This is the modified version of NGPP in[27] to obtain the top-k approximate substring matches.The algorithm NGPP is the state-of-the-art algorithm tofind all substrings of each string in D whose edit distancesto a query string are at most a given maximum thresh-old τ . To adapt NGPP to top-k approximate substringmatching, we first read the first k strings in D and set thek-th smallest one among their substring edit distances asthe initial threshold τ . Then, as examining every stringin D, if the k-th smallest substring edit distance becomessmaller than τ , we update the threshold τ to the k-thsmallest substring edit distance.

• TopK-FSS: Similar to TopK-NGPP, we also extendedFSS in [4] to compute the top-k approximate substringmatches in the same manner.

Note that we used our own buffer management for readingthe data strings in disk and accessing inverted indexes in allof our implementations without utilizing OS buffers to seethe buffering effects for the tested algorithms more carefully.We report the native execution times (i.e., wall clock times)of the tested algorithms in this section.

6.2 Data SetsTo study the performance our proposed algorithms, we

utilize the following two real-life data sets.

DBLP: For a short string data set, we used DBLP titles col-lected from http://dblp.uni-trier.de/xml/. However, sincethe original data is small, we increased its size by duplicatingthe original data 5 times. While we duplicate each string, werandomly performed an edit operation such as insert, deleteand substitute on each position of the string with the prob-ability of 0.1. The size of generated data is 635 MB with13,966,030 strings whose average size is 48 bytes.

Wikipedia: This is the data set consisting of 106,185 webpages obtained from http://en.wikipedia.org/wiki/Wikipedia:

Database_download for a long string data set. The data sizeis 1.1 GB and the average size is 11,027 bytes.

6.3 Queries UsedFor DBLP, we randomly selected between 1 and 4 adjacent

words appearing in DBLP titles to generate test queries.The range of query lengths is from 6 to 25 with averagelength 13.2. For Wikipedia, we sampled 50 entities fromCoNLL data, which is a name entity data collected for NameEntity Recognition available for download at http://www.

cnts.ua.ac.be/conll2003/ner/, as test queries. The lengthsof selected queries are from 5 to 27 with average length 12.3.

Similarly, for our experiments with varying lengths of querystrings, we also selected the query strings of lengths from 5to 25 with DBLP titles and CoNLL data respectively. Foreach query length, 50 query strings were sampled.

6.4 Performance ResultsWe measured the performance of the algorithms for both

DBLP and Wikipedia with varying k, the number of strings

393

Page 10: Efï¬cient Top-k Algorithms for Approximate Substring Matching

0.1

1

10

100

1000

10000

100000

1 3 5 7 10 15 20

Exe

cutio

n tim

e (s

ec)

k

TopK-FSSTopK-NGPPTopK-NAIVE

TopK-LBTopK-SPLITTopK-INDEX

(a) Execution times (DBLP)

0

20

40

60

80

100

1 3 5 7 10 15 20

Num

ber

of fi

ltere

d st

rings

(%

)

k

TopK-LB(F1)TopK-SPLIT(F1)TopK-SPLIT(F2)

TopK-SPLIT(F1+F2)

(b) Filtering effects (DBLP)

1K

10K

100K

1M

10M

1 3 5 7 10 15 20

Num

ber

of s

trin

gs c

heck

ed

k

TopK-NAIVE,TopK-LB,TopK-SPLITTopK-INDEX

(c) Checked strings (DBLP)

10

100

1000

10000

1 3 5 7 10 15 20

Exe

cutio

n tim

e (s

ec)

k

TopK-FSSTopK-NGPPTopK-NAIVE

TopK-LBTopK-SPLITTopK-INDEX

(d) Execution times (Wikipedia)

0

20

40

60

80

100

1 3 5 7 10 15 20N

umbe

r of

filte

red

strin

gs (

%)

k

TopK-LB (F1)TopK-SPLIT(F1)TopK-SPLIT(F2)

TopK-SPLIT(F1+F2)

(e) Filtering effects (Wikipedia)

10K

100K

1 3 5 7 10 15 20

Num

ber

of s

trin

gs c

heck

ed

k

TopK-NAIVE,TopK-LB,TopK-SPLITTopK-INDEX

(f) Checked strings (Wikipedia)

Figure 9: Varying k using DBLP and Wikipedia

n and the length of query strings L. Furthermore, we variedthe length of q-grams q, MinHash signature size ℓ and buffersize B used. The default parameters are: k=5, q=3, ℓ=50and B=512MB.

Varying k: We first report the execution times, the numberof filtered strings and the percentage of strings read from Dwith varying k from 1 to 20 in Figure 9(a)–(f).

(1) Execution times: We show the execution times for DBLPand Wikipedia in Figure 9(a) and Figure 9(d) respectively.The log scale was used on the y-axises. In both data, TopK-NGPP and TopK-FSS show even worse performance thanour naive algorithm TopK-NAIVE. The reason is that bothof TopK-NGPP and TopK-FSS have to enumerate all sub-strings of every string in D to compute the substring editdistances. As expected from the experiments in [27], TopK-NGPP performs better than TopK-FSS.As k is increased, the execution times for TopK-LB, TopK-

SPLIT and TopK-INDEX grow gradually. Since the k-thsmallest substring edit distance so far becomes larger withgrowing k, less number of strings are skipped for computingsubstring edit distances.For every range of k, TopK-INDEX shows the best per-

formance. TopK-INDEX is faster than TopK-NAIVE by atleast 49.4 times and 5.5 times for DBLP and Wikipedia re-spectively. Furthermore, TopK-SPLIT is also faster thanTopK-NAIVE by at least 11.6 and 3.5 times for DBLP andWikipedia respectively.

(2) Filtering effects: To show the effectiveness of our filter-ing methods, we plotted the percentage of strings skippedfor computing their substring edit distances in Figure 9(b)and Figure 9(e) with DBLP and Wikipedia respectively.Note that TopK-LB(F1) in the graphs represents the per-centage of strings filtered with TopK-LB using the lowerbounds ℓo(dsub(s, σ)) obtained by DYN-LB. We also useTopK-SPLIT(F1) similarly in the graphs. TopK-SPLIT(F2)in the graphs represents the percentage of strings filtered byTopK-SPLIT using the lower bound of substring edit dis-tances between a query string and a set of strings presentedin Lemma 3.9. Furthermore, TopK-SPLIT(F1+F2) is thetotal ratio of strings filtered with TopK-SPLIT. Since the

filtering effect of TopK-INDEX is exactly the same with thatof TopK-SPLIT, we report the result of TopK-SPLIT only.

With every range of k, TopK-SPLIT(F1+F2) is alwayslarger than TopK-LB(F1) in both data sets. For k=20 withWikipedia, Figure 9(e) shows that TopK-SPLIT skips com-puting dsub(s, σ) with 17% more strings than TopK-LB does.Considering that TopK-SPLIT was 2.2 times faster thanTopK-LB in Figure 9(d), we can conclude that using thelower bound by Lemma 3.9 improves the speed of queryprocessing more than using the lower bound by DYN-LBdoes because the lower bound by Lemma 3.9 requires onlyto check whether each string contains at least a common q-gram with the query string in a constant time. With increas-ing k, both TopK-LB(F1) and TopK-SPLIT(F1) decreaseslowly since the k-th smallest distance so far also becomeslarger together with k. However, since the k-th substringedit distance found so far decreases slower as k grows, thechances that TopK-SPLIT skips the strings using the lowerbound by DYN-LB increase on the contrary and thus TopK-SPLIT(F2) grows gradually.

(3) Number of strings read from D: We also plotted thenumber of data strings examined in disk with random I/Oaccess by TopK-INDEX for DBLP and Wikipedia in Fig-ure 9(c) and Figure 9(f) respectively. The y-axises are in alog scale. The graph confirms that TopK-INDEX effectivelyreduces the cost for reading the data strings in disk.

Varying n: With each of DBLP and Wikipedia, we selectedthe strings from the original data set to produce smaller datawith varying the sampling rate from 6.25% to 100%. Withvarying the size of data, we plotted the execution times inFigure 10(a) for DBLP and in Figure 10(b) for Wikipedia.The graphs show that TopK-INDEX is the fastest in everyrange of data sizes. Both TopK-NGPP and TopK-FSS wereeven slower than TopK-NAIVE and their performance de-grades linearly as data size grows because they have to exam-ine every substring exhaustively. As the data size increases,the relative speedup of TopK-INDEX to TopK-NAIVE im-proves from 4.6 times to 8.6 times for Wikipedia. WithDBLP, the speedup increases from 9.3 times to 24.5 times.Thus, we conclude that the performance of TopK-INDEX

394

Page 11: Efï¬cient Top-k Algorithms for Approximate Substring Matching

1

10

100

1000

10000

6.25 12.5 25 50 100

Execution tim

e (

sec)

Data set size (%)

TopK-FSSTopK-NGPPTopK-NAIVE

TopK-LBTopK-SPLITTopK-INDEX

(a) Varying n (DBLP)

10

100

1000

10000

6.2512.5 25 50 100

Execution tim

e (

sec)

Data set size (%)

TopK-FSSTopK-NGPPTopK-NAIVE

TopK-LBTopK-SPLITTopK-INDEX

(b) Varying n (Wikipedia)

0

100000

200000

300000

400000

500000

600000

2 3 4 5

Exe

cu

tio

n t

ime

(se

c)

The length of q-grams

TopK-NAIVETopK-LB

TopK-SPLITTopK-INDEX

(c) Varying q

10

100

1000

64M 128M 256M 512M

Execution tim

e (

sec)

Buffer size

TopK-NAIVETopK-LB

TopK-SPLITTopK-INDEX

(d) Varying B

Figure 10: Varying n, q and B

0.1

1

10

100

1000

10000

100000

1e+06

5 10 15 20 25

Exe

cutio

n tim

e (s

ec)

Length of query strings

TopK-FSSTopK-NGPPTopK-NAIVE

TopK-LBTopK-SPLITTopK-INDEX

(a) Execution times (Wikipedia)

0

20

40

60

80

100

5 10 15 20 25N

umbe

r of

filte

red

strin

gs (

%)

Length of query strings

TopK-LB (F1)TopK-SPLIT(F1)TopK-SPLIT(F2)

TopK-SPLIT(F1+F2)

(b) Filtering effects (Wikipedia)

Figure 11: Experiments with varying L

0

50000

100000

150000

200000

10 30 50 70 100

Exe

cutio

n tim

e (

sec)

MinHash signature size

TopK-SPLITTopK-INDEX

Figure 12: Varying ℓ (Wikipedia)

does not decline linearly with increasing data size and TopK-INDEX scales well to large data.

Varying L: With varying query lengths L from 5 to 25, weplotted the execution times for Wikipedia in Figure 11(a).The log scale was used on the y-axises. With every range ofL, TopK-INDEX shows the best performance and TopK-FSSis the worst. When the lengths of query strings are 5 and 25,TopK-INDEX is 540 times and 8.44 times faster than TopK-NAIVE respectively. As L is increased, the execution timesfor all algorithms grow gradually. This is because there is aless chance that the strings in D have similar substrings toa longer query string and thus the k-th smallest substringedit distance so far becomes larger with growing L.We also plotted the percentage of strings skipped for com-

puting their substring edit distances dsub(s, σ) in Figure 11(b).With every range of L, TopK-SPLIT skips more strings forcomputing dsub(s, σ) than TopK-LB does. Furthermore,while TopK-LB computes ℓo(dsub(s, σ)) based on dynamicprogramming algorithmDYN-LB for every string inD, TopK-SPLIT skips even computing ℓo(dsub(s, σ)) for many strings.With increasing L, the gaps between the ratios of strings fil-tered by TopK-SPLIT and TopK-LB become larger becausethe query strings have more chance to have very selective q-grams resulting that more strings are filtered using the lowerbound by Lemma 3.9.

Varying ℓ: We varied the MinHash signature size ℓ from10 to 100 and plotted the running times of TopK-SPLITand TopK-INDEX in Figure 12(a). Since the other algo-rithms do not estimate the computational cost to select G′,they are not affected by the MinHash signature size. Thegraph shows that the performances of both algorithms arethe worst when ℓ=10 and do not degrade that much withℓ ≤ 50. Thus, we used ℓ=50 as the default value in all ourexperiments.

Varying q: Figure 10(c) shows the graph of execution timeas the length q of the q-grams used is varied from 3 to5. Since TopK-NAIVE is not affected by the length of q-grams, its execution times are always constant. TopK-LBand TopK-SPLIT show the best performances when q=2.This is because, with a large q, the lower bound ⌈(|σ|−q+1−n)/q⌉ in Lemma 3.2 decreases and thus, less strings are

skipped for computing dsub(s, σ). Furthermore, since themaximum of ℓo(dsub(D

−G′ , σ)), which is ⌊|σ|/q⌋, due to Corol-

lary 3.8 also becomes smaller with a larger q, the criticalpoint string where we meet the k-th string whose substringedit distance is at most ℓo(dsub(D

−G′ , σ)) appears later. How-

ever, TopK-INDEX shows the best performance when q=3because the posting lists become very large with 2-grams.

Varying B: With varying the size of buffer B from 64 MBto 512 MB, we show the execution times in Figure 10(d).With increasing B, the speed of TopK-INDEX is improvedgradually since we have more chance to hit the cached pagesof inverted indexes with larger buffer sizes.

7. RELATED WORKWe present the previous work on approximate string match-

ing first and then approximate substring matching.

Approximate string matching: For approximate stringmatching with a maximum distance threshold τ , the countfiltering technique proposed in [9] utilizes the fact that ifthe edit distance between s and σ is at most τ , they mustshare at least (max(|s|, |σ|)−q+1−τ ·q) q-grams. In [16], an-other algorithm is proposed to speed up approximate stringmatching by reducing the average size of inverted lists usingvariable length q-grams. However, this algorithm utilizesthe length filtering or prefix filtering which cannot be usedfor approximate substring matching. In [1] and [22], thealgorithms to find similar strings probabilistically were pro-posed, but we focused on the problem of finding the exacttop-k approximate substring matches in this paper.

In [3] and [15], the algorithms using inverted q-gram in-dexes are presented. However, as mentioned in [27], to uti-lize the inverted q-gram indexes, the length of the q-gramsto be used should be smaller than (|σ|+1)/(τ+1). Thus,depending on the value of τ and the q-grams used for ac-cessing existing inverted indexes, it is not always possible touse existing inverted q-gram indexes for approximate stringmatching [27]. In [26] and [28], the top-k approximate stringmatching algorithms using inverted q-gram indexes are pre-sented. However, as pointed out in [27], we cannot deter-mine the q-gram size in advance to build indexes and thusthese algorithms may not find the top-k approximate string

395

Page 12: Efï¬cient Top-k Algorithms for Approximate Substring Matching

matches correctly when using preexisting inverted q-gramindexes. One of our proposed algorithms also utilizes exist-ing inverted indexes, but we still guarantee the correctness.The algorithms in [6, 29] utilize the suffix tries which index

a small portion of suffixes only in the data. However, touse suffix tries for approximate substring matching, we haveto index every suffix appearing in the data, resulting largeindex sizes. Because of the high space requirements andpoor locality of suffix tries, it is mentioned in [21] that thesuffix trie based approaches are proper only when both ofthe data and index fit in main memory. Furthermore, sincethe approximate stringmatching algorithm with a maximumthreshold τ using suffix tries in [8] has exponential space andtime complexity to τ , it is hard to be extended for top-kapproximate substring matching.

Approximate substring matching: Approximate sub-string matching has been studied in the context of approxi-mate entity extraction in [4, 17, 27]. With a given thresholdand a entity dictionary, these algorithms are actually joinalgorithms which find every substring in a database suchthat the edit distance between the substring and one of thestrings in the entity dictionary is at most the threshold. Thealgorithms in [4, 27] can be extended for top-k approximatesubstring matching and thus we compared our proposed al-gorithms to our extended algorithms of [4, 27]. The algo-rithm in [17] utilizes inverted q-gram indexes but to useinverted indexes for top-k approximate substring matching,we have to ensure that the length of the q-grams used mustbe smaller than (|σ|+1)/(τk+1) where τk is the k-th small-est substring edit distance in D. Since we cannot know τkin advance to build indexes, we could not adapt the algo-rithm in [17] to our top-k approximate substring matchingproblem.To the best of our knowledge, no previous work addresses

the top-k approximate substring matching problem and ouralgorithms presented here are the first work for the problem.

8. CONCLUSIONIn this paper, we studied the problem of top-k approxi-

mate substring matching. We first proposed efficient filter-ing methods using q-grams which may enable us to skipstrings without computing the actual substring edit dis-tances to the query string. We next presented two algo-rithms TopK-LB and TopK-SPLIT which efficiently find top-k approximate substring matches by utilizing the filteringtechniques. Furthermore, we developed an improved algo-rithm TopK-INDEX which utilizes inverted q-gram indexesto speed up TopK-SPLIT. By experiment results, we showthe effectiveness and scalability of our algorithms with real-life data sets.

AcknowledgmentThis work was supported by the National Research Foun-dation of Korea(NRF) grant funded by the Korea govern-ment(MEST) (No. NRF-2009-0078828). It was also sup-ported by Next-Generation Information Computing Devel-opment Program through the National Research Foundationof Korea(NRF) funded by the Ministry of Education, Sci-ence and Technology (No. NRF-2012M3C4A7033342).

9. REFERENCES[1] A. Andoni and K. Onak. Approximating edit distance in

near-linear time. In STOC, 2009.

[2] R. A. Baeza-Yates and B. A. Ribeiro-Neto. ModernInformation Retrieval - the concepts and technology behind

search, Second edition. Pearson Education Ltd., Harlow,England, 2011.

[3] A. Behm, S. Ji, C. Li, and J. Lu. Space-constrainedgram-based indexing for efficient approximate string search.In ICDE, 2009.

[4] T. Bocek, E. Hunt, and B. Stiller. Fast similarity search inlarge dictionaries. In Technical Report, 2007.

[5] A. Z. Broder. On the resemblance and containment ofdocuments. In Proceedings of Compression and Complexityof SEQUENCES, 1997.

[6] S. Chaudhuri and R. Kaushik. Extending autocompletionto tolerate errors. In SIGMOD, 2009.

[7] Z. Chen, F. Korn, N. Koudas, and S. Muithukrishnan.Selectivity estimation for boolean queries. In PODS, 2000.

[8] R. Cole, L.-A. Gottlieb, and M. Lewenstein. Dictionarymatching and indexing with errors and don’t cares. InSTOC, 2004.

[9] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas,S. Muthukrishnan, and D. Srivastava. Approximate stringjoins in a database (almost) for free. In VLDB, 2001.

[10] L. Jin and C. Li. Selectivity estimation for fuzzy stringpredicates in large data sets. In VLDB, 2005.

[11] N. Koudas, A. Marathe, and D. Srivastava. Flexible stringmatching against large databases in practice. In VLDB,2004.

[12] H. Lee, R. T. Ng, and K. Shim. Extending q-grams toestimate selectivity of string matching with low editdistance. In VLDB, 2007.

[13] H. Lee, R. T. Ng, and K. Shim. Approximate substringselectivity estimation. In EDBT, 2009.

[14] V. I. Levenshtein. Binary codes capable of correctingdeletions, insertions and reversals. Soviet Phys. Dokl., 1985.

[15] C. Li, J. Lu, and Y. Lu. Efficient merging and filteringalgorithms for approximate string searches. In ICDE, 2008.

[16] C. Li, B. Wang, and X. Yang. VGRAM: Improvingperformance of approximate queries on string collectionsusing variable-length grams. In VLDB, 2007.

[17] G. Li, D. Deng, and J. Feng. Faerie: efficient filteringalgorithms for approximate dictionary-based entityextraction. In SIGMOD, 2011.

[18] C.-C. Liu, J.-L. Hsu, and A. L. P. Chen. An approximatestring matching algorithm for content-based music dataretrieval. In ICMCS, 1999.

[19] A. Mazeika, M. H. Bohlen, N. Koudas, and D. Srivastava.Estimating the selectivity of approximate string queries.ACM Trans. Database Syst., 32(2), 2007.

[20] G. Navarro. A guided tour to approximate string matching.ACM Comput. Surv., 33(1):31–88, 2001.

[21] G. Navarro, E. Sutinen, and J. Tarhio. Indexing text withapproximate q-grams. J. Discrete Algorithms, 2005.

[22] R. Ostrovsky and Y. Rabani. Low distortion embeddingsfor edit distance. J. ACM, 54(5), 2007.

[23] M. Sahami and T. D. Heilman. A web-based kernelfunction for measuring the similarity of short text snippets.In WWW, 2006.

[24] P. H. Sellers. The theory and computation of evolutionarydistances: Pattern recognition. J. Algorithms, 1(4), 1980.

[25] E. Ukkonen. Finding approximate patterns in strings. J.Algorithms, 6(1):132–137, 1985.

[26] R. Vernica and C. Li. Efficient top-k algorithms for fuzzysearch in string collections. In KEYS, 2009.

[27] W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficientapproximate entity extraction with edit distanceconstraints. In SIGMOD, 2009.

[28] Z. Yang, J. Yu, and M. Kitsuregawa. Fast algorithms fortop-k approximate string matching. In AAAI, 2010.

[29] Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, andD. Srivastava. Bed-tree: an all-purpose index structure forstring similarity search based on edit distance. InSIGMOD, 2010.

396