1

A Hybrid Indexing Method for Approximate String Matching

Journal of Discrete Algorithms, No. 1, Vol. 1, 2000, pp. 205-239, Gonzalo Navarro and Ricardo Baeza-Yates

Advisor: Prof. R. C. T. Lee Speaker: Y. K. Shieh

2

The approximate string matching problem is:

Given a text T of length n, a pattern P of length m (n > m), and a threshold k on the number of "errors" allowed in a match, find all occurrences of P in T with at most k errors.

3

This paper uses an exhaustive searching mechanism. We open a window T’ in T with size m+k (Rule 2) and try to determine whether every prefix T’’ of this window T’ is guaranteed to have ed(T’’,P) > k.

If the answer is yes, we ignore this window; otherwise, we use dynamic programming to examine whether any prefix T’’ of the window T’ has ed(T’’,P) ≦ k.

4

We use dynamic programming to compute the edit distance between two strings.

A matrix C[0…m, 0…n] is filled, where C_{j,i} represents the minimum number of operations needed to match T_{1…i} to P_{1…j}. It is computed as follows:

$$C_{j,0} = j, \qquad C_{0,i} = i$$

$$C_{j,i} = \begin{cases} C_{j-1,i-1} & \text{if } P_j = T_i \\ 1 + \min(C_{j-1,i},\; C_{j,i-1},\; C_{j-1,i-1}) & \text{otherwise} \end{cases}$$

C_{j,0} and C_{0,i} represent the edit distance between a string of length j or i and the empty string.

5

Example: T = surgery, P = survey, k = 2

          s  u  r  g  e  r  y
       0  1  2  3  4  5  6  7
    s  1  0  1  2  3  4  5  6
    u  2  1  0  1  2  3  4  5
    r  3  2  1  0  1  2  3  4
    v  4  3  2  1  1  2  3  4
    e  5  4  3  2  2  1  2  3
    y  6  5  4  3  3  2  2  2

There are only three prefixes of T, namely surge, surger and surgery, whose edit distances with P = survey are smaller than or equal to k = 2.
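The recurrence can be checked against this example with a minimal Python sketch. The function name `edit_distance_matrix` is an illustrative choice, not something from the paper; rows are indexed by P and columns by T, as in the table above.

```python
def edit_distance_matrix(pattern, text):
    """Return C where C[j][i] is the edit distance between pattern[:j] and text[:i]."""
    m, n = len(pattern), len(text)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m + 1):
        C[j][0] = j  # distance from pattern[:j] to the empty string
    for i in range(n + 1):
        C[0][i] = i  # distance from the empty string to text[:i]
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            if pattern[j - 1] == text[i - 1]:
                C[j][i] = C[j - 1][i - 1]
            else:
                C[j][i] = 1 + min(C[j - 1][i], C[j][i - 1], C[j - 1][i - 1])
    return C


C = edit_distance_matrix("survey", "surgery")
k = 2
# The last row holds ed(P, T[:i]) for every prefix length i.
matching_prefixes = ["surgery"[:i] for i, d in enumerate(C[-1]) if d <= k]
print(C[-1])
print(matching_prefixes)
```

The last row of the matrix is all that is needed to decide which prefixes of T match P with at most k errors.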

6

Let us now see how we can be sure that, for a window T’ with size m+k, every prefix T’’ of T’ has ed(T’’,P) > k.

We present Lemma 1 of this paper as follows.

7

Lemma 1: Let T’ in T and P be two strings such that ed(T’,P) ≦ k. Let P = P1x1P2x2…xj-1Pj, for strings Pi and xi and for any j ≧ 1. Then, at least one string Pi appears in T’ with at most ⌊k/j⌋ errors.

Thus, we always divide the pattern into j pieces. We shall point out how to divide later.

8

To be more precise, we may say that if ed(T’,P) ≦ k, there exists a Pi in P and a T’’ in T’ such that ed(Pi,T’’) ≦ ⌊k/j⌋.

9

Lemma 1 tells us that if for all Pi in P and every substring b in T’, ed(Pi,b) > ⌊k/j⌋, then ed(P,T’) > k.

Suppose that there is a window T’ with size m+k and for all Pi in P and for every substring b in T’, ed(Pi,b) > ⌊k/j⌋.

Then, we can be sure that for every prefix T’’ of T’, for all Pi in P and every substring b in T’’, ed(Pi,b) > ⌊k/j⌋.

[Figure: a substring b inside the prefix T’’ of the window T’ in T, and a piece Pi inside P.]

10

Let us define the following condition.

Condition A: For all Pi in P and every substring b in T’, ed(Pi, b) > ⌊k/j⌋.

Thus, if Condition A is satisfied, then for every prefix T’’ of T’, ed(T’’,P) > k.

In such a case, we ignore T’ and shift P one step to the right.
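Condition A can be tested directly, although not efficiently, with a small dynamic program. The sketch below is only an illustration and is not the indexed search the paper uses: it relies on the well-known variant of the edit-distance recurrence in which the row for the empty pattern is all zeros, so that a match may start anywhere in the window. The names `min_substring_distance` and `condition_a` are hypothetical.

```python
def min_substring_distance(piece, window):
    """Smallest ed(piece, b) over all substrings b of window."""
    prev = list(range(len(piece) + 1))  # column for the empty window prefix
    best = prev[-1]                     # b = empty string costs len(piece)
    for ch in window:
        cur = [0]                       # a substring may start at any position
        for j in range(1, len(piece) + 1):
            if piece[j - 1] == ch:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1], prev[j - 1]))
        best = min(best, cur[-1])
        prev = cur
    return best


def condition_a(pieces, window, k):
    """True if every piece is more than floor(k/j) errors away from every substring."""
    threshold = k // len(pieces)
    return all(min_substring_distance(p, window) > threshold for p in pieces)


# Pieces CA and AG with k = 1 give a threshold of floor(1/2) = 0.
print(condition_a(["CA", "AG"], "ACGGA", 1))  # window without CA or AG
print(condition_a(["CA", "AG"], "CACGG", 1))  # window containing CA
```

When `condition_a` returns True, Lemma 1 guarantees that no prefix of the window matches P with at most k errors, so the window can be skipped.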

11

Question: how can we be sure that the above condition is satisfied?

The approach:

For each Pi, we generate all possible modified strings Pi’ whose distances with Pi are smaller than or equal to k.

After generating all possible modified strings Pi’, we may use the suffix tree of T to find all occurrences of Pi’, for all i, in T with at most ⌊k/j⌋ errors.

12

We still have the following questions:

• Question 1. How to divide P into j pieces?

• Question 2. How to generate all modified Pi’s?

• Question 3. How to find the occurrences of Pi’s in T with edit distance less than or equal to ⌊k/j⌋?

13

Question 1: How to divide P into j pieces?

It can be proved that an optimal method is to partition P into j pieces with j = (m + k) / log_σ n, where σ is the alphabet size. We can get j pieces of P, and the size of every piece is around log_σ n.
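The formula can be evaluated with a short Python sketch. Two details are assumptions made here for illustration and are not stated on the slides: the real-valued j is rounded to the nearest integer (at least 1), and the pattern is cut into contiguous pieces of nearly equal length with empty separators xi. The function names are hypothetical.

```python
import math


def number_of_pieces(m, k, n, sigma):
    """Real-valued j = (m + k) / log_sigma(n)."""
    return (m + k) / math.log(n, sigma)


def split_pattern(pattern, j):
    """Split pattern into j contiguous pieces of nearly equal length (assumed scheme)."""
    m = len(pattern)
    bounds = [round(i * m / j) for i in range(j + 1)]
    return [pattern[bounds[i]:bounds[i + 1]] for i in range(j)]


# Values taken from the full example later in the slides: m = 4, k = 1, n = 17, sigma = 3.
j_real = number_of_pieces(4, 1, 17, 3)
j = max(1, round(j_real))  # rounding rule is an assumption
pieces = split_pattern("CAAG", j)
print(j_real, j, pieces)
```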

14

Question 2. How to generate all modified Pi’s?

The generation of all modified strings whose distances with P are smaller than or equal to k can be done trivially. One method can be found in [HHLS2006], which was reported by C. W. Lu.

Another method can be found in [HM2007], reported by L. C. Chen.

In this paper, the authors used the second method mentioned in [HM2007].

15

We can use non-deterministic finite automata (NFA). An NFA is a five-tuple M = (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a finite input alphabet, δ is a mapping from Q × (Σ ∪ {ε}) into the set of subsets of Q, q0 ∈ Q is an initial state, and F ⊆ Q is a set of final states.

16

P = abac, k = 2.

The finite automaton M accepts Lk(P).

Lk(P)={aa, ab, ac, ba, bc, aaa, aab, aac, aba, abb, abc, acc, baa, bab, bac, bbc, bcc, aaaa, aaab, aaac, aaba, aabc, aaca, aacb, aacc, abaa, abab, abac, abba, abbb, abbc, abca, abcb, abcc, baac, babc, bbac, bbbc, bcac}.

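Lk(P) = { Y | D_L(P, Y) ≦ k }

[Figure: NFA M for P = abac with k = 2. States are labeled (i, e): (0,0) to (4,0) in the no-error row, (1,1) to (4,1) in the one-error row, and (2,2) to (4,2) in the two-error row. The no-error row reads a, b, a, c; the other transitions, some labeled ε, connect the rows. Annotated states: one matched, no error; one matched, one error; two matched, two errors; four matched, no error.]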

17

Example: the automaton M for P = abac, k = 2 recognizes aa.

[Figure: the same NFA as on the previous slide, highlighting the path that recognizes aa.]

18

Full example: T = GACACGGACCAAAGCAG, n = 17, P = CAAG, m = 4, k = 1

19

P = CAAG, j = (m + k) / log_σ n = (4 + 1) / log_3 17 = 1.9388. Therefore, we partition P into two pieces.

P1 = CA, P2 = AG

According to Lemma 1, at least one piece appears in substrings of T with at most ⌊1/2⌋ = 0 errors. This means that we want to find exact matches of P1 and P2.

20

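NFA with k = 1 of P1 = CA:

[Figure: NFA with states (0,0), (1,0), (2,0) in the zero-error row and (1,1), (2,1) in the one-error row; the zero-error row reads C, A.]

NFA with k = 1 of P2 = AG:

[Figure: NFA with states (0,0), (1,0), (2,0) in the zero-error row and (1,1), (2,1) in the one-error row; the zero-error row reads A, G.]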

21

T = GACACGGACCAAAGCAG. We construct the suffix tree of T.

[Figure: suffix tree of T$, with leaves labeled by the suffix start positions 1 to 17.]

22

We only need to consider the tree levels from the root down to depth ⌈log_3 17⌉ = 3.

[Figure: the suffix tree of T = GACACGGACCAAAGCAG truncated at depth 3, with leaves labeled by start positions.]
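The depth-limited index can be imitated with a plain nested-dictionary trie. This is a simplified stand-in for a real suffix tree, chosen here only to keep the example short: every suffix is inserted up to the given depth, and each node remembers the 1-based start positions of the suffixes that pass through it. The function names and the node layout are hypothetical.

```python
def build_truncated_suffix_trie(text, depth):
    """Trie of all substrings of text up to the given length, with start positions."""
    root = {"children": {}, "positions": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:start + depth]:
            node = node["children"].setdefault(ch, {"children": {}, "positions": []})
            node["positions"].append(start + 1)  # 1-based, as on the slides
    return root


def exact_occurrences(trie, piece):
    """Start positions of piece; only valid when len(piece) <= trie depth."""
    node = trie
    for ch in piece:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["positions"]


T = "GACACGGACCAAAGCAG"
trie = build_truncated_suffix_trie(T, 3)
print(exact_occurrences(trie, "CA"))
print(exact_occurrences(trie, "AG"))
```

A genuine suffix tree stores the same information in O(n) space; the trie above trades space for brevity.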

23

We traverse the truncated suffix tree from the root and feed the characters on each branch to the NFA of P1 = CA and the NFA of P2 = AG (k = 1).

[Figures: step-by-step traversal of the truncated suffix tree of T = GACACGGACCAAAGCAG, showing the NFA of P1 and the NFA of P2 after each character.]

• Branch AG: exact match for P2. We record positions 13 and 16 where AG occurs.

• Branch CA: exact match for P1. We record positions 3, 10 and 15 where CA occurs.

• On the other branches, the NFAs either report no exact match or run out of active states.

36

After we find all probable positions in T, we verify every substring at those positions.

The probable positions of T are: 3, 10, 13, 15, 16

We use dynamic programming to verify whether any approximate match occurs between T and P at the above locations.
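The verification step can be sketched in Python as follows. The rule for choosing windows is an assumption read off the slides that follow: a window of size m+k (shorter near the end of T, down to m−k) is examined only if it contains at least one probable position. The names `last_row` and `verify` are hypothetical.

```python
def last_row(pattern, window):
    """row[i] = ed(pattern, window[:i]) for i = 0 .. len(window)."""
    prev = list(range(len(pattern) + 1))
    row = [prev[-1]]
    for ch in window:
        cur = [prev[0] + 1]
        for j in range(1, len(pattern) + 1):
            if pattern[j - 1] == ch:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1], prev[j - 1]))
        row.append(cur[-1])
        prev = cur
    return row


def verify(text, pattern, k, probable_positions):
    """Map each window start (1-based) to the window prefixes within k errors of pattern."""
    m, n = len(pattern), len(text)
    found = {}
    for start in range(1, n - (m - k) + 2):
        end = min(start + m + k - 1, n)
        if not any(start <= p <= end for p in probable_positions):
            continue  # assumed skipping rule: no probable position in this window
        window = text[start - 1:end]
        row = last_row(pattern, window)
        hits = [window[:i] for i in range(1, len(window) + 1) if row[i] <= k]
        if hits:
            found[start] = hits
    return found


result = verify("GACACGGACCAAAGCAG", "CAAG", 1, [3, 10, 13, 15, 16])
print(result)
```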

37

The probable positions of T are 3, 10, 13, 15, 16 (k = 1).

      G  A  C  A  C  G  G  A  C  C  A  A  A  G  C  A  G
      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

P = CAAG. Window of size m+k = 5 at positions 1 to 5: GACAC.

          G  A  C  A  C
       0  1  2  3  4  5
    C  1  1  2  2  3  4
    A  2  2  1  2  2  3
    A  3  3  2  2  2  3
    G  4  3  3  3  3  3

No approximate matching with k = 1 found.

38

P = CAAG. Window of size m+k = 5 at positions 2 to 6: ACACG.

          A  C  A  C  G
       0  1  2  3  4  5
    C  1  1  1  2  3  4
    A  2  1  2  1  2  3
    A  3  2  2  2  2  3
    G  4  3  3  3  3  2

The probable positions of T are: 3, 10, 13, 15, 16 (k = 1).


39

P = CAAG. Window of size m+k = 5 at positions 3 to 7: CACGG (k = 1).

          C  A  C  G  G
       0  1  2  3  4  5
    C  1  0  1  2  3  4
    A  2  1  0  1  2  3
    A  3  2  1  1  2  3
    G  4  3  2  2  1  2

CACG is found.

40

P = CAAG. The window of size m+k at positions 4 to 8 does not include any probable position. Therefore we can ignore this window.

41

P = CAAG. The window of size m+k at positions 5 to 9 does not include any probable position. Therefore we can shift the window directly.

42

P = CAAG. Window of size m+k = 5 at positions 6 to 10: GGACC.

          G  G  A  C  C
       0  1  2  3  4  5
    C  1  1  2  3  3  4
    A  2  2  2  2  3  4
    A  3  3  3  2  3  4
    G  4  3  3  3  3  4



43

P = CAAG. Window of size m+k = 5 at positions 7 to 11: GACCA.

          G  A  C  C  A
       0  1  2  3  4  5
    C  1  1  2  2  3  4
    A  2  2  1  2  3  3
    A  3  3  2  2  3  3
    G  4  3  3  3  3  4



44

P = CAAG. Window of size m+k = 5 at positions 8 to 12: ACCAA.

          A  C  C  A  A
       0  1  2  3  4  5
    C  1  1  1  2  3  4
    A  2  1  2  2  2  3
    A  3  2  2  3  2  2
    G  4  3  3  3  3  3



45

P = CAAG. Window of size m+k = 5 at positions 9 to 13: CCAAA.

          C  C  A  A  A
       0  1  2  3  4  5
    C  1  0  1  2  3  4
    A  2  1  1  1  2  3
    A  3  2  2  1  1  2
    G  4  3  3  2  2  2



46

P = CAAG. Window of size m+k = 5 at positions 10 to 14: CAAAG (k = 1).

          C  A  A  A  G
       0  1  2  3  4  5
    C  1  0  1  2  3  4
    A  2  1  0  1  2  3
    A  3  2  1  0  1  2
    G  4  3  2  1  1  1

CAA, CAAA and CAAAG are found.

47

P = CAAG. Window of size m+k = 5 at positions 11 to 15: AAAGC.

          A  A  A  G  C
       0  1  2  3  4  5
    C  1  1  2  3  4  4
    A  2  1  1  2  3  4
    A  3  2  1  1  2  3
    G  4  3  2  2  1  2

AAAG is found.

48

P = CAAG. Window of size m+k = 5 at positions 12 to 16: AAGCA.

          A  A  G  C  A
       0  1  2  3  4  5
    C  1  1  2  3  3  4
    A  2  1  1  2  3  3
    A  3  2  1  2  3  3
    G  4  3  2  1  2  3

AAG is found.

49

P = CAAG. Window of size m+k = 5 at positions 13 to 17: AGCAG.

          A  G  C  A  G
       0  1  2  3  4  5
    C  1  1  2  2  3  4
    A  2  1  2  3  2  3
    A  3  2  2  3  3  3
    G  4  3  2  3  4  3



50

P = CAAG. Window of size m = 4 at positions 14 to 17: GCAG.

          G  C  A  G
       0  1  2  3  4
    C  1  1  1  2  3
    A  2  2  2  1  2
    A  3  3  3  2  2
    G  4  3  4  3  2



51

P = CAAG. Window of size m−k = 3 at positions 15 to 17: CAG.

          C  A  G
       0  1  2  3
    C  1  0  1  2
    A  2  1  0  1
    A  3  2  1  1
    G  4  3  2  1

CAG is found.

52

Time complexity

• The preprocessing time for constructing the automata and the suffix tree of T is O(|N|·m) and O(n), respectively, where |N| is the number of states in an NFA and m is the length of P.

• The search time obtained using the partitioning scheme is O(n^λ log n), where λ < 1 when the tolerated error ratio α = k/m satisfies α < 1 − e/√σ, where e = 2.718… .


Thank you
