A Hybrid Indexing Method for Approximate String Matching


Journal of Discrete Algorithms, No. 1, Vol. 1, 2000, pp. 205-239, Gonzalo Navarro and Ricardo Baeza-Yates

Advisor: Prof. R. C. T. Lee Speaker: Y. K. Shieh

Page 2: A Hybrid Indexing Method for  Approximate String Matching

The approximate string matching problem is:

Given a text T of length n, a pattern P of length m (n > m), and a threshold k on the number of "errors" allowed in a match, find all occurrences of P in T with at most k errors.

Page 3: A Hybrid Indexing Method for  Approximate String Matching

This paper uses an exhaustive search mechanism. We open a window T’ of size m+k in T (Rule 2) and try to establish that every prefix T’’ of this window T’ has ed(T’’,P) > k.

If we can establish this, we ignore the window; otherwise, we use dynamic programming to examine whether any prefix T’’ of the window T’ has ed(T’’,P) ≤ k.

Page 4: A Hybrid Indexing Method for  Approximate String Matching

We use dynamic programming to compute the edit distance between two strings.

A matrix $C_{0 \dots m,\,0 \dots n}$ is filled, where $C_{j,i}$ represents the minimum number of operations needed to match $T_{1 \dots i}$ to $P_{1 \dots j}$. It is computed as follows:

$$C_{j,0} = j, \qquad C_{0,i} = i,$$

$$C_{j,i} = \begin{cases} C_{j-1,i-1} & \text{if } P_j = T_i, \\ 1 + \min(C_{j-1,i},\; C_{j,i-1},\; C_{j-1,i-1}) & \text{otherwise.} \end{cases}$$

$C_{j,0}$ and $C_{0,i}$ represent the edit distance between a string of length j or i and the empty string.

Page 5: A Hybrid Indexing Method for  Approximate String Matching

Example: T = surgery, P = survey, k = 2

         s  u  r  g  e  r  y
      0  1  2  3  4  5  6  7
   s  1  0  1  2  3  4  5  6
   u  2  1  0  1  2  3  4  5
   r  3  2  1  0  1  2  3  4
   v  4  3  2  1  1  2  3  4
   e  5  4  3  2  2  1  2  3
   y  6  5  4  3  3  2  2  2

There are only three prefixes of T, namely surge, surger and surgery, whose edit distances with P = survey are smaller than or equal to k = 2.
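The table can be reproduced with a few lines of code. The following is a minimal sketch of the recurrence, not code from the paper; the function names `edit_distance_matrix` and `prefixes_within` are made up for this illustration.

```python
def edit_distance_matrix(pattern: str, text: str) -> list[list[int]]:
    """Fill C[j][i] = edit distance between pattern[:j] and text[:i]."""
    m, n = len(pattern), len(text)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m + 1):
        C[j][0] = j                      # pattern prefix vs. empty string
    for i in range(n + 1):
        C[0][i] = i                      # empty string vs. text prefix
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            if pattern[j - 1] == text[i - 1]:
                C[j][i] = C[j - 1][i - 1]
            else:
                C[j][i] = 1 + min(C[j - 1][i], C[j][i - 1], C[j - 1][i - 1])
    return C


def prefixes_within(pattern: str, text: str, k: int) -> list[str]:
    """Prefixes of text whose edit distance to the whole pattern is <= k."""
    last_row = edit_distance_matrix(pattern, text)[len(pattern)]
    return [text[:i] for i in range(1, len(text) + 1) if last_row[i] <= k]


if __name__ == "__main__":
    for row in edit_distance_matrix("survey", "surgery"):
        print(row)
    print(prefixes_within("survey", "surgery", 2))
```

Only the last row of the matrix is needed to decide which prefixes qualify, which is why `prefixes_within` discards the rest.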

Page 6: A Hybrid Indexing Method for  Approximate String Matching

Let us now see how we can be sure that, for a window T’ of size m+k, every prefix T’’ of T’ has ed(T’’,P) > k.

We present Lemma 1 of this paper as follows.

Page 7: A Hybrid Indexing Method for  Approximate String Matching

Lemma 1. Let T’ in T and P be two strings such that ed(T’,P) ≤ k. Let $P = P_1 x_1 P_2 x_2 \dots x_{j-1} P_j$, for strings Pi and xi and for any j ≥ 1. Then, at least one string Pi appears in T’ with at most $\lfloor k/j \rfloor$ errors.

Thus, we always divide the pattern into j pieces. We shall point out how to divide it later.
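One way to see why the bound in Lemma 1 holds is a counting argument. The sketch below is not taken from the slides; it assumes that $e_i$ denotes the number of errors that an optimal alignment of P against T’ places inside piece $P_i$, so that the $e_i$ sum to at most k. If every piece needed more than $\lfloor k/j \rfloor$ errors, we would get

```latex
\sum_{i=1}^{j} e_i \;\ge\; j\left(\lfloor k/j \rfloor + 1\right) \;>\; j \cdot \frac{k}{j} \;=\; k ,
```

which contradicts $\sum_i e_i \le k$. Hence some piece satisfies $e_i \le \lfloor k/j \rfloor$.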

Page 8: A Hybrid Indexing Method for  Approximate String Matching

To be more precise, we may say that if ed(T’,P) ≤ k, there exist a piece Pi in P and a substring T’’ in T’ such that

ed(Pi,T’’) ≤ $\lfloor k/j \rfloor$.

Page 9: A Hybrid Indexing Method for  Approximate String Matching

Lemma 1 tells us that if, for all Pi in P and every substring b in T’, ed(Pi,b) > $\lfloor k/j \rfloor$, then ed(P,T’) > k.

Suppose that there is a window T’ of size m+k and that, for all Pi in P and for every substring b in T’, ed(Pi,b) > $\lfloor k/j \rfloor$.

Then we can be sure that, for every prefix T’’ of T’, for all Pi in P and every substring b in T’’, ed(Pi,b) > $\lfloor k/j \rfloor$.

[Figure: a substring b inside a prefix T’’ of the window T’ of T, and a piece Pi inside P.]

Page 10: A Hybrid Indexing Method for  Approximate String Matching

Let us define the following condition.

Condition A: For all Pi in P and every substring b in T’, ed(Pi, b) > $\lfloor k/j \rfloor$.

Thus, if Condition A is satisfied, then for every prefix T’’ of T’, ed(T’’,P) > k.

In such a case, we ignore T’ and shift P one step to the right.
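Condition A can be checked directly, if slowly, with a variant of the edit-distance recurrence in which a match may start at any position of the window (the first row is all zeros). The sketch below is only an illustration of the condition, not the index-based method the paper uses; the function names are hypothetical, and the pieces CA and AG and the two windows are borrowed from the full example later in the slides.

```python
def min_substring_distance(piece: str, window: str) -> int:
    """Smallest ed(piece, b) over all substrings b of window."""
    prev = [0] * (len(window) + 1)       # a substring may start anywhere
    for j in range(1, len(piece) + 1):
        cur = [j] + [0] * len(window)
        for i in range(1, len(window) + 1):
            if piece[j - 1] == window[i - 1]:
                cur[i] = prev[i - 1]
            else:
                cur[i] = 1 + min(prev[i], cur[i - 1], prev[i - 1])
        prev = cur
    return min(prev)                     # a substring may end anywhere


def condition_a_holds(pieces: list[str], window: str, k: int) -> bool:
    """True if every piece needs more than floor(k/j) errors everywhere in window."""
    threshold = k // len(pieces)
    return all(min_substring_distance(p, window) > threshold for p in pieces)


if __name__ == "__main__":
    print(condition_a_holds(["CA", "AG"], "ACGGA", 1))   # window can be skipped
    print(condition_a_holds(["CA", "AG"], "CACGG", 1))   # window must be verified
```

When `condition_a_holds` returns True, the window can be ignored without running the full verification.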

Page 11: A Hybrid Indexing Method for  Approximate String Matching

Question: how can we be sure that the above condition is satisfied?

The approach:

For each Pi, we generate all possible modified strings of Pi whose distances with Pi are smaller than or equal to k.

After generating all possible modified Pi’s, we may use the suffix tree of T to find all occurrences of the modified Pi’s, for all i, in T with at most $\lfloor k/j \rfloor$ errors.

Page 12: A Hybrid Indexing Method for  Approximate String Matching

We still have the following questions:

• Question 1. How to divide P into j pieces?

• Question 2. How to generate all modified Pi’s?

• Question 3. How to find the occurrences of Pi’s in T with edit distance less than or equal to $\lfloor k/j \rfloor$?

Page 13: A Hybrid Indexing Method for  Approximate String Matching

Question 1: How to divide P into j pieces?

It can be proved that an optimal method is to partition P into j pieces with

$$j = \frac{m+k}{\log_\sigma n},$$

where σ is the alphabet size. We get j pieces of P, and the size of every piece is around $\log_\sigma n$.

Page 14: A Hybrid Indexing Method for  Approximate String Matching

Question 2. How to generate all modified Pi’s?

The generation of all modified strings whose distances with P are at most k can be done trivially. One method can be found in [HHLS2006], which was reported by C. W. Lu.

Another method can be found in [HM2007], reported by L. C. Chen.

In this paper, the authors used the second method mentioned in [HM2007].

Page 15: A Hybrid Indexing Method for  Approximate String Matching

We can use non-deterministic finite automata (NFA). An NFA is a five-tuple M = (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a finite input alphabet, δ is a mapping from Q × (Σ ∪ {ε}) into the set of subsets of Q, q0 ∈ Q is an initial state, and F ⊆ Q is a set of final states.

Page 16: A Hybrid Indexing Method for  Approximate String Matching

P = abac, k = 2.

The finite automaton M accepts $L_k(P) = \{\, Y \mid Y \in \Sigma^*,\ D_L(P, Y) \le k \,\}$.

Lk(P)={aa, ab, ac, ba, bc, aaa, aab, aac, aba, abb, abc, acc, baa, bab, bac, bbc, bcc, aaaa, aaab, aaac, aaba, aabc, aaca, aacb, aacc, abaa, abab, abac, abba, abbb, abbc, abca, abcb, abcc, baac, babc, bbac, bbbc, bcac}.

[Figure: NFA M for P = abac with k = 2. States are labeled i,e: the top row holds 0,0 to 4,0 (no error), the middle row holds 1,1 to 4,1 (one error) and the bottom row holds 2,2 to 4,2 (two errors). The top-row transitions are labeled a, b, a, c; the remaining transitions are labeled ε or with alphabet characters. Callouts: "One matched, no error.", "One matched, one error.", "Two matched, two errors.", "Four matched, no error."]

Page 17: A Hybrid Indexing Method for  Approximate String Matching

P = abac, k = 2.

[Figure: the same NFA M for P = abac with k = 2, with the caption "Recognize aa".]
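To make the idea of an error-counting automaton concrete, here is a minimal simulation sketch. It assumes the standard Levenshtein NFA, in which state (i, e) means "i pattern characters consumed, e errors used" and insertions, substitutions and ε-deletions each cost one error; the automaton in the figure comes from [HM2007] and may be built differently, so this sketch is not claimed to reproduce the listed set exactly. The function name `accepts` is hypothetical.

```python
def accepts(pattern: str, k: int, word: str) -> bool:
    """Simulate a Levenshtein NFA for pattern with at most k errors on word."""
    m = len(pattern)

    def closure(states: set) -> set:
        # ε-transitions: skip one pattern character at the cost of one error.
        stack, seen = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if i < m and e < k and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    active = closure({(0, 0)})
    for ch in word:
        nxt = set()
        for i, e in active:
            if i < m and pattern[i] == ch:
                nxt.add((i + 1, e))            # match, no new error
            if e < k:
                nxt.add((i, e + 1))            # extra character in word
                if i < m:
                    nxt.add((i + 1, e + 1))    # substituted character
        active = closure(nxt)
        if not active:                         # out of active states
            return False
    return any(i == m for i, _ in active)


if __name__ == "__main__":
    print(accepts("abac", 2, "aa"), accepts("abac", 2, "a"))
```

Running out of active states before the input ends is the situation the later slides label "Out of active states."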

Page 18: A Hybrid Indexing Method for  Approximate String Matching

Full example: T = GACACGGACCAAAGCAG, n = 17, P = CAAG, m = 4, k = 1

Page 19: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, $j = (m + k) / \log_\sigma n = (4 + 1) / \log_3 17 = 1.9388$. Therefore, we partition P into two pieces.

P1 = CA, P2 = AG

According to Lemma 1, at least one piece appears in substrings of T with at most $\lfloor 1/2 \rfloor = 0$ errors. This means that we want to find exact matches of P1 and P2.
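The arithmetic above can be checked with a short sketch. Two things here are assumptions rather than statements from the slides: the value of j is rounded to the nearest integer (the slides only say that 1.9388 leads to two pieces), and the pattern is cut into contiguous pieces of near-equal length with empty separators xi. The function names are hypothetical.

```python
import math


def piece_count(m: int, k: int, n: int, sigma: int) -> float:
    """j = (m + k) / log_sigma(n), before rounding."""
    return (m + k) / math.log(n, sigma)


def split_pattern(pattern: str, j: int) -> list[str]:
    """Split pattern into j contiguous pieces of near-equal length."""
    base, extra = divmod(len(pattern), j)
    pieces, start = [], 0
    for idx in range(j):
        length = base + (1 if idx < extra else 0)
        pieces.append(pattern[start:start + length])
        start += length
    return pieces


if __name__ == "__main__":
    j = piece_count(4, 1, 17, 3)
    print(j)                                   # about 1.9388
    print(split_pattern("CAAG", round(j)))     # two pieces
    print(1 // round(j))                       # errors allowed per piece
```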

Page 20: A Hybrid Indexing Method for  Approximate String Matching

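NFA with k = 1 of P1 = CA:

[Figure: NFA with states 0,0, 1,0, 2,0 in the zero-error row and 1,1, 2,1 in the one-error row; transitions are labeled C, A and ε.]

NFA with k = 1 of P2 = AG:

[Figure: NFA with states 0,0, 1,0, 2,0 in the zero-error row and 1,1, 2,1 in the one-error row; transitions are labeled A, G and ε.]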

Page 21: A Hybrid Indexing Method for  Approximate String Matching

T = GACACGGACCAAAGCAG. We construct the suffix tree of T.

[Figure: suffix tree of T$; the leaves are numbered 1 to 17 by the starting position of their suffix.]

Page 22: A Hybrid Indexing Method for  Approximate String Matching

We only need to consider the tree levels from the root down to $\lceil \log_3 17 \rceil = 3$.

[Figure: the suffix tree of T = GACACGGACCAAAGCAG truncated at depth 3, with leaves numbered 1 to 17.]
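The effect of a depth-limited suffix tree can be imitated with a much simpler structure. The sketch below swaps the suffix tree for a plain dictionary that maps every substring of length at most 3 to its starting positions; this is an assumption made for brevity, and it wastes space compared with a real suffix tree. The function name `depth_limited_index` is hypothetical.

```python
from collections import defaultdict


def depth_limited_index(text: str, depth: int) -> dict[str, list[int]]:
    """Map every substring of length <= depth to its 1-based start positions."""
    index = defaultdict(list)
    for start in range(len(text)):
        for length in range(1, depth + 1):
            if start + length > len(text):
                break                          # substring would run past the end
            index[text[start:start + length]].append(start + 1)
    return dict(index)


if __name__ == "__main__":
    idx = depth_limited_index("GACACGGACCAAAGCAG", 3)
    print(idx.get("CA"), idx.get("AG"))
```

Looking up the two pieces CA and AG in this index gives the candidate positions that the following slides find by walking the tree.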

Page 23: A Hybrid Indexing Method for  Approximate String Matching

[Figure: the suffix tree of T = GACACGGACCAAAGCAG truncated at depth 3, shown with the NFA of P1 and the NFA of P2 (k = 1). Character shown at both NFAs: A.]

Page 24: A Hybrid Indexing Method for  Approximate String Matching

[Figure: the truncated suffix tree of T = GACACGGACCAAAGCAG with the NFAs of P1 and P2 (k = 1). Character shown at both NFAs: A. Annotations: (not exact match); (not exact match).]

Page 25: A Hybrid Indexing Method for  Approximate String Matching

[Figure: the truncated suffix tree of T = GACACGGACCAAAGCAG with the NFAs of P1 and P2 (k = 1). Character shown at the NFAs: A. Annotations: Out of active states; (not exact match).]

Page 26: A Hybrid Indexing Method for  Approximate String Matching

[Figure: the truncated suffix tree of T = GACACGGACCAAAGCAG with the NFAs of P1 and P2 (k = 1). Character shown at the NFAs: G. Annotations: (exact match); Out of active states.]

We record positions 13 and 16 where AG occurs.

Page 27: A Hybrid Indexing Method for  Approximate String Matching

[Figure: the truncated suffix tree of T = GACACGGACCAAAGCAG with the NFAs of P1 and P2 (k = 1). Character shown at both NFAs: C.]

Page 28: A Hybrid Indexing Method for  Approximate String Matching

[Figure: the truncated suffix tree of T = GACACGGACCAAAGCAG with the NFAs of P1 and P2 (k = 1). Character shown at the NFAs: A. Annotations: Out of active states; (exact match).]

We record positions 3, 10 and 15 where CA occurs.

Page 29: A Hybrid Indexing Method for  Approximate String Matching

[Figure: the truncated suffix tree of T = GACACGGACCAAAGCAG with the NFAs of P1 and P2 (k = 1). Character shown at the NFAs: C. Annotations: Out of active states; (not exact match).]

Page 30: A Hybrid Indexing Method for  Approximate String Matching

[Figure: the truncated suffix tree of T = GACACGGACCAAAGCAG with the NFAs of P1 and P2 (k = 1). Character shown at both NFAs: G. Annotations: (not exact match); (not exact match).]

Page 31: A Hybrid Indexing Method for  Approximate String Matching

[Figure: the truncated suffix tree of T = GACACGGACCAAAGCAG with the NFAs of P1 and P2 (k = 1). Character shown at both NFAs: G.]

Page 32: A Hybrid Indexing Method for  Approximate String Matching

[Figure: the truncated suffix tree of T = GACACGGACCAAAGCAG with the NFAs of P1 and P2 (k = 1). Annotations: Out of active states; Out of active states.]

Page 33: A Hybrid Indexing Method for  Approximate String Matching

[Figure: the truncated suffix tree of T = GACACGGACCAAAGCAG with the NFAs of P1 and P2 (k = 1). Character shown at the NFAs: A. Annotations: Out of active states; (not exact match).]

Page 34: A Hybrid Indexing Method for  Approximate String Matching

[Figure: the truncated suffix tree of T = GACACGGACCAAAGCAG with the NFAs of P1 and P2 (k = 1). Annotations: Out of active states; Out of active states.]

Page 35: A Hybrid Indexing Method for  Approximate String Matching

[Figure: the truncated suffix tree of T = GACACGGACCAAAGCAG with the NFAs of P1 and P2 (k = 1). Character shown at the NFAs: G. Annotations: Out of active states; (not exact match).]

Page 36: A Hybrid Indexing Method for  Approximate String Matching

After we find all probable positions in T, we verify every window of T that contains one of those positions.

The probable positions of T are: 3, 10, 13, 15, 16

We use dynamic programming to verify whether any approximate match between T and P occurs at the above locations.
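The verification step can be sketched as a loop over windows of size m+k that skips every window containing no probable position. This is an illustrative reading of the procedure walked through on the following slides, not the paper's implementation; the function names and the dictionary return format are assumptions.

```python
def last_row(pattern: str, window: str) -> list[int]:
    """Last row of the edit-distance matrix of pattern against window."""
    prev = list(range(len(window) + 1))
    for j in range(1, len(pattern) + 1):
        cur = [j] + [0] * len(window)
        for i in range(1, len(window) + 1):
            if pattern[j - 1] == window[i - 1]:
                cur[i] = prev[i - 1]
            else:
                cur[i] = 1 + min(prev[i], cur[i - 1], prev[i - 1])
        prev = cur
    return prev


def verify(text: str, pattern: str, k: int, candidates: list[int]) -> dict[int, list[int]]:
    """Return {window start: [lengths of matching prefixes]}, 1-based starts."""
    n, size = len(text), len(pattern) + k
    found = {}
    for s in range(1, n + 1):
        end = min(s + size - 1, n)
        if not any(s <= c <= end for c in candidates):
            continue                           # no probable position: skip window
        row = last_row(pattern, text[s - 1:end])
        lengths = [i for i in range(1, len(row)) if row[i] <= k]
        if lengths:
            found[s] = lengths
    return found


if __name__ == "__main__":
    print(verify("GACACGGACCAAAGCAG", "CAAG", 1, [3, 10, 13, 15, 16]))
```

On the example text, the windows starting at positions 4, 5 and 17 contain no probable position and are never passed to the dynamic program.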

Page 37: A Hybrid Indexing Method for  Approximate String Matching

The probable positions of T are 3, 10, 13, 15, 16. P = CAAG, k = 1.

Window T[1..5] = GACAC (size m+k = 5):

         G  A  C  A  C
      0  1  2  3  4  5
   C  1  1  2  2  3  4
   A  2  2  1  2  2  3
   A  3  3  2  2  2  3
   G  4  3  3  3  3  3

No approximate matching with k = 1 found.

Page 38: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, k = 1. Window T[2..6] = ACACG (size m+k = 5):

         A  C  A  C  G
      0  1  2  3  4  5
   C  1  1  1  2  3  4
   A  2  1  2  1  2  3
   A  3  2  2  2  2  3
   G  4  3  3  3  3  2

No approximate matching with k = 1 found.

Page 39: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, k = 1. Window T[3..7] = CACGG (size m+k = 5):

         C  A  C  G  G
      0  1  2  3  4  5
   C  1  0  1  2  3  4
   A  2  1  0  1  2  3
   A  3  2  1  1  2  3
   G  4  3  2  2  1  2

CACG is found.

Page 40: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, k = 1. Window T[4..8] = ACGGA (size m+k = 5):

This window does not include any probable position. Therefore we can ignore this window.

Page 41: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, k = 1. Window T[5..9] = CGGAC (size m+k = 5):

The window does not include any probable position. Therefore we can shift the window directly.

Page 42: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, k = 1. Window T[6..10] = GGACC (size m+k = 5):

         G  G  A  C  C
      0  1  2  3  4  5
   C  1  1  2  3  3  4
   A  2  2  2  2  3  4
   A  3  3  3  2  3  4
   G  4  3  3  3  3  4

No approximate matching with k = 1 found.

Page 43: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, k = 1. Window T[7..11] = GACCA (size m+k = 5):

         G  A  C  C  A
      0  1  2  3  4  5
   C  1  1  2  2  3  4
   A  2  2  1  2  3  3
   A  3  3  2  2  3  3
   G  4  3  3  3  3  4

No approximate matching with k = 1 found.

Page 44: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, k = 1. Window T[8..12] = ACCAA (size m+k = 5):

         A  C  C  A  A
      0  1  2  3  4  5
   C  1  1  1  2  3  4
   A  2  1  2  2  2  3
   A  3  2  2  3  2  2
   G  4  3  3  3  3  3

No approximate matching with k = 1 found.

Page 45: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, k = 1. Window T[9..13] = CCAAA (size m+k = 5):

         C  C  A  A  A
      0  1  2  3  4  5
   C  1  0  1  2  3  4
   A  2  1  1  1  2  3
   A  3  2  2  1  1  2
   G  4  3  3  2  2  2

No approximate matching with k = 1 found.

Page 46: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, k = 1. Window T[10..14] = CAAAG (size m+k = 5):

         C  A  A  A  G
      0  1  2  3  4  5
   C  1  0  1  2  3  4
   A  2  1  0  1  2  3
   A  3  2  1  0  1  2
   G  4  3  2  1  1  1

CAA, CAAA and CAAAG are found.

Page 47: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, k = 1. Window T[11..15] = AAAGC (size m+k = 5):

         A  A  A  G  C
      0  1  2  3  4  5
   C  1  1  2  3  4  4
   A  2  1  1  2  3  4
   A  3  2  1  1  2  3
   G  4  3  2  2  1  2

AAAG is found.

Page 48: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, k = 1. Window T[12..16] = AAGCA (size m+k = 5):

         A  A  G  C  A
      0  1  2  3  4  5
   C  1  1  2  3  3  4
   A  2  1  1  2  3  3
   A  3  2  1  2  3  3
   G  4  3  2  1  2  3

AAG is found.

Page 49: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, k = 1. Window T[13..17] = AGCAG (size m+k = 5):

         A  G  C  A  G
      0  1  2  3  4  5
   C  1  1  2  2  3  4
   A  2  1  2  3  2  3
   A  3  2  2  3  3  3
   G  4  3  2  3  4  3

No approximate matching with k = 1 found.

Page 50: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, k = 1. Window T[14..17] = GCAG (size m = 4):

         G  C  A  G
      0  1  2  3  4
   C  1  1  1  2  3
   A  2  2  2  1  2
   A  3  3  3  2  2
   G  4  3  4  3  2

No approximate matching with k = 1 found.

Page 51: A Hybrid Indexing Method for  Approximate String Matching

P = CAAG, k = 1. Window T[15..17] = CAG (size m-k = 3):

         C  A  G
      0  1  2  3
   C  1  0  1  2
   A  2  1  0  1
   A  3  2  1  1
   G  4  3  2  1

CAG is found.

Page 52: A Hybrid Indexing Method for  Approximate String Matching

Time complexity

• The preprocessing time complexity of constructing the automata and the suffix tree of T is O(|N|·m) and O(n) respectively, where |N| is the number of states in an NFA and m is the length of the pattern.

• The search time obtained using the partitioning scheme is $O(n^{\lambda} \log n)$, where λ < 1 when the tolerated error level α = k/m satisfies $\alpha < 1 - e/\sqrt{\sigma}$, where e = 2.718… .
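To get a feeling for how restrictive the condition on α is, the bound can be evaluated for a few alphabet sizes. The sketch assumes the bound reads α < 1 − e/√σ, as stated in the abstract of the paper; the function name `max_error_level` is hypothetical.

```python
import math


def max_error_level(sigma: int) -> float:
    """Upper bound 1 - e / sqrt(sigma) on the error level alpha = k / m."""
    return 1 - math.e / math.sqrt(sigma)


if __name__ == "__main__":
    for sigma in (3, 4, 26, 256):
        print(sigma, round(max_error_level(sigma), 4))
```

For very small alphabets, such as σ = 3 in the example above or σ = 4 for DNA, the bound is negative, so this particular guarantee of sublinear search time does not apply there.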


Page 57: A Hybrid Indexing Method for  Approximate String Matching


Thank you