string matching with mismatches

52
String Matching with Mismatches Some slides are stolen from Moshe Lewenstein (Bar Ilan University)

Upload: xena

Post on 21-Jan-2016

62 views

Category:

Documents


1 download

DESCRIPTION

String Matching with Mismatches. Some slides are stolen from Moshe Lewenstein (Bar Ilan University). String Matching with Mismatches. Landau – Vishkin 1986 Galil – Giancarlo 1986 Abrahamson 1987 - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: String  Matching with Mismatches

String Matching with Mismatches

Some slides are stolen from Moshe Lewenstein (Bar Ilan University)

Page 2: String  Matching with Mismatches

String Matching with Mismatches

Landau – Vishkin 1986

Galil – Giancarlo 1986

Abrahamson 1987

Amir - Lewenstein - Porat 2000

Page 3: String  Matching with Mismatches

Approximate String Matching

problem: Find all text locations where distance from pattern is sufficiently small.

distance metric: HAMMING DISTANCE

Let S = s1s2…sm

R = r1r2…rm

Ham(S,R) = The number of locations j where sj rj

Example: S = ABCABC R = ABBAAC

Ham(S,R) = 2

Page 4: String  Matching with Mismatches

Example:

P = A B B A A C T = A B C A A B C A C…

Input: T = t1 . . . tn

P = p1 … pm

Output: For each i in T Ham(P, titi+1…ti+m-1)

Problem 1: Counting mismatches

Page 5: String  Matching with Mismatches

Example:

P = A B B A A C T = A B C A A B C A C… 2

Ham(P,T1) = 2

Input: T = t1 . . . tn

P = p1 … pm

Output: For each i in T Ham(P, titi+1…ti+m-1)

Counting mismatches

Page 6: String  Matching with Mismatches

Example:

P = A B B A A C T = A B C A A B C A C… 2, 4

Ham(P,T2) = 4

Input: T = t1 . . . tn

P = p1 … pm

Output: For each i in T Ham(P, titi+1…ti+m-1)

Counting mismatches

Page 7: String  Matching with Mismatches

Input: T = t1 . . . tn

P = p1 … pm

Output: For each i in T Ham(P, titi+1…ti+m-1)

Example:

P = A B B A A C T = A B C A A B C A C… 2, 4, 6

Ham(P,T3) = 6

Counting mismatches

Page 8: String  Matching with Mismatches

Example:

P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2

Ham(P,T4) = 2

Input: T = t1 . . . tn

P = p1 … pm

Output: For each i in T Ham(P, titi+1…ti+m-1)

Counting mismatches

Page 9: String  Matching with Mismatches

Example:

P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2, …

Input: T = t1 . . . tn

P = p1 … pm

Output: For each i in T Ham(P, titi+1…ti+m-1)

Counting mismatches

Page 10: String  Matching with Mismatches

Example: k = 2

P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2, …

Problem 2: k-mismatches

Input: T = t1 . . . tn

P = p1 … pm

Output: Every i where Ham(P, titi+1…ti+m-1) ≤ k

Page 11: String  Matching with Mismatches

Example: k = 2

P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2, … 1, 0, 0, 1,

Problem 2: k-mismatches

Input: T = t1 . . . tn

P = p1 … pm

Output: Every i where Ham(P, titi+1…ti+m-1) ≤ kh

Page 12: String  Matching with Mismatches

Naïve Algorithm(for counting mismatches or k-mismatches problem)

Running Time: O(nm) n = |T|, m = |P|

- Goto each location of text and compute hamming distance of P and Ti

Page 13: String  Matching with Mismatches

The Kangaroo Method(for k-mismatches)

Landau – Vishkin 1986

Galil – Giancarlo 1986

Page 14: String  Matching with Mismatches

The Kangaroo Method(for k-mismatches)

-Create suffix tree (+ lca) for: s = P#T

-Check P at each location i of T by kangrooing

Example:

P = A B A B A A B A C A BT = A B B A C A B A B A B C A B B C A B C A … i

Page 15: String  Matching with Mismatches

The Kangaroo Method(for k-mismatches)

- Create suffix tree for: s = P#T

-Check P at each location i of T by kangrooing

Example:

P = A B A B A A B A C A BT = A B B A C A B A B A B C A B B C A B C A … i

Page 16: String  Matching with Mismatches

The Kangaroo Method(for k-mismatches)

- Create suffix tree for: s = P#T

-Check P at each location i of T by kangrooing

Example:

P = A B A B A A B A C A BT = A B B A C A B A B A B C A B B C A B C A … i

Page 17: String  Matching with Mismatches

The Kangaroo Method(for k-mismatches)

- Create suffix tree for: s = P#T

-Check P at each location i of T by kangrooing

Example:

P = A B A B A A B A C A BT = A B B A C A B A B A B C A B B C A B C A … i

Page 18: String  Matching with Mismatches

The Kangaroo Method(for k-mismatches)

- Create suffix tree for: s = P#T

-Check P at each location i of T by kangrooing

Example:

P = A B A B A A B A C A BT = A B B A C A B A B A B C A B B C A B C A … i

Page 19: String  Matching with Mismatches

The Kangaroo Method(for k-mismatches)

- Create suffix tree for: s = P#T

-Check P at each location i of T by kangrooing

Example:

P = A B A B A A B A C A BT = A B B A C A B A B A B C A B B C A B C A … i

Page 20: String  Matching with Mismatches

The Kangaroo Method(for k-mismatches)

- Create suffix tree for: s = P#T

-Check P at each location i of T by kangrooing

Example:

P = A B A B A A B A C A BT = A B B A C A B A B A B C A B B C A B C A … i

Page 21: String  Matching with Mismatches

The Kangaroo Method(for k-mismatches)

- Create suffix tree for: s = P#T

- Do up to k LCP queries for every text location

Example:

P = A B A B A A B A C A BT = A B B A C A B A B A B C A B B C A B C A … i

Page 22: String  Matching with Mismatches

The Kangaroo Method(for k-mismatches)

Preprocess:

Build suffix tree of both P and T - O(n+m) timeLCA preprocessing - O(n+m) time

Check P at given text location

Kangroo jump till next mismatch - O(k) time

Overall time: O(nk)

Page 23: String  Matching with Mismatches

How do we do counting in less than O(nm) ?

Page 24: String  Matching with Mismatches

Lets start with binary strings

0 1 0 1 1 0 1 1P =

0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 ......T =

If we pad the pattern and the text with zeros this is like a convolution of two vectors of length m+n

We can do this using FFT in O(nlog(n)) time

Page 25: String  Matching with Mismatches

a b a c c a c b a c a b a c c P =

a b b b c c c a a a a b a c b ......T =

And if the strings are not binary ?

Page 26: String  Matching with Mismatches

a b a c c a c b a c a b a c c P =

a b a c c a c b a c a b a c c a-mask

a b b b c c c a a a a b a c b ......T =

Page 27: String  Matching with Mismatches

a b a c c a c b a c a b a c c P =

a b a c c a c b a c a b a c c a-mask

Pa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

a b b b c c c a a a a b a c b ......T =

Page 28: String  Matching with Mismatches

a b a c c a c b a c a b a c c P =

Pa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

a b b b c c c a a a a b a c b ......T =

Page 29: String  Matching with Mismatches

a b a c c a c b a c a b a c c P =

Pa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

a b b b c c c a a a a b a c b ......T =

a b b b c c c a a a a b a c b ......not-amask

Page 30: String  Matching with Mismatches

a b a c c a c b a c a b a c c P =

Pa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

a b b b c c c a a a a b a c b ......T =

a b b b c c c a a a a b a c b ......not-amask

0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 Tnot a

Page 31: String  Matching with Mismatches

a b a c c a c b a c a b a c c P =

Pa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

a b b b c c c a a a a b a c b ......T =

0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 Tnot a

Page 32: String  Matching with Mismatches

a b a c c a c b a c a b a c c P =

Pa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

a b b b c c c a a a a b a c b ......T =

0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 Tnot a

Multiply Pa and T(not a) to count mismatches using “a”

Pa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 Tnot a ... ...

Page 33: String  Matching with Mismatches

a b a c c a c b a c a b a c c P =

Pa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

a b b b c c c a a a a b a c b ......T =

0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 Tnot a

Multiply Pa and Tnot a to count mismatches using a

Pa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 Tnot a ... ...

Page 34: String  Matching with Mismatches

Boolean Convolutions (FFT) Method

Page 35: String  Matching with Mismatches

Running Time: One boolean convolution - O(n log m) time

# of matches of all symbols - O(n| | log m) timeΣ

Boolean Convolutions (FFT) Method

Page 36: String  Matching with Mismatches

How do we do counting in less than O(nm) ?

Lets count matches rather than mismatches

Page 37: String  Matching with Mismatches

a b c d e b b d

b g d e f h d c c a b g h h ...

...

...

...

P =

T =

counter

increment

For each character you have a list of offsets where it occurs in the pattern,

When you see the char in the text, you increment the appropriate counters.

Page 38: String  Matching with Mismatches

a b c d e b b d

b g d e f h d c c a b g h h ...

...

...

...

P =

T =

counter

increment

For each character you have a list of offsets where it occurs in the pattern,

When you see the char in the text, you increment the appropriate counters.

Page 39: String  Matching with Mismatches

a b c d e b b d

b g d e f h d c c a b g h h ...

...

...

...

P =

T =

counter

increment

This is fast if all characters are “rare”

Page 40: String  Matching with Mismatches

Partition the characters into rare and frequent

Rare: occurs ≤ c times in the pattern

For rare characters run this scheme with the counters

Takes O(nc) time

Page 41: String  Matching with Mismatches

Frequest chars

You have at most m/c of them

Do a convolution for each

Total cost O(m/c n log(n)).

Page 42: String  Matching with Mismatches

2

log( )

log( )

log( )

mcn n n

c

c m n

c m n

( log( ))O n m n

Fix c

Complexity:

Page 43: String  Matching with Mismatches

Frequent Symbol: a symbol that appears at least times in P.k2

Back to the k-mismatch problem

Want to beat the O(nk) kangaroo bound

Page 44: String  Matching with Mismatches

Few (≤√k) frequent symbols

Do the counters scheme for non-frequent

Convolve for each frequent O(n log n)k

O(n )k

Page 45: String  Matching with Mismatches

(≥√k) frequent symbols

Intuition: There cannot be too many places where we match

Page 46: String  Matching with Mismatches

(≥√k) frequent symbols

- Consider frequent symbols.

- For each of them consider the first appearances.

k2

k

Do the counters scheme just for these symbols and occurrences

Page 47: String  Matching with Mismatches

k = 4, = 4k2

a b a c c a c b a c a b a c c P =

a b a c c a c b a c a b a c c

a b a c c a c b a c a b a c c

a-mask

c-mask

a b b b c c c a a a a b a c b ......

T =

use a-mask

Page 48: String  Matching with Mismatches

Example of Masked Countingk = 4, = 4k2

a b a c c a c b a c a b a c c P =

a b a c c a c b a c a b a c c

a b a c c a c b a c a b a c c

a-mask

c-mask

a b b b c c c a a a a b a c b ......

T = a b a c c a c b a c a b a c c

d

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...... 1counter

Page 49: String  Matching with Mismatches

Example of Masked Countingk = 4, = 4k2

a b a c c a c b a c a b a c c P =

a b a c c a c b a c a b a c c

a b a c c a c b a c a b a c c

a-mask

c-mask

a b b b c c c a a a a b a c b ......

T = a b a c c a c b a c a b a c c

d

0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ...... 1counter

Page 50: String  Matching with Mismatches

Counting Stage:

Run through text and count occurrencesof all marks.

Time: O(n ).k

For location i of T, if counteri < k then no match at location i.

Why? The total # of elements in all masks is 2 = 2k.

Important Observations:

1) Sum of all counters 2 n 2) Every counter whose value is less than k already has more than k errors.

k k

k

Page 51: String  Matching with Mismatches

How many locations remain?

Sum of all counters: 2n

Value of potential matches > k

k

kn

kkn 22

The Kangaroo Method.

How do we check these locations?

Use

Kangaroo Method Time: O(k) per location

Overall Time: O( ) = O( )kkn

kn

# of potential matches:

Page 52: String  Matching with Mismatches

Additional Points

Can reduce to

O( n )kk log