a fast algorithm for approximate string matching on gene sequences

04/19/23 1

A fast algorithm for approximate string matching on gene sequences

Zheng Liu, Xin Chen, James Borneman and Tao Jiang

University of California, Riverside


Outline

Background and motivation

Idea and analysis for FAAST

Experimental results

Conclusion


Background

Approximate string matching

pattern: P = p1p2…pm

text: T = t1t2…tn

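Two variants: k-mismatch (at most k substituted characters) and k-difference (at most k insertions, deletions, or substitutions).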

Applications: text processing and gene sequence analysis.
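To make the k-mismatch problem concrete, here is a minimal brute-force sketch in Python. The function names and the short test text are illustrative assumptions, not taken from the slides; the sketch only fixes the definition (report every alignment of P against T with at most k mismatching characters) and is not the FAAST algorithm.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(ca != cb for ca, cb in zip(a, b))


def k_mismatch_naive(text: str, pattern: str, k: int) -> list:
    """Start indices i where text[i:i+m] differs from pattern in at most k places."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if hamming(text[i:i + m], pattern) <= k]


# Hypothetical text containing the primer AAGTC and two one-mismatch variants.
print(k_mismatch_naive("AAGTCACGTCAAGTT", "AAGTC", 1))  # [0, 5, 10]
```

This baseline compares all m characters at each of the n-m+1 alignments, which is the cost that shift-based methods try to avoid.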


Motivation of FAAST

Motivation: Gene sequence acquisition

Modeled as the k-mismatch problem

Primers: AAGTC CCGTA

AAGTC………CCGTATACTT………CCGTT

…ACGTC………GCGTA

AAGTC………CCGTA…

ACGTC………GCGTA


Algorithms for the k-mismatch problem

1992, Shift-Add by Baeza-Yates and Gonnet.

1996, BM (Boyer-Moore) with Shift-Add by El-Mabrouk and Crochemore.

1993, BM extension (bad-character rule) by Tarhio-Ukkonen.

1994, BM extension (good-suffix rule) by Baeza-Yates and Gonnet.


Further generalization on Tarhio-Ukkonen algorithm.

tj-m+1 tj-m+2 …… tj-k … tj-2 tj-1 tj

p1 p2 …… pm-k … pm-2 pm-1 pm -- check last k+1

tj-m+1 tj-m+2 … tj-k-x+1 … tj-k … tj-2 tj-1 tj

p1 p2 … pm-k-x+1 … pm-k … pm-2 pm-1 pm -- check last k+x


Algorithm outline

T: AACTGTTAACTTGCGACTAG (k=2, x=2)

P: AAATCGTAAC

AAATCGTAAC Χ AAATCGTAAC Χ

……… Χ AAATCGTAAC ☺ -after first shift


An example

k=2, x=3, m=10, n=20

T: AACTGTTAACTTGCGACTAG

P: AAATCGTAAC

Shift 1 by Tarhio-Ukkonen:

T: AACTGTTAACTTGCGACTAG
P:  AAATCGTAAC

Shift 6 by FAAST:

T: AACTGTTAACTTGCGACTAG
P:       AAATCGTAAC
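The two shifts in this example can be reproduced with a short Python sketch. It assumes the shift rule stated on the construction slides: take the smallest l for which the last k+x aligned text characters still have at least min(x, m-k-l) matches against P shifted by l. The function name is hypothetical, and treating x = 1 as the Tarhio-Ukkonen case follows the slides' "check last k+1" description rather than that paper's own formulation.

```python
def shift_distance(pattern: str, window: str, k: int) -> int:
    """Smallest shift l >= 1 that keeps enough matches inside the window.

    window holds the last k+x text characters under the current alignment.
    """
    m, w = len(pattern), len(window)
    x = w - k
    for l in range(1, m - k + 1):
        matches = 0
        for c in range(w):
            i = m - w + c - l  # pattern index facing window[c] after shifting by l
            if i >= 0 and pattern[i] == window[c]:
                matches += 1
        if matches >= min(x, m - k - l):
            return l
    return m - k  # not reached: l = m-k needs zero matches


T = "AACTGTTAACTTGCGACTAG"
P = "AAATCGTAAC"
k, m = 2, len(P)
print(shift_distance(P, T[m - (k + 1):m], k))  # x = 1, window "AAC":   1
print(shift_distance(P, T[m - (k + 3):m], k))  # x = 3, window "TTAAC": 6
```

Looking at more characters (larger x) makes it harder for a small shift to qualify, which is where the longer jump comes from.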


Construction of shift table

Heuristic: guarantee that the last k+x aligned text characters (or the last y, if y ≤ k+x) have at least x matches (or y-k, if y ≤ k+x).

T: AACTGTTAACTTGCGACTA [k=2, x=3]

P: AAGTCGTAAC

…. AAGTCGTAAC


Construction details

Vkx[tj-k-x+1…tj, l]: marks the characters of tj-k-x+1…tj that match P after P is shifted by l.

dkx[tj-k-x+1…tj]: stores the minimum distance l such that Vkx[tj-k-x+1…tj, l] contains at least min[x, m-k-l] marked characters.


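Construction details (cont'd)

P: AAGTCGTAAC (k=2, x=3, l=[1..8]), Vkx[tj-k-x+1…tj, l] and dkx[tj-k-x+1…tj]

| tj-k-x+1…tj | l=1 | l=2 | l=3 | l=4 | l=5 | l=6 | l=7 | l=8 | dkx |
|---|---|---|---|---|---|---|---|---|---|
| AAAAA | 0,1 | 0 | | 4 | 3,4 | 2,3 | 1,2 | 0,1 | 6 |
| GCGAC | 1 | 2,3 | 4 | | 0,2 | | 1 | 1 | 7 |
| GTCGT | | | 0,1,2,3,4 | | | | | | |
| TTAAC | 1 | 4 | 3 | | 0 | 2 | 1,2 | 1 | 7 |
| TTTTT | 2 | 1,4 | 0,3 | 2 | 1 | 0 | | | 8 |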
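The table can be recomputed from the two definitions on the previous slide. The Python sketch below is one possible reading, with hypothetical function names: positions inside the window are numbered from the right (0 is tj), which is the convention the table entries appear to follow, and the shift table is filled for every (k+x)-gram over an assumed four-letter DNA alphabet.

```python
from itertools import product


def match_positions(pattern, window, l):
    """V_kx[window, l]: window positions (0 = rightmost) matching P shifted by l."""
    m, w = len(pattern), len(window)
    hits = []
    for c in range(w):
        i = m - w + c - l  # pattern index facing window[c] after the shift
        if i >= 0 and pattern[i] == window[c]:
            hits.append(w - 1 - c)  # renumber from the right end
    return sorted(hits)


def shift_distance(pattern, window, k):
    """d_kx[window]: smallest l with at least min(x, m-k-l) marked positions."""
    m, x = len(pattern), len(window) - k
    for l in range(1, m - k + 1):
        if len(match_positions(pattern, window, l)) >= min(x, m - k - l):
            return l
    return m - k  # not reached: l = m-k needs zero matches


def build_shift_table(pattern, k, x, alphabet="ACGT"):
    """d_kx for every (k+x)-gram; the table has len(alphabet)**(k+x) entries."""
    return {"".join(g): shift_distance(pattern, "".join(g), k)
            for g in product(alphabet, repeat=k + x)}


P = "AAGTCGTAAC"
table = build_shift_table(P, k=2, x=3)
for window in ("AAAAA", "GCGAC", "GTCGT", "TTAAC", "TTTTT"):
    row = [match_positions(P, window, l) for l in range(1, 9)]
    print(window, row, table[window])
```

Under this reading the GTCGT row, which the slide leaves without a dkx entry, gets a shift of 3 because all five positions match at l=3. The table size grows as the alphabet size raised to the power k+x, which fits the closing remark that time and memory increase with x and the alphabet size.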


Theoretical support

Correctness of FAAST

We use the random string assumption

Average shift distance

Total number of character comparisons


Correctness of FAAST

Theorem 1. When P is aligned with tj-k-x+1…tj, we can always shift P to the right by dkx[tj-k-x+1…tj] without missing any approximate occurrence of P.

tj-m+1 tj-m+2 … tj-k-x+1 … tj-2 tj-1 tj

p1 p2 … pi-k-x+1 … pi-2 pi-1 pi …… pm -- current

p1 p2 … pi-k-x+1 … pi'-k-x+1 … pi'-2 pi'-1 pi' ... pm -- (i < i')
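The theorem can be checked empirically with a small search loop. The Python sketch below is an illustration under assumptions, not the authors' implementation: shifts are computed lazily per window and cached instead of read from a precomputed table, each alignment is verified right to left until more than k mismatches are seen, and the result is compared with a brute-force scan. All names are hypothetical.

```python
import random
from functools import lru_cache


def brute_force(text, pattern, k):
    """Reference answer: every start index with at most k mismatches."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if sum(a != b for a, b in zip(text[i:i + m], pattern)) <= k]


def faast_search(text, pattern, k, x):
    """k-mismatch search that shifts by d_kx of the last k+x aligned characters."""
    m, n, w = len(pattern), len(text), k + x
    assert x >= 1 and m >= w, "sketch assumes 1 <= x and k + x <= m"

    @lru_cache(maxsize=None)
    def shift(window):
        for l in range(1, m - k + 1):
            matches = sum(1 for c in range(w)
                          if m - w + c - l >= 0
                          and pattern[m - w + c - l] == window[c])
            if matches >= min(x, m - k - l):
                return l
        return m - k

    hits, j = [], m - 1  # j is the text index aligned with the last pattern char
    while j < n:
        start, mismatches = j - m + 1, 0
        for i in range(m - 1, -1, -1):  # compare right to left
            if pattern[i] != text[start + i]:
                mismatches += 1
                if mismatches > k:
                    break
        if mismatches <= k:
            hits.append(start)
        j += shift(text[j - w + 1:j + 1])
    return hits


random.seed(0)
text = "".join(random.choice("ACGT") for _ in range(2000))
pattern = text[700:712]
print(faast_search(text, pattern, k=2, x=3) == brute_force(text, pattern, k=2))
```

The comparison passes for any x because a skipped alignment at shift l would need fewer than min(x, m-k-l) matches inside the window, which already forces more than k mismatches.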


Average shift distance

Lemma 1. The probability P_{kx} that the last k+x characters of T have at least x matches is:

$$P_{kx} = 1 - \sum_{i=0}^{x-1} \binom{k+x}{i} (1-p)^{k+x-i} p^{i}$$

Theorem 1. The average shift distance of FAAST is:

$$E^{d}_{kx} = \sum_{s=0}^{\infty} s\,(1-P_{kx})^{s-1} P_{kx} = \frac{1}{P_{kx}}$$
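The two formulas are easy to evaluate numerically. The Python sketch below assumes that p is the probability that two random characters match, which is how the binomial term in Lemma 1 reads, and uses p = 1/4 for a uniform DNA alphabet. That value and the function names are assumptions for illustration.

```python
from math import comb


def prob_enough_matches(k: int, x: int, p: float) -> float:
    """P_kx: probability that k+x random characters contain at least x matches."""
    fewer = sum(comb(k + x, i) * (1 - p) ** (k + x - i) * p ** i
                for i in range(x))
    return 1.0 - fewer


def expected_shift(k: int, x: int, p: float) -> float:
    """E^d_kx = 1 / P_kx, the mean of the geometric shift model."""
    return 1.0 / prob_enough_matches(k, x, p)


for x in range(1, 8):
    print(x, round(expected_shift(k=3, x=x, p=0.25), 2))
```

The closed form lets the shift grow without bound, while the shift table never holds a value above m-k, so the model will overestimate the average shift once 1/P_{kx} approaches the pattern length.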


Average shift distance under different x.


Total character comparisons

Lemma 2. The expected number of comparisons between two shifts is:

$$E^{c}_{kx} = \frac{k+x}{1-p}$$

Theorem 2. The expected total number of comparisons for a text of length n is:

$$TE^{c}_{kx} = n P_{kx}\,\frac{k+x}{1-p}$$
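One way to read Theorem 2 is as the number of alignments times the cost per alignment. The derivation below is a sketch under that reading: alignments are assumed to be spaced by the average shift distance 1/P_{kx}, and the numeric instance assumes p = 1/4 with n = 2 × 10^6, k = 3, x = 1. The choice of p is an assumption, not a value stated on the slides.

```latex
\begin{align*}
\text{number of alignments} &\approx \frac{n}{E^{d}_{kx}} = n\,P_{kx} \\
TE^{c}_{kx} &\approx \frac{n}{E^{d}_{kx}} \cdot E^{c}_{kx}
             = n\,P_{kx}\,\frac{k+x}{1-p} \\[4pt]
\text{With } k=3,\ x=1,\ p=\tfrac14:\quad
P_{kx} &= 1 - \left(\tfrac34\right)^{4} = \tfrac{175}{256}, \qquad
\frac{k+x}{1-p} = \tfrac{16}{3} \\
TE^{c}_{kx} &\approx 2\times10^{6}\cdot\tfrac{175}{256}\cdot\tfrac{16}{3}
             = 2\times10^{6}\cdot\tfrac{175}{48} \approx 7.3\times10^{6}
\end{align*}
```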


Total character comparisons


Difference of total character comparisons under different x


Experimental result

A PC with a 2.8 GHz CPU and 1 GB of memory

Simulated random string testing

Real DNA gene sequence data


Result on simulated sequences

Text: 2M bases sequence, Pattern: 39 bases, k=3.

| x | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Avg. shift dist. | 1.41 | 2.76 | 5.59 | 16.38 | 31.31 | 37.37 | 38.87 |
| Total comp. | 6.70 | 3.68 | 1.86 | 0.65 | 0.34 | 0.28 | 0.27 |
| Running time (sec.) | 210.2 | 114.4 | 58.1 | 20.6 | 11.2 | 10.8 | 16.7 |
| Prepro. time (sec.) | 0.01 | 0.01 | 0.03 | 0.08 | 0.36 | 1.58 | 6.90 |


Result on real sequences

Text: 150 bacteria DNA sequences, k=3

| x | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Running time (sec.) | 18.87 | 13.05 | 7.74 | 3.84 | 2.63 | 3.21 | 8.55 |
| Prepro. time (sec.) | 0.01 | 0.01 | 0.02 | 0.09 | 0.35 | 1.57 | 6.96 |
| Matching time (sec.) | 18.77 | 13.04 | 7.72 | 3.75 | 2.28 | 1.64 | 1.59 |

Text: 150 fungi DNA sequences, k=3

| x | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Running time (sec.) | 16.45 | 11.43 | 9.24 | 6.78 | 5.62 | 8.24 | 26.48 |
| Prepro. time (sec.) | 0.02 | 0.03 | 0.08 | 0.32 | 1.34 | 5.77 | 23.86 |
| Matching time (sec.) | 16.43 | 11.40 | 9.16 | 6.46 | 4.28 | 2.47 | 2.62 |


Conclusion

A competitive algorithm for the k-mismatch problem on gene sequences.

Time and memory increase with larger x and alphabet size.
