
Improved Single and Multiple Approximate String Matching

Kimmo Fredriksson

Department of Computer Science, University of Joensuu, Finland

Gonzalo Navarro

Department of Computer Science, University of Chile

CPM’04 – p.1/26

The Problem Setting & Complexity

- Given a text T = t_1 t_2 ... t_n and a pattern P = p_1 p_2 ... p_m over some finite alphabet Σ of size σ, find the approximate occurrences of P in T, allowing at most k differences (edit operations).
- Exact matching (single pattern) lower bound: Ω(n log_σ(m)/m) character comparisons (Yao, 79).
- Approximate matching lower bound: Ω(n(k + log_σ(m))/m) (Chang & Marr, 94).
- We will search simultaneously a set P = {P^1, P^2, ..., P^r} of r patterns.
- Ω(n(k + log_σ(rm))/m) lower bound for r patterns (Fredriksson & Navarro, 2003).
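
As a concrete illustration (the numbers are mine, chosen to match the experimental setup later in the talk): for DNA, σ = 4, m = 64, r = 16 and k = 10, we have log_σ(rm) = log_4(1024) = 5, so the multipattern lower bound is Ω(n(10 + 5)/64) ≈ Ω(0.23 n) character inspections, i.e. only a fraction of the text needs to be examined.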

CPM’04 – p.2/26

Previous work

- Only a few algorithms exist for multipattern approximate searching under the k differences model.
- Naïve approach: search the r patterns separately, using any of the single pattern search algorithms.

- (Muth & Manber, 1996): O(n) average time algorithm using O(m² r) space. The algorithm is based on hashing, and works only for k = 1.

- (Baeza-Yates & Navarro, 1997):
  - Partitioning into exact search: O(n) on average (O(rm) preprocessing), but can be improved to O(n k log_σ(rm)/m). Works for α = k/m < 1/(3 log_σ(rm)).
  - Other less interesting ones.
CPM'04 – p.3/26

Previous work

- (Fredriksson & Navarro, 2003): The first average-optimal algorithm.
  - average-optimal: O(n(k + log_σ(rm))/m) up to error level α = 1/3 − O(1/√σ).
  - linear: O(n) on average up to error level α = 1/2 − O(1/√σ).

- (Hyyrö, Fredriksson & Navarro, 2004): O(n ⌈k(m − k) r / w⌉) worst case time for short patterns, where w is the number of bits in the machine word.

CPM’04 – p.4/26

This work

- We have improved the (optimal) algorithm of (Fredriksson & Navarro, 2003):
  - Faster in practice, and...
  - ...allows error levels up to α = 1/2 − O(1/√σ).
- Our algorithm runs in O(n(k + log_σ(rm))/m) average time, which is optimal.

- Preprocessing time is O(σ^ℓ ℓ m r), and the algorithm needs O(σ^ℓ r) space, where ℓ = O(log_σ(rm)).

- The fastest algorithm in practice for low and intermediate difference ratios.

CPM’04 – p.5/26

The method in brief:

- The algorithm is based on the preprocessing / filtering / verification paradigm.
- The preprocessing phase generates all σ^ℓ strings of length ℓ, and computes their minimum distance over the set of patterns.
- The filtering phase searches (approximately) the text ℓ-grams in the patterns, using the precomputed distance table, accumulating the differences.
- The verification phase uses a dynamic programming algorithm, and is applied to each pattern separately.

CPM’04 – p.6/26

Preprocessing

- Build a table D as follows:
  1. Choose a number ℓ in the range 1 ≤ ℓ ≤ m − k.
  2. For every string S of length ℓ (ℓ-gram), search for S inside each pattern of P = {P^1, ..., P^r}.
  3. Store in D[S] the smallest number of differences needed to match S inside any of the patterns (a number between 0 and ℓ).

- D requires space for σ^ℓ entries and can be computed in O(σ^ℓ ℓ m r) time.
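
A minimal C sketch of this preprocessing step (my own illustration, not the authors' code; the names build_D and best_match, and the ranking of ℓ-grams as base-σ numbers, are assumptions):

/* Build D[c] = smallest number of differences needed to match the l-gram of
   rank c inside any of the r patterns.  Characters are assumed to be already
   mapped to codes 0..sigma-1. */
#include <stdlib.h>

/* smallest number of edit operations needed to match s[0..l-1] inside p[0..m-1] */
static int best_match(const unsigned char *s, int l, const unsigned char *p, int m)
{
    int *prev = malloc((m + 1) * sizeof *prev);
    int *cur  = malloc((m + 1) * sizeof *cur);
    for (int j = 0; j <= m; j++) prev[j] = 0;     /* the match may start anywhere in p */
    for (int i = 1; i <= l; i++) {
        cur[0] = i;                               /* s[0..i-1] aligned to nothing */
        for (int j = 1; j <= m; j++) {
            int sub = prev[j-1] + (s[i-1] != p[j-1]);
            int del = prev[j] + 1, ins = cur[j-1] + 1;
            int best = sub < del ? sub : del;
            cur[j] = best < ins ? best : ins;
        }
        int *t = prev; prev = cur; cur = t;
    }
    int best = prev[0];                           /* the match may end anywhere in p */
    for (int j = 1; j <= m; j++) if (prev[j] < best) best = prev[j];
    free(prev); free(cur);
    return best;                                  /* a value between 0 and l */
}

/* D must have sigma^l entries; the l-gram s is ranked as sum s[i]*sigma^i. */
void build_D(int *D, int sigma, int l, unsigned char **P, int r, int m)
{
    long entries = 1;
    for (int i = 0; i < l; i++) entries *= sigma;
    unsigned char *s = malloc(l);
    for (long c = 0; c < entries; c++) {
        long x = c;
        for (int i = 0; i < l; i++) { s[i] = x % sigma; x /= sigma; }
        int best = l;
        for (int j = 0; j < r; j++) {
            int d = best_match(s, l, P[j], m);
            if (d < best) best = d;
        }
        D[c] = best;
    }
    free(s);
}

As written this costs O(σ^ℓ ℓ m r) time; sharing the dynamic programming work between ℓ-grams that have a common prefix (e.g. by enumerating them through a trie) reduces the preprocessing work.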

CPM’04 – p.7/26

Filtering

- Any occurrence is at least m − k characters long
  ⇒ use a sliding window of m − k characters over T.
- Invariant: all occurrences starting before the window are already reported.
- Read ℓ-grams S_1, S_2, ... from the text window, from right to left:

[Figure: a text window of m − k characters, with the ℓ-grams S_1 (rightmost), S_2, S_3 read from right to left.]

- Any occurrence starting at the beginning of the window must contain all the ℓ-grams read.
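
For instance (k chosen by me for illustration; m and ℓ match the experiments reported later): with m = 64, k = 10 and ℓ = 8 the window spans m − k = 54 characters, so ⌊54/8⌋ = 6 non-overlapping 8-grams S_1, ..., S_6 can be read from it, and an occurrence starting at the window beginning must contain every one of them.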

CPM’04 – p.8/26

Filtering

- Accumulate a sum of necessary differences: M = Σ_i D[S_i].
- If M becomes > k for some (i.e. the smallest) i, then no occurrence can contain the ℓ-grams S_i ... S_2 S_1
  ⇒ slide the window past the first character of S_i.
E.g. D[S_1] + D[S_2] > k:

[Figure: the old text window of m − k characters with S_3, S_2, S_1, and the new window position starting right after the first character of S_2.]
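
A minimal C sketch of the filtering loop (an illustration under my own naming: filter, rank_lgram and verify are assumed helpers; the table D and the ℓ-gram ranking are the ones sketched on the Preprocessing slide, and verify stands for the dynamic-programming check of the next slide):

/* Rank an l-gram the same way build_D enumerates them. */
static long rank_lgram(const unsigned char *s, int l, int sigma)
{
    long c = 0;
    for (int i = l - 1; i >= 0; i--) c = c * sigma + s[i];
    return c;
}

/* Defined elsewhere: dynamic-programming verification of the area starting
   at pos (repeated for every pattern in the multipattern case). */
void verify(const unsigned char *T, long n, long pos, int m, int k);

/* Slide a window of m-k characters over T[0..n-1], reading l-grams right to
   left and accumulating their precomputed minimum differences. */
void filter(const unsigned char *T, long n, const int *D,
            int l, int m, int k, int sigma)
{
    long pos = 0, wlen = m - k;
    while (pos + wlen <= n) {
        int M = 0;
        long i = pos + wlen;                  /* one past the window end */
        while (i - l >= pos) {
            i -= l;                           /* l-gram T[i..i+l-1] */
            M += D[rank_lgram(T + i, l, sigma)];
            if (M > k) break;                 /* S_i..S_1 cannot all fit in one occurrence */
        }
        if (M > k)
            pos = i + 1;                      /* slide past the first character of S_i */
        else {
            verify(T, n, pos, m, k);          /* window may contain an occurrence */
            pos++;                            /* conservative one-character advance */
        }
    }
}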

CPM’04 – p.9/26

Verification

- If M ≤ k, then the window might contain an occurrence
  ⇒ the occurrence can be m + k characters long, so verify the area T_{s ... s+m+k−1}, where s is the starting position of the window.

[Figure: the text window of m − k characters (S_3, S_2, S_1) inside the verification area of m + k characters.]

- The verification is done for each of the r patterns, using the standard dynamic programming algorithm.
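
A sketch of the verification for one pattern, using one standard O(m)-space, column-wise form of the dynamic programming; the function name and the reporting format are mine:

#include <stdio.h>

/* Scan the verification area T[pos .. pos+m+k-1] with the classical dynamic
   programming recurrence and report every position where the pattern P
   matches with at most k differences.  For r patterns this is simply
   repeated once per pattern. */
void verify_one(const unsigned char *T, long n, long pos,
                const unsigned char *P, int m, int k)
{
    long end = pos + m + k;
    if (end > n) end = n;
    int C[m + 1];                    /* C[i] = best distance of P[1..i] vs. a text suffix */
    for (int i = 0; i <= m; i++) C[i] = i;
    for (long j = pos; j < end; j++) {
        int diag = C[0];             /* C[0] stays 0: a match may start anywhere */
        for (int i = 1; i <= m; i++) {
            int up  = C[i];
            int sub = diag + (P[i-1] != T[j]);
            int del = up + 1, ins = C[i-1] + 1;
            int best = sub < del ? sub : del;
            C[i] = best < ins ? best : ins;
            diag = up;
        }
        if (C[m] <= k)
            printf("occurrence of P ending at text position %ld\n", j);
    }
}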

CPM’04 – p.10/26

Stricter matching condition

- Our basic algorithm: text ℓ-grams can match anywhere inside the patterns.
- If M > k, then we know that no occurrence can contain the ℓ-grams S_i ... S_1 in any position.
- The matching area can be made smaller without losing this property.

CPM’04 – p.11/26

Stricter matching condition

- Consider an approximate occurrence of S_2 S_1 inside the pattern.
  - S_2 cannot be closer than ℓ − k positions from the end of the pattern.
  - For S_2 precompute a table D_2, which considers its best match in the area P_{1 ... m−ℓ+k} rather than P_{1 ... m}.
- In general, for S_i preprocess a table D_i, using the area P_{1 ... m−(i−1)ℓ+k}.
- Compute M as M = Σ_i D_i[S_i].
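
A small illustration with numbers of my own choosing: take m = 64, ℓ = 8 and k = 4. Then D_1 is the basic table over the whole pattern P_{1 ... 64}, D_2 minimizes only over P_{1 ... m−ℓ+k} = P_{1 ... 60}, D_3 over P_{1 ... 52}, D_4 over P_{1 ... 44}, and so on. Since each D_i minimizes over a subrange of the area used by D, necessarily D_i[S] ≥ D[S], which is what the following slides exploit.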

CPM’04 – p.12/26

Stricter matching condition

[Figure: the text window with its ℓ-grams S_3, S_2, S_1 aligned against the pattern P, showing the pattern areas used for the tables D_1[S], D_2[S] and D_3[S].]

CPM’04 – p.13/26

Stricter matching condition

- D_i[S] ≥ D[S] for any S and i
  ⇒ the sum M accumulated after reading any number of ℓ-grams is never smaller than for the basic method
  ⇒ this variant never examines more ℓ-grams, verifies more windows, nor shifts less.
- Drawback: needs more space and preprocessing effort
  ⇒ can be slower in practice.
- The matching condition can be made even stricter:
  - Work less per window...
  - ...but the shift can be smaller.
CPM'04 – p.14/26

Analysis

- It can be shown that the basic algorithm has optimal average case complexity O(n(k + log_σ(rm))/m). This holds for α = k/m < 1/2 − O(1/√σ).

- The worst case complexity can be made O(r n ⌈m/w⌉) (filtering + verification).

- The preprocessing cost is O(σ^ℓ ℓ m r), and it requires O(σ^ℓ r) space.

- Since the algorithm with the stricter matching condition is never worse than the basic version, it is also optimal.

CPM’04 – p.15/26

Analysis

- For a single pattern our complexity is the same as the algorithm of Chang & Marr, i.e. O(n(k + log_σ(m))/m)...
- ...but our filter works up to α = 1/2 − O(1/√σ), whereas the filter of Chang & Marr works only up to α = 1/3 − O(1/√σ).

CPM’04 – p.16/26

Experimental results

- Implementation in C, compiled using icc 7.1 with full optimizations, run on a 2 GHz Pentium 4 with 512 MB RAM, running Linux 2.4.18.

- Experiments for alphabet sizes σ = 4 (DNA) and σ = 20 (proteins), both random and real texts.
- The texts were 64 MB long, and the patterns 64 characters.

- In the implementation we used several practical improvements described in (Fredriksson & Navarro, 2003):
  - Bit-parallel counters
  - Hierarchical / bit-parallel verification
CPM'04 – p.17/26

Experimental results

- We used ℓ = 8 for DNA, and ℓ = 4 for proteins.
  ⇒ the maximum values we can use in practice; otherwise the preprocessing cost becomes too high.

- Analytical results: the optimal ℓ = Θ(log_σ(rm)) is larger, both for DNA and for proteins (depending on r).
- Although our algorithms are fast, in practice they cannot cope with as high difference ratios as predicted by the analysis.

CPM’04 – p.18/26

Experimental results

- Comparison against:
  - CM: Our previous optimal filtering algorithm
  - LT: Our previous linear time filter
  - EXP: Partitioning into exact search
  - MM: Muth & Manber algorithm, works only for k = 1
  - ABNDM: Approximate BNDM algorithm, a single pattern approximate search algorithm extending the classical BDM
  - BPM: Bit-parallel Myers, currently the best non-filtering algorithm for single patterns

CPM’04 – p.19/26

Experimental results

- Comparison against Muth and Manber (k = 1). Times in seconds, for four increasing numbers of patterns r:

  DNA       MM    1.30  3.97  12.86  42.52
            Ours  0.08  0.12   0.21   0.54
  proteins  MM    1.17  1.19   1.26   2.33
            Ours  0.08  0.11   0.18   0.59

CPM'04 – p.20/26

Experimental results

- Random DNA:
[Plot: search time in seconds (0.1–1) as a function of k = 2 ... 16, for Ours (ℓ=6), Ours (ℓ=8), Ours strict, Ours strictest, CM, LT, EXP, BPM and ABNDM.]
CPM'04 – p.21/26

Experimental results

- Random DNA:
[Plot: search time in seconds (0.1–100) as a function of k = 2 ... 14, for Ours (ℓ=6), Ours (ℓ=8), Ours strict, Ours strictest, CM, LT, EXP, BPM and ABNDM.]
CPM'04 – p.22/26

Experimental results

- Random proteins:
[Plot: search time in seconds (0.1–10) as a function of k = 2 ... 16, for Ours, Ours stricter, Ours strictest, CM, LT, EXP, BPM and ABNDM.]
CPM'04 – p.23/26

Experimental results

- Random proteins:
[Plot: search time in seconds (0.1–100) as a function of k = 2 ... 16, for Ours, Ours stricter, Ours strictest, CM, LT, EXP, BPM and ABNDM.]
CPM'04 – p.24/26

Experimental results

- Areas where each algorithm performs best. From left to right: DNA (σ = 4) and proteins (σ = 20). Top row: random data; bottom row: real data.

[Four maps over the (k, r) plane, with k = 1 ... 16 and r = 1 ... 256, showing the regions where BPM, EXP and Ours are the fastest.]

CPM’04 – p.25/26

Conclusions

- Our new algorithm becomes the fastest for low k.
- The larger r, the smaller the k values that are tolerated.
- When applied to just one pattern, our algorithm becomes the fastest for low difference ratios.
- Our basic algorithm usually beats the extensions.
  - True only if we use the same parameter ℓ for both algorithms.
  - With limited memory we can use the stricter matching condition with a smaller ℓ, and beat the basic algorithm.
- Our algorithm would be favored on even longer texts (relative preprocessing cost decreases).

CPM’04 – p.26/26