Filter Algorithms for Approximate String Matching




Page 1: Filter Algorithms for Approximate String Matching

Filter Algorithms for Approximate String Matching

Stefan Burkhardt

Page 2: Filter Algorithms for Approximate String Matching

Outline

Motivation
Filter Algorithms
Gapped q-grams
Experimental Analysis

Page 3: Filter Algorithms for Approximate String Matching

Motivation

Computational Biology: EST clustering, assembly, genome comparison (e.g. Human/Mouse)

Information Retrieval: phonebooks, dictionaries, search engines

Many more...

Section outline (Problems and Motivation): Why? / Approximate String Matching / Edit and Hamming Distance

Page 4: Filter Algorithms for Approximate String Matching

The global approximate string matching problem

Given a pattern P, a target S, an error level k and a string distance d(x,y):

Find all substrings y of S with d(P, y) ≤ k

Example:
P = GAT
S = ACTGATAACGTTAGCCATGG

Page 5: Filter Algorithms for Approximate String Matching

The global approximate string matching problem

d(x,y) = Hamming distance: the k-mismatches problem

d(x,y) = edit distance: the k-differences problem

Example:
P = GAT
S = ACTGATAACGTTAGCCATGG
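The two problem variants can be illustrated with a minimal, unfiltered baseline. This is only a sketch: the function names are hypothetical, and the edit-distance scan uses the standard column-wise dynamic program (Sellers' recurrence), which is an assumption of this example rather than something the slides prescribe.

```python
def hamming(x, y):
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))


def k_mismatches(P, S, k):
    """Start positions j where S[j:j+|P|] is within Hamming distance k of P."""
    m = len(P)
    return [j for j in range(len(S) - m + 1) if hamming(P, S[j:j + m]) <= k]


def k_differences(P, S, k):
    """End positions (exclusive) of substrings of S within edit distance k of P.

    col[i] holds the smallest edit distance between P[:i] and any substring
    of S ending at the current text position.
    """
    m = len(P)
    col = list(range(m + 1))          # column for the empty text prefix
    ends = []
    for j, c in enumerate(S, 1):
        prev_diag, col[0] = col[0], 0  # a match may start anywhere in S
        for i in range(1, m + 1):
            cur = min(col[i] + 1,                   # skip c in S
                      col[i - 1] + 1,               # skip P[i-1]
                      prev_diag + (P[i - 1] != c))  # match or substitute
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j)
    return ends


P, S = "GAT", "ACTGATAACGTTAGCCATGG"
print(k_mismatches(P, S, 0))   # exact occurrences of P
print(k_mismatches(P, S, 1))   # occurrences with at most one mismatch
print(k_differences(P, S, 0))  # end positions of exact occurrences
```

Both scans touch every position of S, which is the cost the filter algorithms in the following slides try to avoid.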

Page 6: Filter Algorithms for Approximate String Matching

Filter Algorithms: How?

Input: P and S

1. Filter algorithm (filtration phase): apply the filter criterion to obtain potential matches
2. Exact algorithm (verification phase): examine the potential matches and separate false matches from true matches

Section outline (Filter Algorithms): How? / BLAST / The q-gram Lemma and QUASAR

Page 7: Filter Algorithms for Approximate String Matching

BLAST (Altschul, Karlin, et al.):

Sequential scan of S locates all q-grams matching P

Iterative extension with cutoff to find good matches

Problem for high similarity: sequential scan quite time consuming, single q-grams unspecific
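The two steps named on this slide (scan for matching q-grams, then extend with a cutoff) can be sketched as follows. This is a simplified illustration in the spirit of seed-and-extend, not BLAST's actual implementation: the scoring (+1 match, -1 mismatch), the ungapped extension, the `x_drop` cutoff parameter and all function names are assumptions made for this example.

```python
def extend(P, S, i, j, step, x_drop):
    """Ungapped extension from (i, j) in direction step (+1 or -1).

    Returns (columns kept, score gained) for the best-scoring extension;
    stops once the running score falls more than x_drop below the best.
    """
    score = best = best_len = n = 0
    while 0 <= i < len(P) and 0 <= j < len(S):
        score += 1 if P[i] == S[j] else -1
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > x_drop:
            break  # cutoff reached
        i += step
        j += step
    return best_len, best


def seed_and_extend(P, S, q, x_drop=2):
    """Return sorted (start in P, start in S, length, score) tuples."""
    seeds = {}
    for i in range(len(P) - q + 1):
        seeds.setdefault(P[i:i + q], []).append(i)
    found = set()
    for j in range(len(S) - q + 1):          # sequential scan of S
        for i in seeds.get(S[j:j + q], ()):  # every q-gram hit is a seed
            left_len, left_score = extend(P, S, i - 1, j - 1, -1, x_drop)
            right_len, right_score = extend(P, S, i + q, j + q, +1, x_drop)
            found.add((i - left_len, j - left_len,
                       left_len + q + right_len,
                       q + left_score + right_score))
    return sorted(found)


print(seed_and_extend("GATTACA", "CCGATTACACC", q=3))
```

In the usage line, all five seeds extend to the same alignment, so the set collapses them into a single result. The scan over S still visits every position, which is the cost the slide points out.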

Page 8: Filter Algorithms for Approximate String Matching

Indexed Filter Algorithm

1. Preprocess S into an index
2. Indexed filter algorithm: use P and the index to obtain potential matches
3. Exact algorithm (verification phase): examine the potential matches and separate false matches from true matches

Page 9: Filter Algorithms for Approximate String Matching

Indexed Filter Algorithm

Pro:
potentially faster evaluation of the filter criterion

Con:
preprocessing time
extra space required
only good for some filter criteria

Page 10: Filter Algorithms for Approximate String Matching

QUASAR (Burkhardt, Rivals et al. 99):

Filter Criterion: q-gram Lemma (Jokinen, Ukkonen 91)

Index Structure: Lookup table (Jokinen, Ukkonen 91) with suffix array (Manber, Myers 90)

Match Detection: overlapping rectangles in DP-Matrix
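To make the idea of a q-gram index concrete, here is a minimal sketch that maps every q-gram of S to its positions. It uses a plain Python dictionary as the lookup table and omits the suffix array entirely, so it is a stand-in for the concept and not QUASAR's index; the function names are hypothetical.

```python
from collections import defaultdict


def build_qgram_index(S, q):
    """Preprocessing step: map each q-gram of S to its start positions."""
    index = defaultdict(list)
    for j in range(len(S) - q + 1):
        index[S[j:j + q]].append(j)
    return index


def qgram_hits(P, index, q):
    """All pairs (i, j) with P[i:i+q] == S[j:j+q], found without scanning S."""
    return [(i, j)
            for i in range(len(P) - q + 1)
            for j in index.get(P[i:i + q], ())]


index = build_qgram_index("ACTGATAACGTTAGCCATGG", 3)
print(qgram_hits("GATA", index, 3))
```

The index is built once per target S and reused for every pattern, which is where the preprocessing time and extra space mentioned on the previous slide are spent.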

Page 11: Filter Algorithms for Approximate String Matching

The q-gram Lemma (Jokinen, Ukkonen, 1991)

For a pattern P, a substring y of S and a value k, matches between P and y with at most k errors share at least

t = |P| - q + 1 - kq

substrings of length q (q-grams).

Example: |P| = 8, q = 3, total number of q-grams: |P| - q + 1 = 6

T C G A T T A C
q-grams: TCG, CGA, GAT, ATT, TTA, TAC

Each error can 'destroy' q matching q-grams, so k errors lose at most kq q-grams.

T C G A A T A C
Here the single error in the fifth character destroys the three q-grams GAT, ATT and TTA; TCG, CGA and TAC still match.
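The lemma's threshold and the slide's example can be checked with a few lines of code. This is a small sketch with hypothetical function names; it counts shared q-grams as a multiset intersection, which is one common reading of "share".

```python
from collections import Counter


def qgrams(s, q):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(m, q, k):
    """t = |P| - q + 1 - kq from the q-gram lemma (may be <= 0: no filter)."""
    return m - q + 1 - k * q


def shared_qgrams(x, y, q):
    """Number of q-grams common to x and y, counted with multiplicity."""
    return sum((Counter(qgrams(x, q)) & Counter(qgrams(y, q))).values())


P, y = "TCGATTAC", "TCGAATAC"   # the slide's example: one substitution
print(qgram_threshold(len(P), 3, 1), shared_qgrams(P, y, 3))
```

For the example, the threshold and the number of shared q-grams are both 3, so the bound is tight here.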

Page 12: Filter Algorithms for Approximate String Matching

Match Detection (Jokinen, Ukkonen 91):

overlapping rectangles of width 2|P| in DP-Matrix

rectangle with at least t hits => potential match

[Figure: DP matrix of S and P with overlapping rectangles labelled 3 hits, 2 hits and 1 hit; t = 3.]
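The rectangle counting can be sketched as below. The assumptions are spelled out because the slide does not fix them: rectangles start at multiples of |P|, are 2|P| wide and therefore overlap by |P|; a hit is any pair of equal q-grams in P and S; and t comes from the q-gram lemma. The function name is hypothetical.

```python
def potential_match_blocks(P, S, q, k):
    """Start positions of width-2|P| rectangles of S with at least t hits."""
    m = len(P)
    t = m - q + 1 - k * q            # q-gram lemma; t <= 0 filters nothing
    pattern_grams = {}
    for i in range(m - q + 1):
        gram = P[i:i + q]
        pattern_grams[gram] = pattern_grams.get(gram, 0) + 1
    counts = {}
    for j in range(len(S) - q + 1):
        hits = pattern_grams.get(S[j:j + q], 0)
        if hits:
            b = j // m
            # Position j lies in rectangle b and in the overlapping rectangle b-1.
            for block in (b - 1, b):
                if block >= 0:
                    counts[block] = counts.get(block, 0) + hits
    return [b * m for b in sorted(counts) if counts[b] >= t]


S = "C" * 10 + "GATTACA" + "C" * 10
print(potential_match_blocks("GATTACA", S, q=3, k=1))
```

Only the returned rectangles would be handed to the verification phase; everything else in S is discarded by the filter.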

Page 13: Filter Algorithms for Approximate String Matching

Match Detection (Jokinen, Ukkonen 91):

overlapping rectangles of width 2|P| in DP-Matrix

rectangle with at least t hits => potential match

QUASAR (Burkhardt, Rivals et al. 1999):

wider rectangles efficient in practice (2048 for QUASAR)

Page 14: Filter Algorithms for Approximate String Matching

QUASAR (Burkhardt, Rivals et al. 1999):

BLAST for the verification of the potential matches

wider rectangles as match regions

Index is a combination of lookup table and suffix array

used for EST clustering at the DKFZ in Heidelberg

searches for EST clustering about 30 times faster than BLAST

Page 15: Filter Algorithms for Approximate String Matching

Gapped q-grams

A new (old?) idea
Hamming Distance
Finding good shapes

Page 16: Filter Algorithms for Approximate String Matching

Gapped q-grams: A new (old?) idea

General idea:
use gapped q-grams
call the arrangement of gaps the shape

Example: gapped 3-shape # # . # (# = match, . = don't care)

TCGATTAC
TC.A
 CG.T
  GA.T
   AT.A
    TT.C
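Extracting gapped q-grams for a given shape is mechanical, as this short sketch shows. The shape is written as a string of `#` (match) and `.` (don't care), following the slide's notation; the function name is hypothetical.

```python
def gapped_qgrams(s, shape):
    """Slide the shape over s; keep characters under '#', mask those under '.'."""
    span = len(shape)
    return ["".join(s[j + i] if shape[i] == "#" else "." for i in range(span))
            for j in range(len(s) - span + 1)]


print(gapped_qgrams("TCGATTAC", "##.#"))
```

Note that a shape of span 4 yields one q-gram fewer than a contiguous 3-gram on the same string, because fewer start positions fit.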

Page 17: Filter Algorithms for Approximate String Matching

Previous work:
Califano, Rigoutsos (1993)
Pevzner, Waterman (1995)
Lehtinen, Sutinen, Tarhio (1996)

limited attention paid to choice of shapes

no exact threshold for the general case given

Recently:
Buhler (2001): Multiple Shapes
Ma, Tromp, Li (2002): Pattern Hunter
threshold t = 1

Page 18: Filter Algorithms for Approximate String Matching

The Threshold t

Definition: t is the number of remaining q-grams in a worst-case placement of k errors

In the examples below, O marks a matching position and X an error.

classic 3-shape ###, k = 3: t = 0 (no filter!)

OOXOOXOOXOO
OOX OXO XOO OOX OXO XOO OOX OXO XOO

gapped 3-shape ##.#, k = 3: t = 1

OOOXXOOXOOO
OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O

Page 19: Filter Algorithms for Approximate String Matching

The Threshold t

Definition: t is the number of remaining q-grams in a worst-case placement of k errors

classic 3-shape ###, k = 3: t = 0 (no filter!)

gapped 3-shape ##.#, k = 3: t = 1

OOOXXOOXOOO
OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O

gapped shapes can have higher(!) thresholds t than ungapped shapes

no simple formula for t; we used a DP-based approach to compute t
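For small inputs, the threshold defined above can be computed by exhaustive search over all error placements. This brute-force sketch is explicitly not the DP-based approach the slide refers to; it only covers Hamming distance (substitutions at fixed positions), and the function name is hypothetical.

```python
from itertools import combinations


def threshold(shape, m, k):
    """Minimum number of surviving q-grams over all placements of k errors.

    shape: string of '#' (match) and '.' (don't care); m: pattern length.
    Enumerates all C(m, k) placements, so only practical for small m and k.
    """
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    worst = len(starts)
    for errors in combinations(range(m), k):
        bad = set(errors)
        surviving = sum(all(j + o not in bad for o in offsets) for j in starts)
        worst = min(worst, surviving)
    return worst


for shape in ("###", "##.#"):
    print(shape, threshold(shape, 11, 3))
```

With pattern length 11 and k = 3 this reproduces the slide's values, t = 0 for the contiguous shape and t = 1 for the gapped one.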

Page 20: Filter Algorithms for Approximate String Matching

Finding good shapes

[Figure: filter trade-off. As q goes from low to high, the number of q-gram hits and the filtration time go from high to low. The vertical axis shows the number of potential matches, and with it the verification time, from low to high. The plot marks a trade-off line and regions of good filters and bad filters.]

Page 21: Filter Algorithms for Approximate String Matching

Finding good shapes

[Figure: the same trade-off plot. The number of q-gram hits is labelled |S| · (1/|Σ|)^q, with q from low to high; the vertical axis shows the number of potential matches from low to high. The trade-off line is marked with a question mark, next to the regions of good filters and bad filters.]

Page 22: Filter Algorithms for Approximate String Matching

Finding good shapes

For |P| = 13, k = 3 and q = 3 the shapes ##.# and ### both have a threshold of t = 2. However, the gapped shape returns fewer potential matches.

Reason:

##.#       ###
 ##.#       ###
-----      ----
  5          4

A random match requires 5 matching characters instead of only 4 for the ungapped q-gram. This makes random matches less likely.

Page 23: Filter Algorithms for Approximate String Matching

Finding good shapes

We define the minimum coverage cm as the minimum number of matching characters for any distinct arrangement of t matching shapes in P and S

For t = 2 and the shape ##.# the minimum coverage is 5

CGACGATTGAT
   ##.#
    ##.#
   -----
ACTCGATTAGA
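The minimum coverage defined above can also be found by brute force for small cases. The sketch assumes the Hamming-distance setting, where a matching shape occupies the same positions in P and S, so the coverage of t placements is simply the number of distinct positions under their `#` characters. The function name is hypothetical.

```python
from itertools import combinations


def minimum_coverage(shape, t, m):
    """Fewest distinct positions covered by t different placements of shape.

    Tries every choice of t distinct start positions in a pattern of
    length m; raises ValueError if fewer than t placements fit.
    """
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    return min(len({j + o for j in placement for o in offsets})
               for placement in combinations(starts, t))


for shape in ("###", "##.#"):
    print(shape, minimum_coverage(shape, 2, 13))
```

For t = 2 this gives 4 for the contiguous shape and 5 for ##.#, matching the counts on the slides.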

Page 24: Filter Algorithms for Approximate String Matching

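Finding good shapes

[Figure: the trade-off plot with both axes labelled. The number of q-gram hits is labelled |S| · (1/|Σ|)^q, with q from low to high. The number of potential matches is labelled |S| · (1/|Σ|)^cm, with the minimum coverage cm from low to high. The plot marks the trade-off line and the regions of good filters and bad filters.]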

Page 25: Filter Algorithms for Approximate String Matching

Finding good shapes

• compute t and minimum coverage for all shapes with |P|=50 and k=3,4,5,6

[Figure: number of shapes (0 to 600) with a given minimum coverage (8 to 22) for k = 5 and q = 8, with labels t = 1 to t = 5 and markers for the median, the contiguous shape and the best shape.]

Page 26: Filter Algorithms for Approximate String Matching

Experimental Analysis

Speed and Filtration Efficiency
The Heuristic Zone

Page 27: Filter Algorithms for Approximate String Matching

Speed and Filtration Efficiency

[Figure: for q = 6 to 12, the minimum coverage (axis 8 to 24), the number of matches (axis 2^16 down to 2^-8) and the number of hits (2^22 down to 2^12) for gapped (Hamming) and contiguous shapes. k = 5, |P| = 50, |S| = 50Mbps.]

Section outline (Experimental Analysis): A few different Filters / Speed and Filtration Efficiency / The Heuristic Zone

Page 28: Filter Algorithms for Approximate String Matching

From Hits to Matches: Describing Filter Properties

Filters usually have 3 'recognition zones' depending on k:
1. Guarantee zone (finds all approximate matches)
2. Heuristic zone (finds some of the approximate matches)
3. Negative zone (guaranteed not to find matches)

[Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|).]

Page 29: Filter Algorithms for Approximate String Matching

From Hits to Matches: Describing Filter Properties (continued)

[Figure build: the same recognition-rate plot, with the error level k now marked on the Errors axis.]

Page 30: Filter Algorithms for Approximate String Matching

From Hits to Matches: Describing Filter Properties (continued)

[Figure build: the same recognition-rate plot, with k and |P| - cm marked on the Errors axis.]


Page 32: Filter Algorithms for Approximate String Matching

The Heuristic Zone

[Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|), with k and |P| - cm marked on the Errors axis and the Heuristic Zone labelled.]

Problem: Behaviour in the Heuristic Zone is hard to predict

Page 33: Filter Algorithms for Approximate String Matching

A simple idea: Sampling!

For a value i:
1. Generate s sample strings with i random errors each
2. Run a filter algorithm on these samples
3. Record how many strings were recognized (in percent)

This allows an experimental evaluation of the Heuristic Zone
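The three sampling steps can be sketched as follows. Several simplifications are assumptions of this example and not taken from the slides: errors are substitutions only (Hamming distance), each sample is compared with P at aligned positions, a sample counts as recognized when at least t placements of the shape still match, and the function name, the DNA alphabet and the fixed random seed are all illustrative.

```python
import random


def recognition_rate(P, shape, t, i, s=1000, alphabet="ACGT", seed=0):
    """Percentage of s samples with i random substitutions that pass the filter."""
    rng = random.Random(seed)
    offsets = [o for o, c in enumerate(shape) if c == "#"]
    starts = range(len(P) - len(shape) + 1)
    recognized = 0
    for _ in range(s):
        # Step 1: copy P and substitute i distinct random positions.
        sample = list(P)
        for pos in rng.sample(range(len(P)), i):
            sample[pos] = rng.choice([c for c in alphabet if c != P[pos]])
        # Step 2: run the filter (count shape placements that still match).
        hits = sum(all(sample[j + o] == P[j + o] for o in offsets)
                   for j in starts)
        # Step 3: record whether the sample was recognized.
        recognized += hits >= t
    return 100.0 * recognized / s


P = "ACGT" * 12 + "AC"                 # |P| = 50
t = len(P) - 11 + 1 - 3 * 11           # contiguous q = 11, k = 3
for i in (0, 3, 10, 50):
    print(i, recognition_rate(P, "#" * 11, t, i))
```

Sweeping i from 0 to |P| and plotting the returned percentages gives a recognition-rate curve of the kind shown on the following slides.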

Page 34: Filter Algorithms for Approximate String Matching

The Heuristic Zone

[Figure: recognition rate (0% to 100%) against the number of errors (0 to 30) for contiguous filters with k=3, q=11 and k=4, q=9. |P| = 50, 1000 samples for each error level.]

Page 35: Filter Algorithms for Approximate String Matching

The Heuristic Zone

[Figure: recognition rate (0% to 100%) against the number of errors (0 to 30). Legend entries: contiguous (k=4, q=9; k=3, q=11) and gapped, edit (k=5, q=10; k=4, q=11; k=3, q=11). |P| = 50, 1000 samples for each error level.]

Page 36: Filter Algorithms for Approximate String Matching

The Heuristic Zone

[Figure: recognition rate (0% to 100%) against the number of errors (0 to 30). Legend entries: contiguous (k=4, q=9; k=3, q=11), gapped, edit (k=5, q=10; k=4, q=11; k=3, q=11), BLAST, and two further curves labelled k=3, q=11 and k=4, q=10. |P| = 50, 1000 samples for each error level.]


Page 38: Filter Algorithms for Approximate String Matching

The Heuristic Zone

[Figure: enlarged view of the same plot, recognition rate from 50% to 100% against 0 to 15 errors. Legend entries: contiguous (k=4, q=9; k=3, q=11), gapped, edit (k=3, q=11; k=4, q=11; k=5, q=10), BLAST, and two further curves labelled k=3, q=11 and k=4, q=10. |P| = 50, 1000 samples for each error level.]

Page 39: Filter Algorithms for Approximate String Matching

Conclusion - Future Work

Our Work:
Significant sensitivity improvement over existing filters
Required modifications easy to implement
Methods for describing filter properties

Future Work:
Combination of 'orthogonal' shapes into one filter
Use of word neighborhoods
Database of filter properties for good shapes