approximate string matching

Approximate String Matching. A Guided Tour to Approximate String Matching. Gonzalo Navarro. Justin Wiseman.


Post on 30-Dec-2015


DESCRIPTION

Approximate String Matching. A Guided Tour to Approximate String Matching. Gonzalo Navarro. Justin Wiseman. Outline: definition of approximate string matching (ASM), applications of ASM, algorithms, conclusion.

TRANSCRIPT

Approximate String Matching

A Guided Tour to Approximate String Matching

Gonzalo Navarro

Justin Wiseman

Outline:

Definition of approximate string matching (ASM)

Applications of ASM

Algorithms

Conclusion

Approximate string matching

Approximate string matching is the process of matching strings while allowing for errors.

The edit distance

Strings are compared based on how close they are

This closeness is called the edit distance

The edit distance is the minimum number of operations required to transform one string into the other

Levenshtein / edit distance

Named after Vladimir Levenshtein, who introduced the Levenshtein distance in 1965

Accounts for three basic operations:

Insertions, deletions, and replacements

In the simplified version, all operations have a cost of 1

Example: mash and march have an edit distance of 2

Other distance algorithms

Hamming distance: allows only substitutions, with a cost of one each

Episode distance: allows only insertions, with a cost of one each

Longest Common Subsequence distance: allows only insertions and deletions, costing one each
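Two of the restricted distances above are short enough to sketch directly. The function names `hamming_distance` and `lcs_distance` are illustrative choices and not part of the slides, and the sketch assumes unit costs throughout.

```python
def hamming_distance(x: str, y: str) -> int:
    """Count the positions where x and y differ (substitutions only)."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def lcs_distance(x: str, y: str) -> int:
    """Insertions plus deletions needed to turn x into y (no substitutions)."""
    # prev[j] holds the LCS length of the processed prefix of x and y[:j].
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    # Every character outside the longest common subsequence is either
    # deleted from x or inserted from y.
    return len(x) + len(y) - 2 * prev[-1]
```

For mash and march the LCS distance is 3 (delete s, insert r and c), one more than the edit distance of 2, because a substitution has to be paid for as a deletion plus an insertion.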

Outline:

What is approximate string matching (ASM)?

What are the applications of ASM?

Algorithms

Conclusion

Applications

Computational biology

Signal processing

Information retrieval

Computational biology

DNA is composed of adenine, cytosine, guanine, and thymine (A, C, G, T)

One can think of the set {A,C,G,T} as the alphabet for DNA sequences

ASM is used to find specific or similar DNA sequences

Knowing how different two sequences are can give insight into the evolutionary process.

Signal processing

Used heavily in speech recognition software

Error correction for receiving signals

Multimedia and song recognition

Information Retrieval

Spell checkers

Search engines

Web searches (Google)

Personal files (agrep for Unix)

Searching texts with errors such as digitized books

Handwriting recognition

Outline:

What is approximate string matching (ASM)?

What are the applications of ASM?

Algorithms

Conclusion

Algorithms

Definitions

Dynamic Programming algorithms

Automatons

Bit-parallelism

Filters

Definitions

Let $\Sigma$ be a finite alphabet of size $|\Sigma| = \sigma$

Let $T \in \Sigma^*$ be a text of length $n = |T|$

Let $P \in \Sigma^*$ be a pattern of length $m = |P|$

Let $k \in \mathbb{R}$ be the maximum error allowed

Let $d : \Sigma^* \times \Sigma^* \to \mathbb{R}$ be a distance function

Therefore, given $T$, $P$, $k$, and $d(\cdot)$, return the set of all text positions $j$ such that there exists $i$ such that $d(P, T_{i..j}) \le k$

Algorithms

Definitions

Dynamic Programming algorithms

Automatons

Bit-parallelism

Filters

Dynamic Programming

The oldest approach to solving the problem of approximate string matching

Not very efficient: runtime is O(|x||y|)

However, space is only O(min(|x|, |y|))

Most flexible when adapting to different distance functions

Computing the edit distance

To compute the edit distance $ed(x, y)$, create a matrix $C_{0..|x|,\,0..|y|}$ where $C_{i,j}$ represents the minimum number of operations needed to match $x_{1..i}$ to $y_{1..j}$

$C_{i,0} = i$

$C_{0,j} = j$

$C_{i,j} = C_{i-1,j-1}$ if $x_i = y_j$, else $1 + \min(C_{i-1,j},\, C_{i,j-1},\, C_{i-1,j-1})$

Edit distance example

$C_{i,0} = i$, $C_{0,j} = j$

if $x_i = y_j$: $C_{i,j} = C_{i-1,j-1}$

else: $C_{i,j} = 1 + \min(C_{i-1,j},\, C_{i,j-1},\, C_{i-1,j-1})$
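The recurrence can be sketched in a few lines of Python. This is a minimal illustration that fills the whole matrix for readability, assuming unit costs for all three operations. The name `edit_distance` is a hypothetical choice and not taken from the slides.

```python
def edit_distance(x: str, y: str) -> int:
    """Fill the matrix C row by row and return C[|x|][|y|]."""
    rows, cols = len(x) + 1, len(y) + 1
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        C[i][0] = i  # delete the first i characters of x
    for j in range(cols):
        C[0][j] = j  # insert the first j characters of y
    for i in range(1, rows):
        for j in range(1, cols):
            if x[i - 1] == y[j - 1]:
                C[i][j] = C[i - 1][j - 1]  # characters agree: no extra cost
            else:
                C[i][j] = 1 + min(C[i - 1][j], C[i][j - 1], C[i - 1][j - 1])
    return C[-1][-1]
```

Calling `edit_distance("mash", "march")` gives 2, in line with the earlier example. Only the previous row is needed to compute the next one, which is where the reduced space bound comes from.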

Text searching

The previous algorithm can be converted to search a text for a given pattern with few changes

Let x = Pattern and y = Text

Set $C_{0,j} = 0$ so that any text position can be the start of a match

$C_{i,j} = C_{i-1,j-1}$ if $P_i = T_j$, else $1 + \min(C_{i-1,j},\, C_{i,j-1},\, C_{i-1,j-1})$

Report a match ending at text position $j$ whenever $C_{m,j} \le k$

Text search example

In English: if the letters at the current indices are the same, the current cell takes the value of the top-left cell. If the letters are not the same, the current cell is one plus the minimum of the left, top, and top-left cells.
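The search variant can be sketched by keeping a single column of the matrix and sweeping it across the text. The name `approx_search` and the choice to report 1-based end positions are assumptions made for this illustration.

```python
def approx_search(pattern: str, text: str, k: int) -> list[int]:
    """Return the 1-based text positions j where a match with <= k errors ends."""
    m = len(pattern)
    col = list(range(m + 1))  # column 0 of the matrix: C[i][0] = i
    matches = []
    for j, t in enumerate(text, start=1):
        prev_diag = col[0]  # C[0][j-1], always 0 in the search variant
        for i in range(1, m + 1):
            old = col[i]  # C[i][j-1], the cell to the left
            if pattern[i - 1] == t:
                col[i] = prev_diag
            else:
                # col[i - 1] was already updated, so it is C[i-1][j] (top).
                col[i] = 1 + min(old, col[i - 1], prev_diag)
            prev_diag = old
        if col[m] <= k:
            matches.append(j)
    return matches
```

For example, `approx_search("survey", "surgery", 2)` returns `[5, 6, 7]`: occurrences with at most two errors end at the fifth, sixth, and seventh text characters. Keeping one column makes the working space O(m) while the running time stays O(mn).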

Improvements

The example algorithm listed was the first

Many DP based algorithms improved on the search time

In 1992, Chang and Lampe produced a new algorithm called column partitioning, with an average search time of $O(kn/\sqrt{\sigma})$, where k = errors, n = text length, and $\sigma$ = size of alphabet

Algorithms

Definitions

Dynamic Programming algorithms

Automatons

Bit-parallelism

Filters

Automatons for approx. search

Model the search with a nondeterministic finite automaton (NFA)

1985: Esko Ukkonen proposes a deterministic form

Fast: deterministic form has O(n) worst case search time

Large: the space complexity of the DFA grows exponentially with respect to the pattern length

NFA example with k = 2

Matching the pattern survey on text surgery
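The NFA for this example can be simulated directly by tracking the set of active states. The sketch below assumes the usual layout of this automaton: a state (i, e) means that i pattern characters have been consumed with e errors, horizontal transitions are matches, vertical ones are extra text characters, and diagonal ones are substitutions or skipped pattern characters. The name `nfa_search` is hypothetical.

```python
def nfa_search(pattern: str, text: str, k: int) -> list[int]:
    """Simulate the approximate-matching NFA; return 1-based match end positions."""
    m = len(pattern)

    def closure(states):
        # Follow epsilon transitions: skip a pattern character at the cost of one error.
        stack, seen = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if i < m and e < k and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    active = closure({(0, 0)})
    matches = []
    for j, c in enumerate(text, start=1):
        nxt = {(0, 0)}  # self-loop on the initial state: a match may start anywhere
        for i, e in active:
            if i < m and pattern[i] == c:
                nxt.add((i + 1, e))  # horizontal: characters match
            if e < k:
                nxt.add((i, e + 1))  # vertical: extra character in the text
                if i < m:
                    nxt.add((i + 1, e + 1))  # diagonal: substitution
        active = closure(nxt)
        if any(i == m for i, _ in active):  # some final state is active
            matches.append(j)
    return matches
```

With the slide's example, `nfa_search("survey", "surgery", 2)` returns `[5, 6, 7]`. Simulating the set of states costs up to O(mk) work per text character, which is the cost that the deterministic and bit-parallel approaches try to avoid.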

Improvements

Kurtz [1996] proposes lazy construction of the DFA

Space requirements reduced to O(mn)

Algorithms

Definitions

Dynamic Programming algorithms

Automatons

Bit-parallelism

Filters

Bit-parallelism

Takes advantage of the inherent parallelism of a computer when it operates on the bits of a machine word

Changes an existing algorithm to operate at the bit level

The number of operations can be reduced by a factor of up to w, where w is the number of bits in a computer word

Shift-Or

Was the first bit-parallel algorithm

Parallelizes the operation of an NFA that tries to match the pattern exactly

NFA has m+1 states

Builds a table B which stores a bit mask for every character c

For the mask B[c], the bit $b_i$ is set if and only if $P_i = c$

Search state is kept in a machine word $D = d_m \ldots d_1$

$d_i$ is 1 when $P_{1..i}$ matches the end of the text scanned so far

A match is registered when $d_m = 1$
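The table B and the state word D can be sketched as follows. This is a minimal illustration using the convention described above, where a 1 bit marks a matching prefix (a variant often called Shift-And). The starting state D = 0 and the update line are assumptions of this sketch and are not taken from the slides. The name `shift_and_search` is hypothetical.

```python
def shift_and_search(pattern: str, text: str) -> list[int]:
    """Exact matching with bit masks; returns 1-based end positions of matches."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # B[c] has bit i-1 set if and only if the i-th pattern character is c.
    B: dict[str, int] = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit d_m
    D = 0  # assumption: no prefix has matched before any text is read
    matches = []
    for j, c in enumerate(text, start=1):
        # Extend every matching prefix by one character, let a new match
        # start here, and keep only the prefixes whose next character is c.
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & accept:
            matches.append(j)
    return matches
```

Python integers are unbounded, so the sketch works for any pattern length. In a language with fixed-width words the same update is a constant number of operations per text character only while m fits in a word of w bits.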

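To start, D is set to $0^m$, since no prefix of P has matched yet (under the inverted Shift-Or convention, where a 0 bit marks a match, the start value is $1^m$)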

D is updated upon reading a new text character using the following formula

D ((D