CSC401 – Analysis of Algorithms Chapter 9
Text Processing
Objectives:
• Strings
• Pattern matching algorithms
  • Brute-force algorithm
  • Boyer-Moore algorithm
  • Knuth-Morris-Pratt algorithm
• Tries
  • Standard tries
  • Compressed tries
  • Suffix tries
• Huffman encoding algorithm
CSC401: Analysis of Algorithms
Strings
A string is a sequence of characters
Examples of strings:
– Java program
– HTML document
– DNA sequence
– Digitized image
An alphabet Σ is the set of possible characters for a family of strings
Examples of alphabets:
– ASCII
– Unicode
– {A, C, G, T}
Let P be a string of size m
– A substring P[i .. j] of P is the subsequence of P consisting of the characters with ranks between i and j
– A prefix of P is a substring of the type P[0 .. i]
– A suffix of P is a substring of the type P[i .. m − 1]
Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P
Applications: text editors, search engines, biological research
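As a small illustration of the notation above, the following Python sketch maps P[i .. j] onto slices. The helper names `substring`, `prefixes`, and `suffixes` are hypothetical and exist only for this example.

```python
def substring(p, i, j):
    """Return P[i .. j], with both ranks inclusive as in the definition above."""
    return p[i:j + 1]


def prefixes(p):
    """All prefixes P[0 .. i] for 0 <= i <= m - 1."""
    return [p[:i + 1] for i in range(len(p))]


def suffixes(p):
    """All suffixes P[i .. m - 1] for 0 <= i <= m - 1."""
    return [p[i:] for i in range(len(p))]


if __name__ == "__main__":
    print(substring("ACGTAC", 1, 3))
    print(prefixes("ACG"))
    print(suffixes("ACG"))
```

Python slices exclude their upper bound, hence the `j + 1` in `substring`.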
Brute-Force Algorithm
The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
– a match is found, or
– all placements of the pattern have been tried
Brute-force pattern matching runs in time O(nm)
Example of worst case:
– T = aaa … ah
– P = aaah
– may occur in images and DNA sequences
– unlikely in English text

Algorithm BruteForceMatch(T, P)
    Input text T of size n and pattern P of size m
    Output starting index of a substring of T equal to P or −1 if no such substring exists
    for i ← 0 to n − m
        { test shift i of the pattern }
        j ← 0
        while j < m ∧ T[i + j] = P[j]
            j ← j + 1
        if j = m
            return i { match at i }
        { otherwise a mismatch occurred: try the next shift }
    return −1 { no match anywhere }
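The pseudocode above translates almost line for line into Python. This is a minimal sketch, not a tuned implementation; the function name `brute_force_match` is an assumption made for this example.

```python
def brute_force_match(text, pattern):
    """Return the first index where pattern occurs in text, or -1 if it never does."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):          # try every shift i of the pattern
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                      # all m characters matched
            return i
    return -1                           # no shift produced a match


if __name__ == "__main__":
    print(brute_force_match("abacaabadcabacabaabb", "abacab"))
```

On the worst-case input from the slide (a text of repeated `a` ending in `h`, with the pattern `aaah`), nearly every shift runs the inner loop to the last pattern character, which is where the O(nm) bound comes from.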
Boyer-Moore Heuristics
The Boyer-Moore pattern matching algorithm is based on two heuristics
Looking-glass heuristic: Compare P with a subsequence of T moving backwards
Character-jump heuristic: When a mismatch occurs at T[i] = c
– If P contains c, shift P to align the last occurrence of c in P with T[i]
– Else, shift P to align P[0] with T[i + 1]
Example
[Figure: Boyer-Moore matching of the pattern "rithm" against the text "a pattern matching algorithm"; the comparisons are numbered 1 to 11.]
Last-Occurrence Function
Boyer-Moore's algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L mapping Σ to integers, where L(c) is defined as
– the largest index i such that P[i] = c, or
– −1 if no such index exists
Example:
– Σ = {a, b, c, d}
– P = abacab

c      a   b   c   d
L(c)   4   5   3   −1

The last-occurrence function can be represented by an array indexed by the numeric codes of the characters
The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of Σ
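A sketch of the last-occurrence function in Python follows. It uses a dictionary instead of an array indexed by character codes; that substitution, and the name `last_occurrence`, are assumptions made for this example.

```python
def last_occurrence(pattern, alphabet):
    """Map each character c of the alphabet to the largest i with pattern[i] == c, or -1."""
    last = {c: -1 for c in alphabet}    # O(s): every character starts at -1
    for i, c in enumerate(pattern):     # O(m): later occurrences overwrite earlier ones
        last[c] = i
    return last


if __name__ == "__main__":
    print(last_occurrence("abacab", "abcd"))
```

The two loops correspond to the two terms of the O(m + s) bound.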
The Boyer-Moore Algorithm
Algorithm BoyerMooreMatch(T, P, Σ)
    L ← lastOccurenceFunction(P, Σ)
    i ← m − 1
    j ← m − 1
    repeat
        if T[i] = P[j]
            if j = 0
                return i { match at i }
            else
                i ← i − 1
                j ← j − 1
        else
            { character-jump }
            l ← L[T[i]]
            i ← i + m − min(j, 1 + l)
            j ← m − 1
    until i > n − 1
    return −1 { no match }

[Figure, Case 1: j ≤ 1 + l; the index i advances by m − j.]
[Figure, Case 2: 1 + l ≤ j; the index i advances by m − (1 + l).]
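The algorithm above can be sketched in Python as follows. This is a simplified version that uses only the character-jump heuristic from these slides. It builds the last-occurrence table as a dictionary over the characters of the pattern and treats any other character as −1, so no alphabet argument is needed; that shortcut and the name `boyer_moore_match` are assumptions made for this example.

```python
def boyer_moore_match(text, pattern):
    """Return the first index where pattern occurs in text, or -1 (character-jump rule only)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    last = {c: k for k, c in enumerate(pattern)}   # last occurrence of each pattern character
    i = j = m - 1                                  # start comparing at the end of the pattern
    while i <= n - 1:
        if text[i] == pattern[j]:
            if j == 0:
                return i                           # match at i
            i -= 1
            j -= 1
        else:
            l = last.get(text[i], -1)              # -1 when text[i] is not in the pattern
            i += m - min(j, 1 + l)                 # character jump
            j = m - 1
    return -1


if __name__ == "__main__":
    print(boyer_moore_match("a pattern matching algorithm", "rithm"))
```

A `while` loop replaces the `repeat ... until` of the pseudocode so that a pattern longer than the text is rejected without indexing past the end of the text.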
Example
[Figure: Boyer-Moore matching of the pattern "abacab" against the text "abacaabadcabacabaabb"; the comparisons are numbered 1 to 13.]
Analysis
Boyer-Moore's algorithm runs in time O(nm + s)
Example of worst case:
– T = aaa … a
– P = baaa
The worst case may occur in images and DNA sequences but is unlikely in English text
Boyer-Moore's algorithm is significantly faster than the brute-force algorithm on English text
[Figure: worst-case run of the pattern "baaaaa" against the text "aaaaaaaaa"; the comparisons are numbered 1 to 24.]
The KMP Algorithm - Motivation
Knuth-Morris-Pratt's algorithm compares the pattern to the text from left to right, but shifts the pattern more intelligently than the brute-force algorithm.
When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]
[Figure: the pattern "abaaba" matches the text characters "abaab" and then mismatches the text character x at pattern index j. After the shift, the part labeled "No need to repeat these comparisons" is skipped, and the position of x is labeled "Resume comparing here".]
KMP Failure Function
Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself
The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j]
Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j − 1)

j      0   1   2   3   4   5
P[j]   a   b   a   a   b   a
F(j)   0   0   1   1   2   3

[Figure: after a mismatch with the text character x at pattern index j, the pattern "abaaba" is shifted so that its prefix of size F(j − 1) stays aligned with the text.]
The KMP Algorithm
The failure function can be represented by an array and can be computed in O(m) time
At each iteration of the while-loop, either
– i increases by one, or
– the shift amount i − j increases by at least one (observe that F(j − 1) < j)
Hence, there are no more than 2n iterations of the while-loop
Thus, KMP's algorithm runs in optimal time O(m + n)

Algorithm KMPMatch(T, P)
    F ← failureFunction(P)
    i ← 0
    j ← 0
    while i < n
        if T[i] = P[j]
            if j = m − 1
                return i − j { match }
            else
                i ← i + 1
                j ← j + 1
        else
            if j > 0
                j ← F[j − 1]
            else
                i ← i + 1
    return −1 { no match }
Computing the Failure Function
The failure function can be represented by an array and can be computed in O(m) time
The construction is similar to the KMP algorithm itself
At each iteration of the while-loop, either
– i increases by one, or
– the shift amount i − j increases by at least one (observe that F(j − 1) < j)
Hence, there are no more than 2m iterations of the while-loop

Algorithm failureFunction(P)
    F[0] ← 0
    i ← 1
    j ← 0
    while i < m
        if P[i] = P[j]
            { we have matched j + 1 chars }
            F[i] ← j + 1
            i ← i + 1
            j ← j + 1
        else if j > 0 then
            { use failure function to shift P }
            j ← F[j − 1]
        else
            F[i] ← 0 { no match }
            i ← i + 1
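The KMPMatch and failureFunction pseudocode can be sketched together in Python as shown below. The names `failure_function` and `kmp_match` are chosen for this example, and the empty-pattern guard is an added assumption that the slides do not discuss.

```python
def failure_function(pattern):
    """F[j] = size of the largest prefix of P[0..j] that is also a suffix of P[1..j]."""
    m = len(pattern)
    f = [0] * m
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            f[i] = j + 1                # we have matched j + 1 characters
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]                # use the failure function to shift the pattern
        else:
            f[i] = 0                    # no match
            i += 1
    return f


def kmp_match(text, pattern):
    """Return the first index where pattern occurs in text, or -1 if it never does."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    f = failure_function(pattern)
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            if j == m - 1:
                return i - j            # match
            i += 1
            j += 1
        elif j > 0:
            j = f[j - 1]                # shift the pattern, keep the text index
        else:
            i += 1
    return -1


if __name__ == "__main__":
    print(failure_function("abaaba"))
    print(kmp_match("abacaabaccabacabaabb", "abacab"))
```

The text index `i` never moves backwards, which is why the matching phase needs at most 2n loop iterations.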
Example
[Figure: KMP matching of the pattern "abacab" against the text "abacaabaccabacabaabb"; the comparisons are numbered 1 to 19.]

j      0   1   2   3   4   5
P[j]   a   b   a   c   a   b
F(j)   0   0   1   0   1   2
Preprocessing Strings
Preprocessing the pattern speeds up pattern matching queries
– After preprocessing the pattern, KMP's algorithm performs pattern matching in time proportional to the text size
If the text is large, immutable and searched for often (e.g., works by Shakespeare), we may want to preprocess the text instead of the pattern
A trie is a compact data structure for representing a set of strings, such as all the words in a text
– A trie supports pattern matching queries in time proportional to the pattern size
Standard Trie (1)
The standard trie for a set of strings S is an ordered tree such that:
– Each node but the root is labeled with a character
– The children of a node are alphabetically ordered
– The paths from the root to the external nodes yield the strings of S
Example: standard trie for the set of strings
S = { bear, bell, bid, bull, buy, sell, stock, stop }
[Figure: standard trie for S. The root has children b and s. Below b are the branches e (leading to bear and bell), i (bid), and u (bull and buy). Below s are the branches e (sell) and t (stock and stop).]
Standard Trie (2)
A standard trie uses O(n) space and supports searches, insertions and deletions in time O(dm), where:
– n: total size of the strings in S
– m: size of the string parameter of the operation
– d: size of the alphabet
[Figure: the standard trie for S = { bear, bell, bid, bull, buy, sell, stock, stop }, repeated from the previous slide.]
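A standard trie with insertion and search might be sketched in Python as follows. The class names `TrieNode` and `Trie` and the `is_word` flag are assumptions made for this example; the flag lets the sketch store a word that is a prefix of another word, a case the slides avoid.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # character -> TrieNode
        self.is_word = False    # True if a stored string ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        """Walk down from the root, creating one node per missing character."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        """Follow the characters of word; succeed only if the path ends a stored string."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


if __name__ == "__main__":
    t = Trie(["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"])
    print(t.contains("bell"), t.contains("be"))
```

Because the children are kept in a hash-based dictionary, each step takes expected constant time. The O(dm) bound on the slide corresponds to scanning up to d children at each of the m levels.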
Word Matching with a Trie
We insert the words of the text into a trie
Each leaf stores the occurrences of the associated word in the text
Text (characters indexed 0 to 88):
see a bear? sell stock! see a bull? buy stock! bid stock! bid stock! hear the bell? stop!
Occurrences stored at the leaves of the trie:

word     starting indices
bear     6
bell     78
bid      47, 58
bull     30
buy      36
hear     69
see      0, 24
sell     12
stock    17, 40, 51, 62
stop     84
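The word index described above can be sketched in Python with nested dictionaries. The function names, the `"$"` key that holds the occurrence list, and the rule that a word is a maximal run of letters are all assumptions made for this example. Unlike the slide's figure, this sketch also indexes the short words "a" and "the".

```python
import re


def build_word_index(text):
    """Insert every word of text into a trie; each word's end node lists its start indices."""
    root = {}
    for match in re.finditer(r"[A-Za-z]+", text):
        node = root
        for ch in match.group():
            node = node.setdefault(ch, {})
        node.setdefault("$", []).append(match.start())   # "$" never appears inside a word
    return root


def occurrences(root, word):
    """Return the starting indices of word in the indexed text, in time proportional to len(word)."""
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return []
    return node.get("$", [])


if __name__ == "__main__":
    text = ("see a bear? sell stock! see a bull? buy stock! "
            "bid stock! bid stock! hear the bell? stop!")
    index = build_word_index(text)
    print(occurrences(index, "stock"))
```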
Compressed Trie
A compressed trie has internal nodes of degree at least two
It is obtained from a standard trie by compressing chains of "redundant" nodes
[Figure: compressed trie for the strings bear, bell, bid, bull, buy, sell, stock, stop, shown next to the corresponding standard trie. Below b: e (with children ar and ll), id, and u (with children ll and y). Below s: ell and to (with children ck and p).]
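The compression step can be sketched in Python on a trie stored as nested dictionaries. The functions `build_trie` and `compress` are hypothetical helpers for this example. The sketch assumes that no stored string is a prefix of another, which holds for the example set, so it needs no end-of-word markers.

```python
def build_trie(words):
    """Standard trie as nested dicts: each key is one character."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge every chain of single-child nodes into one edge labeled with a substring."""
    compressed = {}
    for label, child in node.items():
        while len(child) == 1:                       # child is a "redundant" node
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        compressed[label] = compress(child)
    return compressed


if __name__ == "__main__":
    words = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
    print(compress(build_trie(words)))
```

After compression every internal node has at least two children, matching the definition above.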
Compact Representation
Compact representation of a compressed trie for an array of strings:
– Stores at the nodes ranges of indices instead of substrings
– Uses O(s) space, where s is the number of strings in the array
– Serves as an auxiliary index structure

S[0] = see
S[1] = bear
S[2] = sell
S[3] = stock
S[4] = bull
S[5] = buy
S[6] = bid
S[7] = hear
S[8] = bell
S[9] = stop

[Figure: compressed trie whose nodes are labeled with triplets i, j, k, each standing for the substring S[i][j..k]. For example, 1, 2, 3 stands for "ar" and 7, 0, 3 stands for "hear".]
Suffix Trie (1)
The suffix trie of a string X is the compressed trie of all the suffixes of X
[Figure: suffix trie of X = "minimize" (characters indexed 0 to 7), with edges labeled by substrings such as e, i, mi, nimize, mize, and ze.]
Suffix Trie (2)
Compact representation of the suffix trie for a string X of size n from an alphabet of size d
– Uses O(n) space
– Supports arbitrary pattern matching queries in X in O(dm) time, where m is the size of the pattern
[Figure: compact suffix trie of X = "minimize" (characters indexed 0 to 7). Each node stores a pair of indices i, j standing for the substring X[i..j], such as 2, 7 for "nimize" and 0, 1 for "mi".]
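To show how a suffix trie answers pattern matching queries, here is a deliberately simplified Python sketch. It builds an uncompressed suffix trie, which takes O(n²) space rather than the O(n) of the compact representation described above. The names `build_suffix_trie` and `contains_substring` are assumptions made for this example.

```python
def build_suffix_trie(x):
    """Uncompressed trie (nested dicts) containing every suffix of x."""
    root = {}
    for start in range(len(x)):
        node = root
        for ch in x[start:]:
            node = node.setdefault(ch, {})
    return root


def contains_substring(root, pattern):
    """A pattern occurs in x exactly when it labels a path starting at the root."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


if __name__ == "__main__":
    trie = build_suffix_trie("minimize")
    print(contains_substring(trie, "nimi"), contains_substring(trie, "zim"))
```

The query walks one node per pattern character, so its running time depends on the pattern size m and not on the text size n.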
Encoding Trie (1)
A code is a mapping of each character of an alphabet to a binary code-word
A prefix code is a binary code such that no code-word is the prefix of another code-word
An encoding trie represents a prefix code
– Each leaf stores a character
– The code word of a character is given by the path from the root to the leaf storing the character (0 for a left child and 1 for a right child)

character   a    b     c     d    e
code-word   00   010   011   10   11

[Figure: encoding trie with leaves a, b, c, d, e.]
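The prefix property is what makes decoding unambiguous: reading bits left to right, the first code-word that matches is the only one that can. A small Python sketch using the code from the table above; the names `CODE`, `encode`, and `decode` are assumptions made for this example.

```python
CODE = {"a": "00", "b": "010", "c": "011", "d": "10", "e": "11"}


def encode(text, code=CODE):
    """Concatenate the code-words of the characters of text."""
    return "".join(code[ch] for ch in text)


def decode(bits, code=CODE):
    """Scan bits left to right, emitting a character whenever a full code-word has been read."""
    inverse = {word: ch for ch, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:          # safe because no code-word is a prefix of another
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a code-word")
    return "".join(out)


if __name__ == "__main__":
    bits = encode("bad")
    print(bits, decode(bits))
```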
Encoding Trie (2)
Given a text string X, we want to find a prefix code for the characters of X that yields a small encoding for X
– Frequent characters should have short code-words
– Rare characters should have long code-words
Example
– X = abracadabra
– T1 encodes X into 29 bits
– T2 encodes X into 24 bits
[Figure: two encoding tries, T1 and T2, for the characters a, b, c, d, r.]
Huffman's Algorithm
Given a string X, Huffman's algorithm constructs a prefix code that minimizes the size of the encoding of X
It runs in time O(n + d log d), where n is the size of X and d is the number of distinct characters of X
A heap-based priority queue is used as an auxiliary structure

Algorithm HuffmanEncoding(X)
    Input string X of size n
    Output optimal encoding trie for X
    C ← distinctCharacters(X)
    computeFrequencies(C, X)
    Q ← new empty heap
    for all c ∈ C
        T ← new single-node tree storing c
        Q.insert(getFrequency(c), T)
    while Q.size() > 1
        f1 ← Q.minKey()
        T1 ← Q.removeMin()
        f2 ← Q.minKey()
        T2 ← Q.removeMin()
        T ← join(T1, T2)
        Q.insert(f1 + f2, T)
    return Q.removeMin()
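A sketch of the algorithm above in Python, using the standard `heapq` module as the heap-based priority queue. The function name `huffman_code`, the representation of trees as nested tuples, and the counter used to break ties between equal frequencies are assumptions made for this example. Instead of returning the trie, the sketch returns the code-word table read off from it.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(x):
    """Return a dict mapping each character of x to its Huffman code-word."""
    freq = Counter(x)                       # O(n): character frequencies
    tie = count()                           # unique tie-breaker so trees are never compared
    heap = [(f, next(tie), c) for c, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)     # the two trees of lowest total frequency
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (t1, t2)))   # join(T1, T2)

    code = {}

    def assign(tree, prefix):
        if isinstance(tree, tuple):         # internal node: 0 for left, 1 for right
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:                               # leaf storing a character
            code[tree] = prefix or "0"      # a one-character text still needs one bit

    if heap:
        assign(heap[0][2], "")
    return code


if __name__ == "__main__":
    x = "abracadabra"
    code = huffman_code(x)
    print(code, sum(len(code[c]) for c in x))
```

For X = abracadabra the resulting encoding uses 23 bits, one fewer than the 24 bits of the trie T2 from the earlier slide.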
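Example
X = abracadabra
Frequencies:

character   a   b   c   d   r
frequency   5   2   1   1   2

[Figure: construction of the Huffman trie. The trees for c and d are joined first (weight 2), then b and r (weight 4), then those two trees (weight 6), and finally a with the result (weight 11).]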