Foundations of Software Design
Lecture 26: Text Processing, Tries, and Dynamic Programming
Marti Hearst & Fredrik Wallenberg, Fall 2002

TRANSCRIPT

Page 1: Foundations of Software Design

Foundations of Software Design
Lecture 26: Text Processing, Tries, and Dynamic Programming
Marti Hearst & Fredrik Wallenberg, Fall 2002

Page 2: Foundations of Software Design

Problem: String Search
• Determine if, and where, a substring occurs within a string

Page 3: Foundations of Software Design

Approaches/Algorithms:
• Brute Force
• Rabin-Karp
• Tries
• Dynamic Programming

Page 4: Foundations of Software Design


“Brute Force” Algorithm
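The slide's figure did not survive the transcript; as a sketch, the brute-force approach tries the pattern at every alignment of the text, one character at a time (the function name is my own):

```python
def brute_force_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1.

    Compares the pattern against every alignment, one character at a time.
    Worst case O(N*M) for text length N and pattern length M.
    """
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):          # each possible starting position
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                      # matched all M characters
            return i
    return -1
```

For example, `brute_force_search("abracadabra", "cad")` returns 4.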

Page 5: Foundations of Software Design

Worst-case Complexity
For a text of length N and a pattern of length M, the brute-force worst case is O(NM): nearly the whole pattern may be compared at every alignment.

Page 6: Foundations of Software Design

Best-case Complexity, String Found
O(M): the pattern matches at the very first alignment.

Page 7: Foundations of Software Design

Best-case Complexity, String Not Found
O(N): the first character of the pattern never matches, so each alignment fails after one comparison.

Page 8: Foundations of Software Design

Rabin-Karp Algorithm
• Calculate a hash value for
  – The pattern being searched for (length M), and
  – Each M-character subsequence in the text
• Start with the first M-character sequence
  – Hash it
  – Compare the hashed search term against it
  – If they match, then look at the letters directly
    • Why do we need this step? (Different strings can hash to the same value, so a hash match must be verified.)
  – Else go to the next M-character sequence

(Note 1: Karp is a Turing-award-winning professor in CS here!)
(Note 2: CS theory is a good field to be in because they name things after you!)

Page 9: Foundations of Software Design

Karp-Rabin: Looking for 31415
31415 mod 13 = 7. Thus compute each 5-char substring mod 13, looking for 7.

Text: 2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1

  23590 mod 13 = 8
  35902 mod 13 = 9
  59023 mod 13 = 3
  …
  31415 mod 13 = 7

Found 7! Now check the digits.
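The sliding computation above can be sketched with a rolling hash. A minimal version for digit strings, using the slide's modulus q = 13 (the function name is my own):

```python
def rabin_karp_digits(text, pattern, q=13):
    """Find pattern (a digit string) in text using a Rabin-Karp rolling hash mod q."""
    n, m = len(text), len(pattern)
    h = pow(10, m - 1, q)               # weight of the window's leading digit
    p_hash = int(pattern) % q
    t_hash = int(text[:m]) % q          # hash of the first M-character window
    for i in range(n - m + 1):
        # Hashes can collide, so a hash match must be verified letter by letter.
        if t_hash == p_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:                   # roll: drop the leading digit, append the next
            t_hash = (10 * (t_hash - int(text[i]) * h) + int(text[i + m])) % q
    return -1
```

On the slide's text, `rabin_karp_digits("2359023141526739921", "31415")` returns 6, the position of the match.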

Page 10: Foundations of Software Design

Rabin-Karp Algorithm
• Worst case time?
  – N is the length of the string
  – O(N) expected if the hash function is chosen well (many spurious hash matches can degrade this toward O(NM))
• http://orca.st.usm.edu/~suzi/stringmatch/rk_alg.html
• http://www.mills.edu/ACAD_INFO/MCS/CS/S00MCS125/String.Matching.Algorithms/animations.html

Page 11: Foundations of Software Design

Tries
• A tree-based data structure for storing strings in order to make pattern matching faster
• Main idea:
  – Store all the strings from the document, one letter at a time, in a tree structure
  – Two strings with the same prefix are in the same subtree
• Useful for IR prefix queries
  – Search for the longest prefix of a query string Q that matches a prefix of some string in the trie
  – The name comes from Information Retrieval

Page 12: Foundations of Software Design

Trie Example
The standard trie over the alphabet {a,b} for the set {aabab, abaab, babbb, bbaaa, bbbab}

Page 13: Foundations of Software Design

A Simple Incremental Algorithm
• To build the trie, simply add one string at a time
• Check to see if the current character matches the current node
• If so, move to the next character
• If not, make a new branch labeled with the mismatched character, and then move to the next character
• Repeat
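The steps above can be sketched with nested dictionaries, a common trie representation (the "$" end-of-word marker and function names are my own conventions):

```python
def trie_insert(root, word):
    """Add word to the trie, branching at the first mismatched character."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})  # follow an existing edge or make a new branch
    node["$"] = True                    # mark the end of a complete word

def trie_contains(root, word):
    """True iff word was inserted as a complete string."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

# The word list from the trie-growing example on the next slide:
trie = {}
for w in ["buy", "bell", "hear", "see", "bid", "bear", "stop", "bull", "sell", "stock"]:
    trie_insert(trie, w)
```

After this, `trie_contains(trie, "bell")` is True, while `trie_contains(trie, "be")` is False, since "be" is only a prefix.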

Page 14: Foundations of Software Design

Trie-growing Algorithm
[Figure: a trie built incrementally over the words buy, bell, hear, see, bid, bear, stop, bull, sell, stock]

Page 15: Foundations of Software Design

Tries, more formally
• The path from the root of T to any node represents a prefix that is equal to the concatenation of the characters encountered while traversing the path
  – An internal node can have from 1 to d children, where d is the size of the alphabet
• The previous example is a binary tree because the alphabet had only 2 letters
  – A path from the root of T to an internal node at depth i corresponds to an i-character prefix of a string S
  – The height of the tree is the length of the longest string
  – If there are S unique strings, T has S leaf nodes
  – Looking up a string of length M is O(M)

Page 16: Foundations of Software Design

Compressed Tries

Compression is done after the trie has been built up; no more items can be added afterward.

Page 17: Foundations of Software Design

Compressed Tries
• Also known as the PATRICIA trie
  – Practical Algorithm To Retrieve Information Coded In Alphanumeric
  – D. Morrison, Journal of the ACM 15 (1968)
• Improves a space inefficiency of tries
• Tries to remove nodes with only one child (pardon the pun)
• The number of nodes is proportional to the number of strings, not to their total length
  – But this just makes the node labels longer
  – So this only helps if an auxiliary data structure is used to actually store the strings
  – The trie only stores triplets of numbers indicating where in the auxiliary data structure to look
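As a sketch of the compression idea only: merge every chain of single-child nodes into one multi-character edge label. This keeps the labels in memory rather than using Morrison's triplets into an auxiliary store, and all names here are my own:

```python
def compress(node):
    """Merge single-child chains of a dict-based trie into multi-character edge labels."""
    out = {}
    for label, child in node.items():
        if label == "$":                # keep the end-of-word marker as-is
            out["$"] = True
            continue
        # Follow the chain while each node has exactly one child and is not a word end.
        while isinstance(child, dict) and len(child) == 1 and "$" not in child:
            (nxt_label, child), = child.items()
            label += nxt_label          # absorb the chain into the edge label
        out[label] = compress(child) if isinstance(child, dict) else child
    return out

# Build a small uncompressed trie (nested dicts, "$" marks a word end), then compress it.
def insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = True

t = {}
for w in ["see", "sell", "stop", "stock"]:
    insert(t, w)
small = compress(t)
```

Here `small` has a single root edge "s" whose children are the merged edges "e" (splitting into "e" and "ll") and "to" (splitting into "p" and "ck"): the number of nodes now tracks the number of words, not their total length.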

Page 18: Foundations of Software Design

Compressed Trie
[Figure: the compressed trie for the previous word set, with single-child chains merged into multi-character edge labels]

Page 19: Foundations of Software Design

Suffix Tries
• Regular tries can only be used to find whole words
• What if we want to search on suffixes?
  – build*, mini*
• Solution: use suffix tries, where each possible suffix is stored in the trie
• Example: minimize
[Figure: suffix trie for “minimize”; to find “imi”, follow the edge i, then mi]
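The example above can be sketched by inserting every suffix of the word into an ordinary trie; any substring query then just walks a path from the root (function names are mine):

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a dict-based trie."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:                # insert the suffix starting at position i
            node = node.setdefault(ch, {})
    return root

def has_substring(root, pattern):
    """pattern occurs in s iff it labels a path from the root of the suffix trie."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

st = build_suffix_trie("minimize")
```

With this, `has_substring(st, "imi")` is True, matching the slide's "Find: imi" query. (A real suffix trie would share and compress edges as on the slide; this sketch trades space for simplicity.)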

Page 20: Foundations of Software Design

Dynamic Programming
• Used primarily for optimization problems
  – Not just a good solution, but an optimal one
• Brute force algorithms
  – Try every possibility
  – Guarantee finding the optimal solution
  – But inefficient
• DP requires a certain amount of structure, namely:
  – Simple subproblems (and simple break-down)
  – Global optimum is a composition of subproblem optimums
  – Subproblem overlap: optimal solutions to unrelated problems can contain subproblems in common
  – In other words, we can re-use the results of solving the subproblems

Page 21: Foundations of Software Design

Longest Common Subsequence
• LCS: find the longest string S that is a subsequence of both X and Y, where
  – X is of length n
  – Y is of length m
• Example: what is the LCS of
  – supergalactic
  – galaxy
• (The characters do not have to be contiguous)

Page 22: Foundations of Software Design


Dynamic Programming Applied to LCS Problem

Let’s compare:
X = [GTG]   with prefixes X[0…i]
Y = [CGATG] with prefixes Y[0…j]

We represent the length of the longest common subsequence of X[0…i] and Y[0…j] as L[i,j].

Page 23: Foundations of Software Design

Dynamic Programming for LCS
Note that the longest common subsequence length of X and Y (L[i,j]) must be equal to the longest common subsequence length of ...
X[0…i-1] = [GT]   (removing the last G)
Y[0…j-1] = [CGAT] (removing the last G)

… plus 1, since the matching Gs at Xi, Yj will increase the length by one.

Page 24: Foundations of Software Design

Dynamic Programming for LCS
If Xi, Yj had NOT matched, L[i,j] would have to be equal to the larger of L[i-1,j] and L[i,j-1].
If this is true for L[i,j], it must be true for all L.
We know that L[-1,-1] = 0 (since both strings are empty).
Finally, we know that L[i,j] cannot be larger than max(i,j)+1.

Page 25: Foundations of Software Design

Dynamic Programming for LCS

          C   G   A   T   G
     -1   0   1   2   3   4
 -1   0   0   0   0   0   0
G 0   0   0   1   1   1   1
T 1   0   0   1   1   2   2
G 2   0   0   1   1   2   3

L[0,0] = 0 (X0,Y0 doesn’t match… max of L[-1,0] and L[0,-1])
L[0,1] = 1 (X0,Y1 does match… L[-1,0] + 1)

For each position, take the max of L[i-1,j] or L[i,j-1].
Add 1 when a new match is found.

Page 26: Foundations of Software Design

Dynamic Programming
• Running Time/Space:
  – Strings of length m and n
  – O(mn)
  – Brute force algorithm: 2^m subsequences of X to check against n elements of Y: O(n·2^m)

Page 27: Foundations of Software Design

Dynamic Programming vs. Greedy Algorithms
• Sometimes they are the same
• Sometimes not
• What makes an algorithm greedy?
  – A globally optimal solution can be obtained by making locally optimal choices
• Dynamic Programming
  – Solves subproblems that can be re-used
  – Trickier to think of
  – More work to program

Page 28: Foundations of Software Design

From www.cs.virginia.edu/~luebke/cs332.fall00/lecture23.ppt edu/~bodik/cs536.html

Greedy Vs. Dynamic Programming:
• The famous knapsack problem:
  – A thief breaks into a museum. Fabulous paintings, sculptures, and jewels are everywhere. The thief has a good eye for the value of these objects, and knows that each will fetch hundreds or thousands of dollars on the clandestine art collector’s market. But the thief has only brought a single knapsack to the scene of the robbery, and can take away only what he can carry. What items should the thief take to maximize the haul?

Page 29: Foundations of Software Design


The Knapsack Problem
• More formally, the 0-1 knapsack problem:
  – The thief must choose among n items, where the ith item is worth vi dollars and weighs wi pounds
  – Carrying at most W pounds, he wants to maximize value
• Note: assume vi, wi, and W are all integers
• “0-1” b/c each item must be taken or left in its entirety
• A variation, the fractional knapsack problem:
  – The thief can take fractions of items
  – Think of items in the 0-1 problem as gold ingots, in the fractional problem as buckets of gold dust

Page 30: Foundations of Software Design


The Knapsack Problem: Optimal Substructure
• Both variations exhibit optimal substructure
• To show this for the 0-1 problem, consider the most valuable load weighing at most W pounds
  – If we remove item j from the load, what do we know about the remaining load?
  – The remainder must be the most valuable load weighing at most W − wj that the thief could take from the museum, excluding item j

Page 31: Foundations of Software Design


Solving The Knapsack Problem
• The optimal solution to the fractional knapsack problem can be found with a greedy algorithm
• The optimal solution to the 0-1 problem cannot be found with the same greedy strategy
  – Greedy strategy: take in order of dollars/pound
  – Example: 3 items weighing 10, 20, and 30 pounds, knapsack can hold 50 pounds
    • Suppose item 2 is worth $100. Assign values to the other items so that the greedy strategy will fail
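One common way to finish the exercise (the $60 and $120 values below are illustrative choices, not given on the slide): with those values, greedy-by-density takes items 1 and 2 for $160, while the best 0-1 load is items 2 and 3 for $220. A quick check:

```python
from itertools import combinations

# Hypothetical values completing the slide's exercise: weights 10/20/30 lb, capacity 50 lb.
items = [(60, 10), (100, 20), (120, 30)]   # (value in dollars, weight in pounds)
CAP = 50

# Greedy by dollars-per-pound: take whole items in density order while they fit.
greedy_value, room = 0, CAP
for v, w in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
    if w <= room:
        greedy_value += v
        room -= w

# Exhaustive check of every 0-1 choice (fine for 3 items) to find the true optimum.
best = max(sum(v for v, w in subset)
           for r in range(len(items) + 1)
           for subset in combinations(items, r)
           if sum(w for v, w in subset) <= CAP)
```

Here `greedy_value` is 160 but `best` is 220: the greedy strategy fills the knapsack with the densest items and leaves no room for the heavy, valuable one.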

Page 32: Foundations of Software Design


The Knapsack Problem: Greedy Vs. Dynamic
• The fractional problem can be solved greedily
• The 0-1 problem cannot be solved with a greedy approach
  – It can, however, be solved with dynamic programming
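A minimal dynamic-programming sketch for the 0-1 problem, filling the classic table over integer weight capacities in O(n·W) time:

```python
def knapsack_01(items, capacity):
    """Max total value of a subset of (value, weight) items within the weight capacity.

    best[w] holds the best value achievable with total weight <= w.
    Iterating capacities downward ensures each item is used at most once.
    """
    best = [0] * (capacity + 1)
    for value, weight in items:
        for w in range(capacity, weight - 1, -1):
            best[w] = max(best[w], best[w - weight] + value)
    return best[capacity]
```

With the weights from the previous slide and illustrative values of $60/$100/$120, `knapsack_01([(60, 10), (100, 20), (120, 30)], 50)` returns 220, the load the greedy strategy misses.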