suffix trees. 2 outline introduction suffix trees (st) building sts in linear time: ukkonen’s...

72
Suffix Trees

Upload: moriah-batchelor

Post on 15-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

Suffix Trees

Page 2: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

2

Outline

Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm

Applications of ST

Page 3: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

3

Introduction

Page 4: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

4

Substrings String is any sequence of characters. Substring of string S is a string composed

of characters i through j, i j of S. S = cater;ate is a substring.car is not a substring. Empty string is a substring of S.

Page 5: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

5

Subsequences

Subsequence of string S is a string composed of characters i1 < i2 < … < ik of S. S = cater; ate is a subsequence.car is a subsequence.Empty string is a subsequence of S.

Page 6: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

6

String/Pattern Matching - I You are given a source string S. Suppose we have to answer queries of the

form: is the string pi a substring of S? Knuth-Morris-Pratt (KMP) string matching.

O(|S| + | pi |) time per query.

O(n|S| + i | pi |) time for n queries.

Suffix tree solution.O(|S| + i | pi |) time for n queries.

Page 7: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

7

String/Pattern Matching - II KMP preprocesses the query string pi,

whereas the suffix tree method preprocesses the source string (text) S.

The suffix tree for the text is built in O(m) time during a pre-processing stage; thereafter, whenever a string of length O(n) is input, the algorithm searches it in O(n) time using that suffix tree.

Page 8: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

String Matching: Prefixes & Suffixes

Substrings of S beginning at the first position of S are called prefixes of S, and substrings that end at its last position are called suffixes of S.

S=AACTAGPrefixes: AACTAG,AACTA,AACT,AAC,AA,ASuffixes: AACTAG,ACTAG,CTAG,TAG,AG,G

pi is a substring of S iff pi is a prefix of some suffix of S.

Page 9: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

9

Suffix Trees

Page 10: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

10

Definition: Suffix Tree (ST) T for S of length m

1. A rooted tree with m leaves numbered from 1 to m.

2. Each internal node, excluding the root, of T has at least 2 children.

3. Each edge of T is labeled with a nonempty substring of S.

4. No two edges out of a node can have edge-labels starting with the same character.

5. For any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S, namely S[i,m], that starts at position i.

Page 11: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

11

Example: Suffix Tree for S=xabxac

Page 12: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

12

Existence of a suffix tree S

If one suffix Sj of S matches a prefix of another suffix Si of S, then the path for Sj would not end at a leaf.

S = xabxa S1 = xabxa and S4 = xa How to avoid this problem?

Assume that the last character of S appears nowhere else in S. Add a new character $ not in the alphabet to the end of S.

Page 13: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

13

Example: Suffix Tree for S=xabxac$

Page 14: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

14

Building STs in linear time:

Ukkonen’s algorithm

Page 15: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

15

Building STs in linear time

Weiner’s algorithm [FOCS, 1973] ”The algorithm of 1973” called by Knuth First algorithm of linear time, but much space

McGreight’s algorithm [JACM, 1976] Linear time and quadratic space More readable

Ukkonen’s algorithm [Algorithmica, 1995] Linear time algorithm and less space This is what we will focus on

Page 16: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

16

Implicit Suffix Trees Ukkonen’s algorithm constructs a sequence of implicit STs, the

last of which is converted to a true ST of the given string. An implicit suffix tree for string S is a tree obtained from the suffix

tree for S$ by removing $ from all edge labels removing any edges that now have no label removing any node that does not still have at least two children

An implicit suffix tree for prefix S[1,i] of S is similarly defined based on the suffix tree for S[1,i]$.

Ii will denote the implicit suffix tree for S[1,i]. Each suffix is in the tree, but may not end at a leaf.

Page 17: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

17

Example: Construction of the Implicit ST

Implicit tree for xabxa from tree for xabxa$ {xabxa$, abxa$, bxa$, xa$, a$, $}

1

bb

bx

x

xx

a

aa

aa$

$

$

$

$$

23

6

5

4

Page 18: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

18

Construction of the Implicit ST: Remove $

Remove $ {xabxa$, abxa$, bxa$, xa$, a$, $}

1

bb

bx

x

xx

a

aa

aa$

$

$

$

$$

23

6

5

4

Page 19: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

19

Construction of the Implicit ST: After the Removal of $

Remove $ {xabxa, abxa, bxa, xa, a}

1

bb

bx

x

xx

a

aa

aa

23

6

5

4

Page 20: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

20

Construction of the Implicit ST: Remove unlabeled edges

Remove unlabeled edges {xabxa, abxa, bxa, xa, a}

1

bb

bx

x

xx

a

aa

aa

23

6

5

4

Page 21: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

21

Construction of the Implicit ST: After the Removal of Unlabeled Edges

Remove unlabeled edges {xabxa, abxa, bxa, xa, a}

1

bb

bx

x

xx

a

aa

aa

23

Page 22: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

22

Construction of the Implicit ST: Remove interior nodes

Remove internal nodes with only one child {xabxa, abxa, bxa, xa, a}

1

bb

bx

x

xx

a

aa

aa

23

Page 23: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

23

Construction of the Implicit ST: Final implicit tree

Remove internal nodes with only one child {xabxa, abxa, bxa, xa, a}

1

b b

bx

x

xxa a a

aa

23

Page 24: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

24

Ukkonen’s Algorithm (UA) Ii is the implicit suffix tree of the string S[1, i] Construct I1 /* Construct Ii+1 from Ii */ for i = 1 to m-1 do /* phase i+1 */

for j = 1 to i+1 do /* extension j */ Find the end of the path P from the root whose

label is S[j, i] in Ii and extend P with S[i+1] by suffix extension rules;

Convert Im into a suffix tree S

Page 25: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

25

Example S = xabxacd$ i+1 = 1

x i+1 = 2

extend x to xa a

i+1 = 3 extend xa to xab extend a to ab b

Page 26: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

26

Extension Rules Goal: extend each S[j,i] into S[j,i+1] Rule 1: S[j,i] ends at a leaf

Add character S(i+1) to the end of the label on that leaf edge Rule 2: S[j,i] doesn’t end at a leaf, and the following

character is not S(i+1) Split a new leaf edge for character S(i+1) May need to create an internal node if S[j,i] ends in the

middle of an edge Rule 3: S[j,i+1] is already in the tree

No update

Page 27: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

27

Example: Extension Rules

Implicit tree for axabxb from tree for axabx

1

bb

bx x

x

xa a

a

2

3x

bx4b

b

b

b

Rule 1: at a leaf node

b

Rule 3: already in treeRule 2: add a leaf edge (and an interior node)

b

5

Page 28: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

28

UA for axabxc (1)

S[1,3]=axa

E S(j,i) S(i+1)

1 ax a2 x a3   a

Page 29: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

29

UA for axabxc (2)

Page 30: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

30

UA for axabxc (3)

Page 31: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

31

UA for axabxc (4)

Page 32: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

32

Observations Once S[j,i] is located in the tree, extension rules take only

constant time Naively we could find the end of any suffix S[j,i] in O(S[j,i]) time

by walking from the root of the current tree. By that approach, Im could be created in O(m3) time.

Making Ukkonen’s algorithm O(m) Suffix links Skip and count trick Edge-label compression A stopper Once a leaf, always a leaf

Page 33: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

33

Suffix Links Consider the two strings a and xa Suppose some internal node v of the tree is labeled with xa

and another node s(v) in the tree is labeled with a The edge (v,s(v)) is called a suffix link Do all internal nodes (the root is not considered an internal

node) have suffix links?

Page 34: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

34

Example: suffix links

S = ACACACAC$

AC

AC$

AC

AC

$ $

$

$

$

$

$

C

AC

AC

AC$

Page 35: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

35

Suffix Link Lemma If a new internal node v with path-label xa is

added to the current tree in extension j of some phase i+1, then the path labeled a already ends at an internal

node of the tree or the internal node labeled a will be created in

the extension of j+1 in the same phase i+1string a is empty and s(v) is the root

Page 36: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

36

Proof of Suffix Link Lemma A new internal node is created only by the extension rule

2 This means that there are two distinct suffixes of S[1,i+1]

that start with xaxaS(i+1) and xacb where c is not S(i+1)

This means that there are two distinct suffixes of S[1,i+1] that start with a aS(i+1) and acb where c is not S(i+1)

Thus, if a is not empty, a will label an internal node once extension j+1 is processed which is the extension of a

Page 37: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

37

Corollary of Suffix Link Lemma Every internal node of an implicit suffix

tree has a suffix link from it.

Page 38: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

38

How to use suffix links - 1 S[1,i] must end at a leaf since it is the longest

string in the implicit tree Ii Keep a pointer to this leaf in all cases and extend

according to rule 1 Locating S[j+1,i] from S[j,i] which is at node w

If w is an internal node, set v to w Otherwise, set v = parent(w) If v is the root, you must traverse from the root to find

S[j+1,i] If not, go to s(v) and begin search for the remaining

portion of S[j,i] from there

Page 39: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

39

How to use suffix links - 2

Page 40: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

40

Skip and Count Trick – (1) Problem: Moving down from s(v), directly

implemented, takes time proportional to the number of characters compared

Solution: To make running time proportional to the number of nodes in the path searched, instead of the number of characters

Page 41: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

41

Skip and Count Trick – (2) After 4 nodes down-skips, the end of S[j, i]

is found.

Page 42: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

42

Skip and Count Trick – (3)

Node-depth of v, denoted (ND(v)), is the number of nodes on the path from the root to the node v

Lemma: For any suffix link (v, s(v)) traversed in Ukkonen’s algorithm, at that moment, ND(v) ND(s(v))+1

Page 43: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

43

Skip and Count Trick – (4) At the moment of traversing (v,s(v)):

ND(v) ND(s(v))+1

Page 44: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

44

Skip and Count Trick – (5) The current node-depth of the algorithm is the node

depth of the node most recently visited by the algorithm Lemma: Using the skip and count trick, any phase of

Ukkonen’s algorithm takes O(m) time. Up-walk: decreases the current node-depth by 1 Suffix link traversal: same as up-walk Totally, the current node-depth is decreased by 2m. No node has depth >m. The total possible increment to the current node-depth is

3m.

Page 45: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

45

Edge Label Representation Potential Problem

Size of edge labels may require (m2) space Thus, the time for the algorithm is at least as large as the size of

its output Example

S = abcdefghijklmnopqrstuvwxyz Total length is j<m+1 j = m(m+1)/2

Similar problem can happen when the length of the string is arbitrarily larger than the alphabet size

Solution Label edges with pair of indices indicating beginning and end of

positions of the substring in S

Page 46: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

46

Modified Extension Rules

Rule 2: new leaf edge (phase i+1) create edge (i+1, i+1) split edge (p, q) => (p, w) and (w + 1, q)

Rule 1: leaf edge extension label had to be (p,i) before extension given rule 2 above and an induction argument:

(p, q) => (p, q + 1)

Rule 3 Do nothing

Page 47: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

47

Full edge label representation

String S = xabxa$

1

bb

bx

x

xx

a

aa

aa$

$

$

$

$$

23

6

5

4

Page 48: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

48

Edge-label Compression

String S = xabxa$

1

(1,2) [or (4,5)?]

(3,6)

(2,2)(6,6)

23

65

4

(3,6)(6,6)

(6,6) (3,6)

Page 49: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

49

A Stopper In any phase, if suffix extension rule 3 applies in

extension j, it will also apply in all extensions k, where k>j, until the end of the phase.

The extensions in phase i+1 that are done after the first execution of rule 3 are said to be done implicitly. This is in contrast to any extension j where the end of S[j, i] is explicitly found. An extension of that kind is called and explicit extension.

Hence, we can end any phase i+1 when the first extension rule 3 applies.

Page 50: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

50

Once a leaf, always a leaf – (1) If at some point in UA a leaf is created and

labeled j (for the suffix starting at position j of S), then that leaf will remain a leaf in all successive trees created during the algorithm.

In any phase i, there is an initial sequence of consecutive extensions (starting with extension 1) in which only rule 1 or 2 applies, where let ji be the last extension in this sequence.

Note that ji ji+1.

Page 51: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

51

Once a leaf, always a leaf – (2)

Implicit extensions: for extensions 1 to ji, write [·, e] on the leaf edge, where e is a symbol denoting the current end and is set to i + 1 once at the beginning. In later phases, we will not need to explicitly extend this leaf but rather can implicitly extend it by incrementing e once in its global location.

Explicit extensions: from extension ji+1 till first rule 3 extension is found (or until extension i+1 is done)

Page 52: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

52

Once a leaf, always a leaf – (3): Single phase algorithm

j* is the last explicit extension computed in phase i+1 Phase i+1

Increment e to i+1 (implicitly extending all existing leaves) Explicitly compute successive extensions starting at ji+1 and

continuing until reaching the first extension j* where rule 3 applies or no more extensions needed

Set ji+1 to j* -1, to prepare to the next phase Observation

Phase i and i+1 share at most 1 explicit extension

Page 53: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

53

Example: S=axaxbb$ - (1) e = 1, a J1 = 1

1

1

(1,e)

I1

e = 2, ax S[1,2] : skip S[2,2] : rule 2, create(2, e) j2 = 2

1

1

(1,e)

I2

2

(2,e)

e = 3, axa S[1,3] .. S[2,3] : skip S[3,3] : rule 3 j3 = 2

1

1

(1,e)

I3

2

(2,e)

Page 54: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

54

Example: S=axaxbb$ - (2) e = 4, axax S[1,4] .. S[2,4] : skip S[3,4] : rule 3 S[4,4] : auto skip j4 = 2

1

1

(1,e)

I4

2

(2,e)

e = 5, axaxb S[1,5] .. S[2,5] : skip S[3,5] : rule 2, split (1,e)

→ (1, 2) and (3,e), create (5,e) S[4,5] : rule 2, split (2,e)

→ (2,2) and (3,e), create (5,e) S[5,5] : rule 2, create (5,e) j5 = 5

1(1,2)

I5

2

(2,2)

1 3

(3,e)(5,e)

4

(3,e)(5,e)

(5,e)

5

Page 55: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

55

Example: S=axaxbb$ - (3)

e = 6, axaxbb S[1,6] .. S[5,6] : skip S[6,6] : rule 3 j6 = 5

e = 7, axaxbb$ S[1,7] .. S[5,7] : skip S[6,7] : rule 2, split (5,e)

→ (5,5) and (6,e), create (6,e) S[7,7] : rule 2, create (7,e) j7 = 7

1(1,2)

I6

2

(2,2)

1 3

(3,e)(5,e)

4

(3,e)(5,e)

(5,e)

5

1(1,2)

I7

2

(2,2)

1 3

(3,e)(5,e)

4

(3,e)(5,e)

(5,5)

5

6(6,e)

(7,e)

(7,e)7

Page 56: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

56

Complexity of UA Since all the implicit extensions in any phase is constant,

their total cost is O(m). Totally, only 2m explicit extensions are executed. The max number of down-walking skips is O(m). Time-complexity of Ukkonen’s algorithm: O(m)

Page 57: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

57

Finishing up

Convert final implicit suffix tree to a true suffix tree

Add $ using just another phase of executionNow all suffixes will be leaves

Replace e in every leaf edge with mJust requires a traversal of tree which is O(m)

time

Page 58: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

58

Implementation Issues – (1) When the size of the alphabet grows:

For large trees suffix links allow an algorithm to move quickly from one part of the tree to a distant part of the tree. This is great for worst-case time bounds, but its horrible if the tree isn't entirely in memory.

Thus, implementing ST to reduce practical space use can be a serious concern.

The main design issues are how to represent and search the branches out of the nodes of the tree.

A practical design must balance between constraints of space and need for speed

Page 59: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

59

Implementation Issues – (2) There are four basic choices to represent branches

An array of size (||) at each non-leaf node v A linked list at node v of characters that appear at the beginning of the

edge-labels out of v. If its kept in sorted order it reduces the average time to search for a

given character In the worst case it, adds time || to every node operation. If the

number of children of v is large, then little space is saved over the array while noticeably degrading performance

A balanced tree implements the list at node v Additions and searches take O(logk) time and O(k) space, where k is

the number of children of v. This alternative makes sense only when k is fairly large.

A hashing scheme. The challenge is to find a scheme balancing space with speed. For large trees and alphabets hashing is very attractive at least for some of the nodes

Page 60: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

60

Implementation Issues – (3)

When m and are large enough, the best design is probably a mixture of the above choices. Nodes near the root of the tree tend to have the most children,

so arrays are sensible choice at those nodes. For nodes in the middle of a suffix tree, hashing or balanced

trees may be the best choice.

Sometimes the alphabet size is explicitly presented in the time and space bounds Construction time is O(m log||),using (m||) space. m – is the

size of the string.

Page 61: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

61

Applications of Suffix Trees

Page 62: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

62

Applications of Suffix Trees

Exact string matching Substring problem for a database Longest common substring Suffix arrays

Page 63: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

63

Exact String Matching – (1) Exact matching problem: given a pattern P of length n

and a text T of length m, find all occurrences of P in T in O(n+m) time.

Overview of the ST approach: Built a ST for text T in O(m) time Match the characters of P along the unique path in ST until either

P is exhausted or no more matches are possible In the latter case, P doesn’t appear anywhere in T In the former case, every leaf in the subtree below the point of the

last match is numbered with a starting location of P in T, and every starting location of P in T numbers such a leaf

ST approach spends O(m) preprocessing time and then O(n+k) search time, where k is the number of occurrences of P in T

Page 64: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

64

Exact String Matching – (2)

When search terminates at an element node, P appears exactly once in the source string T.

When the search for P terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of P.

6

x

14253

a

b

x a

c

c

a

b x a c

c

c

b

x

a

c

Notes:

P = xa; T = xabxac

Pattern P matches a path down to the point shown by an arrow, and as required, the leaves below that point are numbered 1 and 4.

Page 65: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

65

Substring Problem for a Database Input: a database D (a set of strings) and a string S Output: find all the strings in D containing S as a

substring Usage: identity of the person Exact string matching methods: cannot work Suffix tree:

D is stored in O(m) space, where m is the total length of all the strings in D.

Suffix tree is built in O(m) time Each lookup of S (the length of S is n) costs O(n) time

Page 66: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

66

Longest common substring –(1) Input: two strings S1 and S2

Output: find the longest substring S common to S1 and S2

Example: S1=common-substring

S2=common-subsequence

Then, S=common-subs

Page 67: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

67

Longest common substring – (2) Build a suffix tree for S1 and S2

Each leaf represents either a suffix from one of S1 and S2, or a suffix from both S1 and S2

Mark each internal node v with a 1 (2) if there is a leaf in the subtree of v representing a suffix from S1(S2)

The path-label of any internal node marked both 1 and 2 is a substring common to both S1 and S2, and the longest such string is the longest common substring.

Page 68: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

68

Longest common substring – (3)

S1$ = xabxac$, S2$ = abx$, S = abx

Page 69: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

69

Suffix Arrays – (1) A suffix array of an m-character string S, is an

array of integers in the range 1 to m, specifying the lexicographic order of the m suffixes of S.

Example: S = xabxac

Page 70: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

70

Suffix Arrays – (2) A suffix array of S can be obtained from the

suffix tree T of S by performing a lexical depth-first search of T

Page 71: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

71

Suffix Arrays: Exact string matching

Given two strings T and P, where |T | = m and |P| = n, find all the occurrences of P in T ?

Using the binary search on the suffix array of T , all the occurrences of P in T can be found in O(n logm) time.

Example: let T = xabxac and P = ac

Page 72: Suffix Trees. 2 Outline Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm Applications of ST

72

References

Dan Gusfield: Algorithms on Strings, Trees, and Sequences. University of California, Davis.Cambridge University Press,1997

Ela Hunt et al.: A database index to large biological sequences. Slides at VLDB, 2001

R.C.T. Lee and Chin Lung Lu.: Suffix Trees. Slides at CS 5313 Algorithms for Molecular Biology