suffix trees. 2 outline introduction suffix trees (st) building sts in linear time: ukkonen’s...

Suffix Trees

2

Outline

Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen’s algorithm

Applications of ST

3

Introduction

4

Substrings String is any sequence of characters. Substring of string S is a string composed

of characters i through j, i j of S. S = cater;ate is a substring.car is not a substring. Empty string is a substring of S.

5

Subsequences

Subsequence of string S is a string composed of characters i1 < i2 < … < ik of S. S = cater; ate is a subsequence.car is a subsequence.Empty string is a subsequence of S.

6

String/Pattern Matching - I You are given a source string S. Suppose we have to answer queries of the

form: is the string pi a substring of S? Knuth-Morris-Pratt (KMP) string matching.

O(|S| + | pi |) time per query.

O(n|S| + i | pi |) time for n queries.

Suffix tree solution.O(|S| + i | pi |) time for n queries.

7

String/Pattern Matching - II KMP preprocesses the query string pi,

whereas the suffix tree method preprocesses the source string (text) S.

The suffix tree for the text is built in O(m) time during a pre-processing stage; thereafter, whenever a string of length O(n) is input, the algorithm searches it in O(n) time using that suffix tree.

String Matching: Prefixes & Suffixes

Substrings of S beginning at the first position of S are called prefixes of S, and substrings that end at its last position are called suffixes of S.

S=AACTAGPrefixes: AACTAG,AACTA,AACT,AAC,AA,ASuffixes: AACTAG,ACTAG,CTAG,TAG,AG,G

pi is a substring of S iff pi is a prefix of some suffix of S.

9

Suffix Trees

10

Definition: Suffix Tree (ST) T for S of length m

1. A rooted tree with m leaves numbered from 1 to m.

2. Each internal node, excluding the root, of T has at least 2 children.

3. Each edge of T is labeled with a nonempty substring of S.

4. No two edges out of a node can have edge-labels starting with the same character.

5. For any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S, namely S[i,m], that starts at position i.

11

Example: Suffix Tree for S=xabxac

12

Existence of a suffix tree S

If one suffix Sj of S matches a prefix of another suffix Si of S, then the path for Sj would not end at a leaf.

S = xabxa S1 = xabxa and S4 = xa How to avoid this problem?

Assume that the last character of S appears nowhere else in S. Add a new character $ not in the alphabet to the end of S.

13

Example: Suffix Tree for S=xabxac$

14

Building STs in linear time:

Ukkonen’s algorithm

15

Building STs in linear time

Weiner’s algorithm [FOCS, 1973] ”The algorithm of 1973” called by Knuth First algorithm of linear time, but much space

McGreight’s algorithm [JACM, 1976] Linear time and quadratic space More readable

Ukkonen’s algorithm [Algorithmica, 1995] Linear time algorithm and less space This is what we will focus on

16

Implicit Suffix Trees Ukkonen’s algorithm constructs a sequence of implicit STs, the

last of which is converted to a true ST of the given string. An implicit suffix tree for string S is a tree obtained from the suffix

tree for S$ by removing $ from all edge labels removing any edges that now have no label removing any node that does not still have at least two children

An implicit suffix tree for prefix S[1,i] of S is similarly defined based on the suffix tree for S[1,i]$.

Ii will denote the implicit suffix tree for S[1,i]. Each suffix is in the tree, but may not end at a leaf.

17

Example: Construction of the Implicit ST

Implicit tree for xabxa from tree for xabxa$ {xabxa$, abxa$, bxa$, xa$, a$, $}

1

bb

bx

x

xx

a

aa

aa$

$

$

$

$$

23

6

5

4

18

Construction of the Implicit ST: Remove $

Remove $ {xabxa$, abxa$, bxa$, xa$, a$, $}

1

bb

bx

x

xx

a

aa

aa$

$

$

$

$$

23

6

5

4

19

Construction of the Implicit ST: After the Removal of $

Remove $ {xabxa, abxa, bxa, xa, a}

1

bb

bx

x

xx

a

aa

aa

23

6

5

4

20

Construction of the Implicit ST: Remove unlabeled edges

Remove unlabeled edges {xabxa, abxa, bxa, xa, a}

1

bb

bx

x

xx

a

aa

aa

23

6

5

4

21

Construction of the Implicit ST: After the Removal of Unlabeled Edges

Remove unlabeled edges {xabxa, abxa, bxa, xa, a}

1

bb

bx

x

xx

a

aa

aa

23

22

Construction of the Implicit ST: Remove interior nodes

Remove internal nodes with only one child {xabxa, abxa, bxa, xa, a}

1

bb

bx

x

xx

a

aa

aa

23

23

Construction of the Implicit ST: Final implicit tree

Remove internal nodes with only one child {xabxa, abxa, bxa, xa, a}

1

b b

bx

x

xxa a a

aa

23

24

Ukkonen’s Algorithm (UA) Ii is the implicit suffix tree of the string S[1, i] Construct I1 /* Construct Ii+1 from Ii */ for i = 1 to m-1 do /* phase i+1 */

for j = 1 to i+1 do /* extension j */ Find the end of the path P from the root whose

label is S[j, i] in Ii and extend P with S[i+1] by suffix extension rules;

Convert Im into a suffix tree S

25

Example S = xabxacd$ i+1 = 1

x i+1 = 2

extend x to xa a

i+1 = 3 extend xa to xab extend a to ab b

…

26

Extension Rules Goal: extend each S[j,i] into S[j,i+1] Rule 1: S[j,i] ends at a leaf

Add character S(i+1) to the end of the label on that leaf edge Rule 2: S[j,i] doesn’t end at a leaf, and the following

character is not S(i+1) Split a new leaf edge for character S(i+1) May need to create an internal node if S[j,i] ends in the

middle of an edge Rule 3: S[j,i+1] is already in the tree

No update

27

Example: Extension Rules

Implicit tree for axabxb from tree for axabx

1

bb

bx x

x

xa a

a

2

3x

bx4b

b

b

b

Rule 1: at a leaf node

b

Rule 3: already in treeRule 2: add a leaf edge (and an interior node)

b

5

28

UA for axabxc (1)

S[1,3]=axa

E S(j,i) S(i+1)

1 ax a2 x a3 a

29

UA for axabxc (2)

30

UA for axabxc (3)

31

UA for axabxc (4)

32

Observations Once S[j,i] is located in the tree, extension rules take only

constant time Naively we could find the end of any suffix S[j,i] in O(S[j,i]) time

by walking from the root of the current tree. By that approach, Im could be created in O(m3) time.

Making Ukkonen’s algorithm O(m) Suffix links Skip and count trick Edge-label compression A stopper Once a leaf, always a leaf

33

Suffix Links Consider the two strings a and xa Suppose some internal node v of the tree is labeled with xa

and another node s(v) in the tree is labeled with a The edge (v,s(v)) is called a suffix link Do all internal nodes (the root is not considered an internal

node) have suffix links?

34

Example: suffix links

S = ACACACAC$

AC

AC$

AC

AC

$ $

$

$

$

$

$

C

AC

AC

AC$

35

Suffix Link Lemma If a new internal node v with path-label xa is

added to the current tree in extension j of some phase i+1, then the path labeled a already ends at an internal

node of the tree or the internal node labeled a will be created in

the extension of j+1 in the same phase i+1string a is empty and s(v) is the root

36

Proof of Suffix Link Lemma A new internal node is created only by the extension rule

2 This means that there are two distinct suffixes of S[1,i+1]

that start with xaxaS(i+1) and xacb where c is not S(i+1)

This means that there are two distinct suffixes of S[1,i+1] that start with a aS(i+1) and acb where c is not S(i+1)

Thus, if a is not empty, a will label an internal node once extension j+1 is processed which is the extension of a

37

Corollary of Suffix Link Lemma Every internal node of an implicit suffix

tree has a suffix link from it.

38

How to use suffix links - 1 S[1,i] must end at a leaf since it is the longest

string in the implicit tree Ii Keep a pointer to this leaf in all cases and extend

according to rule 1 Locating S[j+1,i] from S[j,i] which is at node w

If w is an internal node, set v to w Otherwise, set v = parent(w) If v is the root, you must traverse from the root to find

S[j+1,i] If not, go to s(v) and begin search for the remaining

portion of S[j,i] from there

39

How to use suffix links - 2

40

Skip and Count Trick – (1) Problem: Moving down from s(v), directly

implemented, takes time proportional to the number of characters compared

Solution: To make running time proportional to the number of nodes in the path searched, instead of the number of characters

41

Skip and Count Trick – (2) After 4 nodes down-skips, the end of S[j, i]

is found.

42

Skip and Count Trick – (3)

Node-depth of v, denoted (ND(v)), is the number of nodes on the path from the root to the node v

Lemma: For any suffix link (v, s(v)) traversed in Ukkonen’s algorithm, at that moment, ND(v) ND(s(v))+1

43

Skip and Count Trick – (4) At the moment of traversing (v,s(v)):

ND(v) ND(s(v))+1

44

Skip and Count Trick – (5) The current node-depth of the algorithm is the node

depth of the node most recently visited by the algorithm Lemma: Using the skip and count trick, any phase of

Ukkonen’s algorithm takes O(m) time. Up-walk: decreases the current node-depth by 1 Suffix link traversal: same as up-walk Totally, the current node-depth is decreased by 2m. No node has depth >m. The total possible increment to the current node-depth is

3m.

45

Edge Label Representation Potential Problem

Size of edge labels may require (m2) space Thus, the time for the algorithm is at least as large as the size of

its output Example

S = abcdefghijklmnopqrstuvwxyz Total length is j<m+1 j = m(m+1)/2

Similar problem can happen when the length of the string is arbitrarily larger than the alphabet size

Solution Label edges with pair of indices indicating beginning and end of

positions of the substring in S

46

Modified Extension Rules

Rule 2: new leaf edge (phase i+1) create edge (i+1, i+1) split edge (p, q) => (p, w) and (w + 1, q)

Rule 1: leaf edge extension label had to be (p,i) before extension given rule 2 above and an induction argument:

(p, q) => (p, q + 1)

Rule 3 Do nothing

47

Full edge label representation

String S = xabxa$

1

bb

bx

x

xx

a

aa

aa$

$

$

$

$$

23

6

5

4

48

Edge-label Compression

String S = xabxa$

1

(1,2) [or (4,5)?]

(3,6)

(2,2)(6,6)

23

65

4

(3,6)(6,6)

(6,6) (3,6)

49

A Stopper In any phase, if suffix extension rule 3 applies in

extension j, it will also apply in all extensions k, where k>j, until the end of the phase.

The extensions in phase i+1 that are done after the first execution of rule 3 are said to be done implicitly. This is in contrast to any extension j where the end of S[j, i] is explicitly found. An extension of that kind is called and explicit extension.

Hence, we can end any phase i+1 when the first extension rule 3 applies.

50

Once a leaf, always a leaf – (1) If at some point in UA a leaf is created and

labeled j (for the suffix starting at position j of S), then that leaf will remain a leaf in all successive trees created during the algorithm.

In any phase i, there is an initial sequence of consecutive extensions (starting with extension 1) in which only rule 1 or 2 applies, where let ji be the last extension in this sequence.

Note that ji ji+1.

51

Once a leaf, always a leaf – (2)

Implicit extensions: for extensions 1 to ji, write [·, e] on the leaf edge, where e is a symbol denoting the current end and is set to i + 1 once at the beginning. In later phases, we will not need to explicitly extend this leaf but rather can implicitly extend it by incrementing e once in its global location.

Explicit extensions: from extension ji+1 till first rule 3 extension is found (or until extension i+1 is done)

52

Once a leaf, always a leaf – (3): Single phase algorithm

j* is the last explicit extension computed in phase i+1 Phase i+1

Increment e to i+1 (implicitly extending all existing leaves) Explicitly compute successive extensions starting at ji+1 and

continuing until reaching the first extension j* where rule 3 applies or no more extensions needed

Set ji+1 to j* -1, to prepare to the next phase Observation

Phase i and i+1 share at most 1 explicit extension

53

Example: S=axaxbb$ - (1) e = 1, a J1 = 1

1

1

(1,e)

I1

e = 2, ax S[1,2] : skip S[2,2] : rule 2, create(2, e) j2 = 2

1

1

(1,e)

I2

2

(2,e)

e = 3, axa S[1,3] .. S[2,3] : skip S[3,3] : rule 3 j3 = 2

1

1

(1,e)

I3

2

(2,e)

54

Example: S=axaxbb$ - (2) e = 4, axax S[1,4] .. S[2,4] : skip S[3,4] : rule 3 S[4,4] : auto skip j4 = 2

1

1

(1,e)

I4

2

(2,e)

e = 5, axaxb S[1,5] .. S[2,5] : skip S[3,5] : rule 2, split (1,e)

→ (1, 2) and (3,e), create (5,e) S[4,5] : rule 2, split (2,e)

→ (2,2) and (3,e), create (5,e) S[5,5] : rule 2, create (5,e) j5 = 5

1(1,2)

I5

2

(2,2)

1 3

(3,e)(5,e)

4

(3,e)(5,e)

(5,e)

5

55

Example: S=axaxbb$ - (3)

e = 6, axaxbb S[1,6] .. S[5,6] : skip S[6,6] : rule 3 j6 = 5

e = 7, axaxbb$ S[1,7] .. S[5,7] : skip S[6,7] : rule 2, split (5,e)

→ (5,5) and (6,e), create (6,e) S[7,7] : rule 2, create (7,e) j7 = 7

1(1,2)

I6

2

(2,2)

1 3

(3,e)(5,e)

4

(3,e)(5,e)

(5,e)

5

1(1,2)

I7

2

(2,2)

1 3

(3,e)(5,e)

4

(3,e)(5,e)

(5,5)

5

6(6,e)

(7,e)

(7,e)7

56

Complexity of UA Since all the implicit extensions in any phase is constant,

their total cost is O(m). Totally, only 2m explicit extensions are executed. The max number of down-walking skips is O(m). Time-complexity of Ukkonen’s algorithm: O(m)

57

Finishing up

Convert final implicit suffix tree to a true suffix tree

Add $ using just another phase of executionNow all suffixes will be leaves

Replace e in every leaf edge with mJust requires a traversal of tree which is O(m)

time

58

Implementation Issues – (1) When the size of the alphabet grows:

For large trees suffix links allow an algorithm to move quickly from one part of the tree to a distant part of the tree. This is great for worst-case time bounds, but its horrible if the tree isn't entirely in memory.

Thus, implementing ST to reduce practical space use can be a serious concern.

The main design issues are how to represent and search the branches out of the nodes of the tree.

A practical design must balance between constraints of space and need for speed

59

Implementation Issues – (2) There are four basic choices to represent branches

An array of size (||) at each non-leaf node v A linked list at node v of characters that appear at the beginning of the

edge-labels out of v. If its kept in sorted order it reduces the average time to search for a

given character In the worst case it, adds time || to every node operation. If the

number of children of v is large, then little space is saved over the array while noticeably degrading performance

A balanced tree implements the list at node v Additions and searches take O(logk) time and O(k) space, where k is

the number of children of v. This alternative makes sense only when k is fairly large.

A hashing scheme. The challenge is to find a scheme balancing space with speed. For large trees and alphabets hashing is very attractive at least for some of the nodes

60

Implementation Issues – (3)

When m and are large enough, the best design is probably a mixture of the above choices. Nodes near the root of the tree tend to have the most children,

so arrays are sensible choice at those nodes. For nodes in the middle of a suffix tree, hashing or balanced

trees may be the best choice.

Sometimes the alphabet size is explicitly presented in the time and space bounds Construction time is O(m log||),using (m||) space. m – is the

size of the string.

61

Applications of Suffix Trees

62

Applications of Suffix Trees

Exact string matching Substring problem for a database Longest common substring Suffix arrays

63

Exact String Matching – (1) Exact matching problem: given a pattern P of length n

and a text T of length m, find all occurrences of P in T in O(n+m) time.

Overview of the ST approach: Built a ST for text T in O(m) time Match the characters of P along the unique path in ST until either

P is exhausted or no more matches are possible In the latter case, P doesn’t appear anywhere in T In the former case, every leaf in the subtree below the point of the

last match is numbered with a starting location of P in T, and every starting location of P in T numbers such a leaf

ST approach spends O(m) preprocessing time and then O(n+k) search time, where k is the number of occurrences of P in T

64

Exact String Matching – (2)

When search terminates at an element node, P appears exactly once in the source string T.

When the search for P terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of P.

6

x

14253

a

b

x a

c

c

a

b x a c

c

c

b

x

a

c

Notes:

P = xa; T = xabxac

Pattern P matches a path down to the point shown by an arrow, and as required, the leaves below that point are numbered 1 and 4.

65

Substring Problem for a Database Input: a database D (a set of strings) and a string S Output: find all the strings in D containing S as a

substring Usage: identity of the person Exact string matching methods: cannot work Suffix tree:

D is stored in O(m) space, where m is the total length of all the strings in D.

Suffix tree is built in O(m) time Each lookup of S (the length of S is n) costs O(n) time

66

Longest common substring –(1) Input: two strings S1 and S2

Output: find the longest substring S common to S1 and S2

Example: S1=common-substring

S2=common-subsequence

Then, S=common-subs

67

Longest common substring – (2) Build a suffix tree for S1 and S2

Each leaf represents either a suffix from one of S1 and S2, or a suffix from both S1 and S2

Mark each internal node v with a 1 (2) if there is a leaf in the subtree of v representing a suffix from S1(S2)

The path-label of any internal node marked both 1 and 2 is a substring common to both S1 and S2, and the longest such string is the longest common substring.

68

Longest common substring – (3)

S1$ = xabxac$, S2$ = abx$, S = abx

69

Suffix Arrays – (1) A suffix array of an m-character string S, is an

array of integers in the range 1 to m, specifying the lexicographic order of the m suffixes of S.

Example: S = xabxac

70

Suffix Arrays – (2) A suffix array of S can be obtained from the

suffix tree T of S by performing a lexical depth-first search of T

71

Suffix Arrays: Exact string matching

Given two strings T and P, where |T | = m and |P| = n, find all the occurrences of P in T ?

Using the binary search on the suffix array of T , all the occurrences of P in T can be found in O(n logm) time.

Example: let T = xabxac and P = ac

72

References

Dan Gusfield: Algorithms on Strings, Trees, and Sequences. University of California, Davis.Cambridge University Press,1997

Ela Hunt et al.: A database index to large biological sequences. Slides at VLDB, 2001

R.C.T. Lee and Chin Lung Lu.: Suffix Trees. Slides at CS 5313 Algorithms for Molecular Biology

suffix trees. 2 outline introduction suffix trees (st) building sts in linear time: ukkonen’s...

Documents