Suffix Trees

Rolf Backofen

Lehrstuhl für Bioinformatik

Institut für Informatik

Course Bioinformatics II, WS 11/12

String Matching

efficiently find all occurrences of a pattern P of length m in a text T of length n

Counting query: reports the number of occurrences of P in T

Reporting query: reports all occurrences of P in T

string matching can be solved with a suffix tree

advantage over other string-matching algorithms:

if T is static, the suffix tree is constructed once in a preprocessing step

the subsequent string matchings are then "fast" (where m << n)

Definitions

T = t_1 t_2 ... t_n

Definition

The substring t_1 ... t_i is called the i-th prefix of T (1 ≤ i ≤ n).

Example: T=ACCTTCCT

first prefix: A

fourth prefix: ACCT

Definition

The substring t_i ... t_n is called the i-th suffix of T (1 ≤ i ≤ n).

Example: T=ACCTTCCT

first suffix: ACCTTCCT

fourth suffix: TTCCT
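The two definitions map directly onto string slicing. The following is a minimal sketch in Python; the helper names `prefix` and `suffix` are illustrative and not part of the slides.

```python
def prefix(text, i):
    """Return the i-th prefix t_1 ... t_i (1-based, 1 <= i <= n)."""
    return text[:i]


def suffix(text, i):
    """Return the i-th suffix t_i ... t_n (1-based, 1 <= i <= n)."""
    return text[i - 1:]


T = "ACCTTCCT"
print(prefix(T, 1), prefix(T, 4))  # A ACCT
print(suffix(T, 1), suffix(T, 4))  # ACCTTCCT TTCCT
```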

Definition

Suffix Tree

A suffix tree for a text T of length n over the alphabet Σ is a rooted directed tree with n leaves. Apart from the root node, all internal nodes have at least two children. All edges are labeled with a non-empty substring of T, and all outgoing edges from a node start with a different character. Each leaf in the suffix tree is labeled with an integer i ∈ {1, ..., n} such that the concatenation of the edge labels on the path from the root to the leaf node spells out the suffix of T that starts at position i. The suffix tree can be constructed in O(n) time and requires O(n) space.

Remark: In order to have a one-to-one correspondence between the suffixes of T and the leaves of the suffix tree, we add a new character $ ∉ Σ to the end of T. This ensures that no suffix is a prefix of another suffix.
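To make the definition concrete, here is a small sketch that builds a suffix tree by inserting one suffix after the other and splitting edges on a mismatch. This is a naive O(n^2) construction chosen for brevity, not the O(n) algorithm mentioned above. The names `Node`, `build_suffix_tree` and `leaves` are assumptions of this sketch, and the text is assumed to end with a unique terminator such as $.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character of edge label -> (edge label, child node)
        self.leaf = None     # 1-based start position of the suffix (leaves only)


def build_suffix_tree(text):
    """Naive O(n^2) construction; text must end with a unique terminator such as '$'."""
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            entry = node.children.get(rest[0])
            if entry is None:
                # no edge starts with this character: hang a new leaf here
                leaf = Node()
                leaf.leaf = start + 1
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = entry
            k = 0  # length of the common prefix of edge label and remaining suffix
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # mismatch inside the edge: split it with a new internal node
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[label[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaves(node, path=""):
    """Yield (leaf label, string spelled on the path from the root) for every leaf."""
    if node.leaf is not None:
        yield node.leaf, path
    for label, child in node.children.values():
        yield from leaves(child, path + label)


T = "ACCTTCCT$"
tree = build_suffix_tree(T)
for pos, spelled in sorted(leaves(tree)):
    assert spelled == T[pos - 1:]   # the path to leaf i spells the i-th suffix
    print(pos, spelled)
```

Because of the terminator, no suffix is a prefix of another one, so every insertion ends by creating exactly one new leaf.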

Example for Suffix Tree

T = ACCTTCCT$

Suffixes:

ACCTTCCT$

CCTTCCT$

CTTCCT$

TTCCT$

TCCT$

CCT$

CT$

T$

$

[Figure: suffix tree for T = ACCTTCCT$; the leaves are labeled with the suffix start positions 1 to 9]

Notations

for a node v in the suffix tree, v̄ (v with an overline) denotes the concatenation of all edge labels on the path from the root to v

|v̄| denotes the string depth of a node v

in order to identify a node v in the suffix tree with v̄ = x, we write x̄

a suffix link sl(v) of an internal node v with v̄ = cb, where c is a character and b is a string, is the node w with w̄ = b

Searching in a Suffix Tree

Task: find pattern P = p_1 ... p_m of length m in the suffix tree for text T of length n

1 set cur_node = root and cur_char = p_1

2 locate the correct outgoing edge from cur_node which starts with cur_char

3 match the subsequent characters of the pattern to the label of the edge located in step 2 character by character until the whole pattern was matched (go to step 4a) or one ends up at a node v. Assume we already matched p_1 ... p_i: set cur_node = v and cur_char = p_(i+1)

4 repeat steps 2 and 3 until:

a) the whole pattern was matched

b) there is no outgoing edge that starts with cur_char (step 2) or the subsequent characters of P cannot be matched (step 3)

Searching in a Suffix Tree (cont.)

step 4a): the whole pattern was matched

suppose the search procedure ended at node w or on the incoming edge of node w

⇒ the occurrences of P in T can be found in the subtree rooted at w

step 4b): there is no outgoing edge that starts with cur_char, or the subsequent characters of P cannot be matched

⇒ P does not occur in T

Searching in a Suffix Tree (cont.)

Counting query: reports the number of occurrences of P in T

step 4a): occurrences of the pattern found

⇒ return the number of leaves in the subtree rooted at w (assuming that all nodes in the suffix tree are labeled with their subtree sizes, this can be done in constant time)

step 4b): no occurrence of the pattern found

⇒ return 0 (constant time)

Runtime for counting query: O(m), i.e. the final node is found in m steps

Reporting query: reports all occurrences of P in T

step 4a): occurrences of the pattern found

⇒ output the labels of all leaves in the subtree rooted at w in O(Occ_T^P) time, where Occ_T^P is the number of occurrences of P in T

step 4b): no occurrence of the pattern found

⇒ output the empty set (constant time)

Runtime for reporting query: O(m + Occ_T^P)
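The search procedure and both query types can be sketched as follows. This is an illustrative Python sketch, not the lecture's implementation: the tree is built with a naive O(n^2) insertion instead of the O(n) construction, and the names `Node`, `build`, `annotate_sizes`, `locate`, `count` and `report` are assumptions of this example.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character of edge label -> (edge label, child node)
        self.leaf = None     # 1-based suffix start position (leaves only)
        self.size = 0        # number of leaves in the subtree


def annotate_sizes(node):
    """Label every node with its number of leaves (post-order)."""
    if node.leaf is not None:
        node.size = 1
    else:
        node.size = sum(annotate_sizes(child) for _, child in node.children.values())
    return node.size


def build(text):
    """Naive O(n^2) suffix tree; text must end with a unique terminator such as '$'."""
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            entry = node.children.get(rest[0])
            if entry is None:
                leaf = Node()
                leaf.leaf = start + 1
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = entry
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):   # split the edge at the mismatch
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[label[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    annotate_sizes(root)
    return root


def locate(root, pattern):
    """Steps 1 to 4: return the node w below the end of the match, or None (step 4b)."""
    node, i = root, 0
    while i < len(pattern):
        entry = node.children.get(pattern[i])        # step 2
        if entry is None:
            return None
        label, child = entry
        k = min(len(label), len(pattern) - i)
        if label[:k] != pattern[i:i + k]:            # step 3
            return None
        node, i = child, i + k
    return node


def count(root, pattern):
    w = locate(root, pattern)
    return 0 if w is None else w.size                # constant time after locate


def report(root, pattern):
    w = locate(root, pattern)
    found, stack = [], ([] if w is None else [w])
    while stack:                                     # collect all leaves below w
        node = stack.pop()
        if node.leaf is not None:
            found.append(node.leaf)
        stack.extend(child for _, child in node.children.values())
    return sorted(found)                             # sorted only for readable output


root = build("ACCTTCCT$")
print(count(root, "CCT"), report(root, "CCT"))  # 2 [2, 6]
print(count(root, "CG"), report(root, "CG"))    # 0 []
```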

Example for Searching

P=CCT

P=CG

[Figure: suffix tree for T = ACCTTCCT$, used to search for P=CCT and P=CG]

Summary

Task

Find pattern P of length m in a text T of length n.

Suffix Tree

The suffix tree for T can be constructed in O(n) time and space. With the suffix tree, the counting query can be solved in O(m) time and the reporting query in O(m + Occ_T^P) time, where Occ_T^P is the number of occurrences of P in T.

Applications

1 searching for exact patterns (already discussed)

2 find Maximal Unique Matches

3 find all maximal pairs

2. Maximal Unique Matches

We have as an input two sequences A and B.

Definition

an occurrence of the same substring in A and B is called a match

a match in A and B is left (right) maximal if the match cannot be extended to the left (right), i.e. the characters to the immediate left (right) differ

a Maximal Unique Match (MUM) is a substring that occurs exactly once in both A and B and is left and right maximal

Example: MUMs for A=ATGAC and B=AGAGGAC

GAC is a Maximal Unique Match as it occurs only once in A and B and cannot be extended

GA is not a Maximal Unique Match as it occurs twice in B
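The definition can be checked directly with a brute-force sketch that tests every substring of A. It is far slower than a suffix tree approach and is only meant to make the three conditions (unique in A, unique in B, left and right maximal) explicit. The function names are illustrative, and a match that touches the start or end of a sequence is treated as not extendable in that direction.

```python
def occurrences(s, sub):
    """Return all 0-based start positions of sub in s (overlaps allowed)."""
    return [i for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i)]


def mums_bruteforce(a, b):
    """Return the set of Maximal Unique Matches of a and b by checking the definition."""
    result = set()
    for i in range(len(a)):
        for j in range(i + 1, len(a) + 1):
            sub = a[i:j]
            occ_a, occ_b = occurrences(a, sub), occurrences(b, sub)
            if len(occ_a) != 1 or len(occ_b) != 1:
                continue                      # not unique in both sequences
            p, q = occ_a[0], occ_b[0]
            # left maximal: preceding characters differ or a sequence boundary is hit
            left = p == 0 or q == 0 or a[p - 1] != b[q - 1]
            e, f = p + len(sub), q + len(sub)
            # right maximal: following characters differ or a sequence boundary is hit
            right = e == len(a) or f == len(b) or a[e] != b[f]
            if left and right:
                result.add(sub)
    return result


print(mums_bruteforce("ATGAC", "AGAGGAC"))  # {'GAC'}
```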

2. Maximal Unique Matches (cont.)

Why do we need MUMs?

⇒ for global alignments of large sequences

a significantly long MUM is almost certain to be part of a global alignment of the sequences A and B (being unique means there is no ambiguity about where it aligns)

to get the full alignment we only need to align the sequences in the gaps between the MUMs

How to find all MUMs efficiently?

generalized suffix tree for the string A#B$

2. Maximal Unique Matches (cont.)

leaf labels: the first entry identifies the string and the second one the starting position

observation: we can delete the edge label on leaf nodes after the #

example for A=CGAA and B=CGA, CGAA#CGA$

[Figure: generalized suffix tree for CGAA#CGA$ with leaf labels A,1 to A,5 and B,1 to B,4 and two highlighted internal nodes v and w]

blue arrows = suffix links: sl(w) = v <==> w = Xv for a single character X

2. Maximal Unique Matches (cont.)

1 create the generalized suffix tree T for A#B$

2 mark each internal node v of T with exactly two child nodes where one is a leaf from A and the other is a leaf from B (because the substring has to occur only once in each sequence)

3 for each internal node v unmark sl(v)

4 report all marked nodes as Maximal Unique Matches
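The four steps above can be sketched in Python as follows. Two simplifications are assumptions of this sketch and not taken from the slides: the tree is built naively in O(n^2) time, and suffix links are not stored but emulated by looking up the path label without its first character. The names `Node`, `build` and `mums_suffix_tree` are illustrative.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character of edge label -> (edge label, child node)
        self.leaf = None     # 0-based start position in A#B$ (leaves only)


def build(text):
    """Naive O(n^2) suffix tree; text must end with a unique terminator."""
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            entry = node.children.get(rest[0])
            if entry is None:
                leaf = Node()
                leaf.leaf = start
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = entry
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):   # split the edge at the mismatch
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[label[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def mums_suffix_tree(a, b):
    """Steps 1 to 4; assumes '#' and '$' occur in neither a nor b."""
    root = build(a + "#" + b + "$")                         # step 1
    internal = {}                                           # path label -> internal node
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if node.leaf is None:
            internal[path] = node
            stack.extend((child, path + label) for label, child in node.children.values())

    def one_leaf_from_each(node):
        kids = [child for _, child in node.children.values()]
        if len(kids) != 2 or any(kid.leaf is None for kid in kids):
            return False
        # start positions before '#' belong to A, those after it to B
        return sorted(kid.leaf < len(a) for kid in kids) == [False, True]

    marked = {path for path, node in internal.items()       # step 2 (root excluded)
              if path and one_leaf_from_each(node)}
    for path in internal:                                   # step 3
        if path:
            marked.discard(path[1:])                        # label of sl(v)
    return sorted(marked)                                   # step 4


print(mums_suffix_tree("CGAA", "CGA"))       # ['CGA']
print(mums_suffix_tree("ATGAC", "AGAGGAC"))  # ['GAC']
```

Step 3 removes exactly those candidates that are not left maximal: if both occurrences of a marked string are preceded by the same character X, then X followed by that string is itself a branching node whose suffix link points to the candidate.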

2. Maximal Unique Matches (cont.)

[Figure: generalized suffix tree for CGAA#CGA$ with marked nodes v and w; blue arrows are suffix links]

1 create generalized suffix tree for CGAA#CGA$

2 mark nodes v and w

3 unmark node v, as v = sl(w) (blue arrows)

4 report node w = CGA as a Maximal Unique Match

2. Maximal Unique Matches (cont.)

CGA is a MUM as node w = CGA has exactly one child labeled with A and one with B and it cannot be extended to the left (no suffix link points to w)

GA is not a MUM as GA can be extended to the left (v = sl(w))

example for A=CGAA and B=CGA, CGAA#CGA$

[Figure: generalized suffix tree for CGAA#CGA$ with nodes v = GA and w = CGA]

3. All maximal pairs

Definition

A maximal pair in a sequence A is a pair of occurrences of the substring α in A such that the characters to the immediate left (right) of the two occurrences differ (the pair is left and right maximal). A maximal pair is represented by (i, j, |α|), where i and j are the starting positions of the occurrences of α and i < j.

Example:

pos:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
A =   A  G  A  C  C  A  G  A  C  A  T  A  G  A  C  A

maximal pair AGAC: (1,6,4) and (1,12,4)

maximal pair AGACA: (6,12,5)

(6,12,4) is not a maximal pair: both occurrences of AGAC are followed by A, so the pair is not right maximal and extends to AGACA
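The example can be reproduced with a brute-force sketch that follows the definition. For every pair of start positions i < j whose preceding characters differ, the only right-maximal length is the longest common extension to the right. The function name is illustrative; the first position is treated as left maximal with respect to every other position, and the end of the sequence as a mismatch.

```python
def maximal_pairs_bruteforce(a):
    """Return all maximal pairs (i, j, length) of a with 1-based i < j."""
    n = len(a)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # left maximal: the characters before both occurrences differ
            if i > 0 and a[i - 1] == a[j - 1]:
                continue
            # extend to the right as far as both occurrences agree
            k = 0
            while j + k < n and a[i + k] == a[j + k]:
                k += 1
            if k > 0:
                pairs.append((i + 1, j + 1, k))
    return pairs


A = "AGACCAGACATAGACA"
print([p for p in maximal_pairs_bruteforce(A) if p[2] >= 4])
# [(1, 6, 4), (1, 12, 4), (6, 12, 5)]
```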

3. All maximal pairs (cont.)

build suffix tree for sequence A

leaf annotation: in addition to the position i of the suffix, we store the character A_(i-1) that occurs immediately before the suffix

[Figure: suffix tree for A = ACCTTCCT$ with leaf annotations (left character, position): _ 1, A 2, C 3, C 4, T 5, T 6, C 7, C 8, T 9]

pos: 1 2 3 4 5 6 7 8 9
A =  A C C T T C C T $

3. All maximal pairs (cont.)

observation: a substring α can only be a maximal pair if the corresponding node ᾱ has at least two children (⇒ right maximal) with different characters in their annotation (⇒ left maximal)

How to find all maximal pairs of a node v?

Reporting: for each character x and each child v' of v, the cartesian product of the list for x at v' with the union of every list for a character x' ≠ x at a child w ≠ v' is formed; each pair in this list together with the string depth of v is a maximal pair

Linking: to create the list for character x at node v, we link the lists for character x that exist for each of v's children

do a post-order traversal of the nodes in the suffix tree (children first, then the current node, i.e. from the longest substrings to the shortest) to get all maximal pairs
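The reporting and linking steps can be sketched with a recursive post-order traversal. The sketch below makes several assumptions that are not from the slides: the tree is built naively in O(n^2) time, the per-character lists are Python dictionaries of lists, children are merged one at a time so that each pair of leaves from different children is considered once, and the suffix at position 1 gets the left character `None`. The names `Node`, `build` and `maximal_pairs_tree` are illustrative.

```python
from collections import defaultdict


class Node:
    def __init__(self):
        self.children = {}   # first character of edge label -> (edge label, child node)
        self.leaf = None     # 1-based suffix start position (leaves only)


def build(text):
    """Naive O(n^2) suffix tree; text must end with a unique terminator such as '$'."""
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            entry = node.children.get(rest[0])
            if entry is None:
                leaf = Node()
                leaf.leaf = start + 1
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = entry
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):   # split the edge at the mismatch
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[label[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def maximal_pairs_tree(text):
    """Return all maximal pairs (i, j, length) of text, which must end with '$'."""
    pairs = []

    def visit(node, depth):
        # returns {left character: [suffix positions]} for the subtree of node
        if node.leaf is not None:
            left = text[node.leaf - 2] if node.leaf > 1 else None
            return {left: [node.leaf]}
        acc = defaultdict(list)
        for label, child in node.children.values():
            lists = visit(child, depth + len(label))      # post-order: children first
            if depth > 0:                                 # the root spells the empty string
                for x, ps in lists.items():               # reporting
                    for y, qs in acc.items():
                        if x != y:
                            pairs.extend((min(p, q), max(p, q), depth)
                                         for p in ps for q in qs)
            for x, ps in lists.items():                   # linking
                acc[x].extend(ps)
        return acc

    visit(build(text), 0)
    return sorted(pairs)


print(maximal_pairs_tree("ACCTTCCT$"))
# [(2, 3, 1), (2, 6, 3), (2, 7, 1), (3, 6, 1), (4, 5, 1), (5, 8, 1), (6, 7, 1)]
```

Positions from the same child are never paired at the parent, which is what guarantees right maximality: two suffixes in different subtrees of v continue with different characters after the path label of v.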

3. All maximal pairs (cont.)

[Figure: annotated suffix tree for ACCTTCCT$; node v = CCT receives the annotation T 6, A 2 from its two leaf children]

for node v = CCT, we report the maximal pair CCT as (2,6,3)

we build the annotation for node v by combining the two leaf annotations of the children of v

3. All maximal pairs (cont.)

[Figure: annotated suffix tree for ACCTTCCT$; node w = CT receives the annotation C 7, 3 from its two leaf children]

for node w = CT, we report no maximal pair (both occurrences are preceded by C)

we build the annotation for node w by combining the two leaf annotations of the children of w

3. All maximal pairs (cont.)

[Figure: fully annotated suffix tree for ACCTTCCT$; node C carries the annotation T 6, A 2, C 7 3 and node T carries the annotation C 4 8, T 5]

repeat steps for all internal nodes

report the following maximal pairs:

1 CCT as (2,6,3)

2 C as (6,7,1), (3,6,1), (2,3,1), (2,7,1)

3 T as (5,8,1), (4,5,1)

3. All maximal pairs (cont.)

Runtime analysis

creation of the suffix tree, the post-order traversal, and all the list linking take O(n) time

each operation of the cartesian product produces a unique maximal pair

⇒ O(k) time, where k is the number of maximal pairs

in total the algorithm takes O(n + k) time

in many applications we are only interested in maximal pairs of at least a certain length m

⇒ runtime is reduced to O(n + k_m), where k_m is the number of maximal pairs with length ≥ m

Recommended Reading

Dan Gusfield: Algorithms on Strings, Trees, and Sequences. Cambridge University Press (1997)

A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg: Alignment of Whole Genomes. Nucleic Acids Research, 27:2369-2376, 1999