© Jeff Parker, 2009

Suffix Trees

Outline

An introduction to the Suffix Tree

Some sample applications

How to build a Suffix Tree efficiently

Problems

We have a corpus of information: genes, proteins

We want to see what two sequences have in common. We want to be able to find matches for a gene or protein. Model this as a search for a pattern in a text.

The problem is hard because strings are very long and the set of possible matches is large

Today we will focus on exact matches

Pattern Matching

The basis for the simplest (exact) pattern match follows

Algorithm

Line up text and pattern

Compare the two

If they match, report the position of the match

Slide pattern to the right and try again

Compare pattern at this position

// Does the pattern match the text at this position?
boolean compare(String text, int pos, String pattern)
{
    for (int i = 0; i < pattern.length(); i++)
        if (text.charAt(pos + i) != pattern.charAt(i))
            return false;   // first mismatch: no match at pos
    return true;            // every character of the pattern matched
}

Simple Pattern Match

// Where is pattern pat in string text? Returns -1 if there is no match.
int findMatch(String text, String pat)
{
    int pos = 0;
    while (pos <= text.length() - pat.length())
    {
        if (compare(text, pos, pat))
            return pos;
        pos++; // Slide pattern right one space
    }
    return -1;
}

Analysis

For a pattern of length N and a text of length M

This algorithm behaves well in practice: O(N + M)

The worst case is bad: O(NM)
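To make the gap between the two bounds concrete, here is a small sketch that counts character comparisons made by the slide-and-compare search. The class and method names are hypothetical and the two inputs are chosen only for illustration.

```java
final class NaiveMatchCost {
    /** Runs the slide-and-compare search and returns how many character comparisons it made. */
    static long countComparisons(String text, String pat) {
        long comparisons = 0;
        for (int pos = 0; pos + pat.length() <= text.length(); pos++) {
            int i = 0;
            while (i < pat.length()) {
                comparisons++;
                if (text.charAt(pos + i) != pat.charAt(i)) {
                    break; // mismatch: slide the pattern right
                }
                i++;
            }
            if (i == pat.length()) {
                return comparisons; // first match found, stop counting
            }
        }
        return comparisons;
    }
}
```

With text "abcdefghij" and pattern "xyz" every alignment fails on its first character, so `countComparisons` returns 8, one per alignment. With text "aaaaaaaaaa" and pattern "aaab" each of the 7 alignments matches three characters before failing, so it returns 28 = (M - N + 1) * N.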

We can do better if we preprocess

Preprocess Pattern: Boyer-Moore, Knuth, Morris, Pratt

Preprocess text: Suffix Tree

O(|pattern|) Pattern Matching

Rather than view the problem as moving the pattern, rephrase

Faster Pattern Matching

Is our pattern the prefix of a suffix of the text string S? Take all suffixes and slide them left

We want to find a suffix that has the pattern as a prefix

Sort suffixes

Build a trie: this allows O(N) search for a pattern of length N

Suffix Trie

Multi-way tree

Each branch is labeled with char

If the trie is ready, match takes O(|pattern|) time

Example: text S is ababc

s1 = ababc

s2 = babc

s3 = abc

s4 = bc

s5 = c

Suffix Trie

Suffix trie takes O(|S|^2) space

Each step of search for match takes constant time

If no branch matches char, we fail

Leaf holds name of suffix

We may have multiple matches

String ab occurs twice: it is a prefix of s1 and s3
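The suffix trie search can be sketched as below. This is a minimal illustration, not the slides' implementation: the class `SuffixTrie` and its methods are hypothetical names, and for simplicity every node (not just each leaf) records the 1-based names of the suffixes that pass through it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SuffixTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        // 1-based names of the suffixes whose path runs through this node
        final List<Integer> suffixes = new ArrayList<>();
    }

    private final Node root = new Node();

    /** Inserts every suffix of text, one character per branch: O(|S|^2) nodes in the worst case. */
    SuffixTrie(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.next.computeIfAbsent(text.charAt(i), c -> new Node());
                node.suffixes.add(start + 1);
            }
        }
    }

    /** Returns the names of all suffixes that have pattern as a prefix (empty if none). */
    List<Integer> find(String pattern) {
        Node node = root;
        for (int i = 0; i < pattern.length(); i++) {
            node = node.next.get(pattern.charAt(i));
            if (node == null) {
                return new ArrayList<>(); // no branch matches this char: we fail
            }
        }
        return new ArrayList<>(node.suffixes);
    }
}
```

For the text ababc, `new SuffixTrie("ababc").find("ab")` returns [1, 3], matching the observation that ab is a prefix of s1 and s3, and the walk takes one step per pattern character.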


Suffix Tree

Nodes that mark a split are called essential

Remove non-essential nodes, and label edges with strings

Properties

Tree has |S| leaves (when every suffix ends at a leaf), at most |S|-1 interior nodes, and so at most 2|S|-1 nodes and 2|S|-2 edges

Algorithm for search is the same: walk the tree matching edges

While this has fewer nodes, it is not clear that we need any less storage: the sum of the lengths of the edge strings can still be O(N^2)

Any speedup building the tree?

Storage is easier to address

Worst Case Storage

Here are some trees that need O(N^2) storage when stored as tries

abcdefg

We can get a trie that needs O(N^2) storage with a limited alphabet:

a^n b^n a^n b^n c

(Figure: the tree for abcdefg, with edges abcdefg, bcdefg, cdefg, defg, efg, ...)

Efficient Storage

We store the whole string once, and keep pointers to that string in nodes

We have constant space per node and O(|S|) nodes, thus linear space
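The following sketch shows one way to realize this storage scheme: each node keeps only a (start, end) pair of indices into the single stored string. The names are hypothetical, the code assumes the character $ does not occur in the text, and the construction is the naive one (insert suffixes from longest to shortest, walking from the root each time), so it takes quadratic time in the worst case and is not one of the linear-time algorithms.

```java
import java.util.HashMap;
import java.util.Map;

final class CompactSuffixTree {
    private static final class Node {
        int start, end; // the edge into this node is labeled text[start, end)
        final Map<Character, Node> children = new HashMap<>();
        Node(int start, int end) { this.start = start; this.end = end; }
    }

    private final String text;            // the whole string, stored once
    private final Node root = new Node(0, 0);
    private int nodeCount = 1;

    CompactSuffixTree(String s) {
        text = s + "$";                    // assumption: '$' does not occur in s
        for (int i = 0; i < text.length(); i++) insertSuffix(i);
    }

    private void insertSuffix(int i) {
        Node node = root;
        int pos = i;
        while (true) {
            Node child = node.children.get(text.charAt(pos));
            if (child == null) {           // no edge starts with this char: hang a new leaf
                node.children.put(text.charAt(pos), new Node(pos, text.length()));
                nodeCount++;
                return;
            }
            int k = child.start;
            while (k < child.end && text.charAt(k) == text.charAt(pos)) { k++; pos++; }
            if (k == child.end) { node = child; continue; } // whole edge matched, keep walking
            // Mismatch inside the edge: split it (one new interior node, one new leaf)
            Node mid = new Node(child.start, k);
            node.children.put(text.charAt(child.start), mid);
            child.start = k;
            mid.children.put(text.charAt(k), child);
            mid.children.put(text.charAt(pos), new Node(pos, text.length()));
            nodeCount += 2;
            return;
        }
    }

    /** Walks the tree matching edge labels; true if pattern occurs in the text. */
    boolean contains(String pattern) {
        Node node = root;
        int p = 0;
        while (p < pattern.length()) {
            Node child = node.children.get(pattern.charAt(p));
            if (child == null) return false;
            for (int k = child.start; k < child.end && p < pattern.length(); k++, p++) {
                if (text.charAt(k) != pattern.charAt(p)) return false;
            }
            node = child;
        }
        return true;
    }

    int nodeCount() { return nodeCount; }
}
```

For ababc the tree built over ababc$ has 9 nodes (6 leaves, the root, and the interior nodes for ab and b), each holding two integers rather than a copy of its edge string. The unique terminator guarantees that every insertion ends in a new leaf or a split, which is why the inner loops never run off the end of the text.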


Applications: Longest Repeat

As well as searching for a string, we can answer questions such as

What is the longest string that is duplicated?

What is the longest string that occurs k times?

Internal nodes mark repeating substrings

Keep track of the splits, and remember the deepest.

In our example, s1 and s3 share ab
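As a rough sketch of "remember the deepest split", the code below builds an uncompressed suffix trie (not a suffix tree, so it uses quadratic space) and returns the path to the deepest node with at least two branches. The names are hypothetical, $ is assumed not to occur in the input, and when several repeats share the maximum length any one of them may be returned.

```java
import java.util.HashMap;
import java.util.Map;

final class LongestRepeat {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
    }

    /** Returns a longest substring that occurs at least twice (occurrences may overlap). */
    static String longestRepeat(String s) {
        String text = s + "$"; // terminator makes every repeat end at a branching node
        Node root = new Node();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.next.computeIfAbsent(text.charAt(i), c -> new Node());
            }
        }
        String[] best = {""};
        dfs(root, new StringBuilder(), best);
        return best[0];
    }

    // Depth-first walk; path holds the characters from the root to the current node.
    private static void dfs(Node node, StringBuilder path, String[] best) {
        if (node.next.size() >= 2 && path.length() > best[0].length()) {
            best[0] = path.toString(); // a deeper split than any seen so far
        }
        for (Map.Entry<Character, Node> e : node.next.entrySet()) {
            path.append(e.getKey());
            dfs(e.getValue(), path, best);
            path.setLength(path.length() - 1);
        }
    }
}
```

`LongestRepeat.longestRepeat("ababc")` returns "ab", the string shared by s1 and s3, and `longestRepeat("mississippi")` returns "issi".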


Longest Common Substring

Given two strings S and T, find the longest common substring

Build the suffix tree for the string S$T

Mark leaves of suffixes that begin in S red

Mark leaves of suffixes that begin in T black

Make a bottom-up traversal, looking for the lowest (deepest) split that has leaves in both sets
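The marking idea can be sketched with a simplification: instead of a suffix tree for S$T, the code below inserts the suffixes of S and of T into one uncompressed trie and colors every node on the way, so no bottom-up pass is needed. This uses quadratic space and is only meant to illustrate the two-color test; all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class LongestCommonSubstring {
    private static final int RED = 1;   // reached by a suffix of S
    private static final int BLACK = 2; // reached by a suffix of T

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        int colors; // bitwise OR of RED and BLACK
    }

    static String lcs(String s, String t) {
        Node root = new Node();
        insertSuffixes(root, s, RED);
        insertSuffixes(root, t, BLACK);
        String[] best = {""};
        dfs(root, new StringBuilder(), best);
        return best[0];
    }

    private static void insertSuffixes(Node root, String text, int color) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.next.computeIfAbsent(text.charAt(i), c -> new Node());
                node.colors |= color;
            }
        }
    }

    // Only descends into nodes that carry both colors, so every visited path is common to S and T.
    private static void dfs(Node node, StringBuilder path, String[] best) {
        if (path.length() > best[0].length()) {
            best[0] = path.toString();
        }
        for (Map.Entry<Character, Node> e : node.next.entrySet()) {
            if (e.getValue().colors != (RED | BLACK)) continue;
            path.append(e.getKey());
            dfs(e.getValue(), path, best);
            path.setLength(path.length() - 1);
        }
    }
}
```

`LongestCommonSubstring.lcs("ababc", "babca")` returns "babc". In a real suffix tree the same test is applied to interior nodes after propagating leaf colors upward.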


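Applications: Longest Palindrome

Given a string S, find the longest palindrome in S

Build the suffix tree for the string S$S^-1, where S^-1 is S reversed

Mark leaves of suffixes that begin in S red

Mark leaves of suffixes that begin in S^-1 black

Look for the lowest split that has leaves in both sets. A common substring is a palindrome only if its two occurrences sit at mirrored positions in S and S^-1, so the positions must be checked as well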

Linear Time Construction

There is a long history of work

Weiner 1973

McCreight 1976

Ukkonen 1992

(Figure: the suffixes of mississippi: mississippi, ississippi, ssissippi, sissippi, issippi, ssippi, sippi, ippi, ppi, pi, i)

McCreight

Add the suffixes from longest to shortest

We add a termination symbol, such as $, that does not appear in text

This forces each addition to split the existing tree

We can split (add a node and two edges) in constant time

Can we find the place to do the splitting in constant time?

Suffix links give amortized linear time. But first we need to understand the algorithm.

Ukkonen

Online algorithm: we don’t need to know all of the string

Grow all suffixes together. In step k, add S[k] to the end of each suffix

At some point, string s_k will split from the tree (s2 breaks loose in step 2)

After that, s_k will never split again (though something may split from it)

A split for s_k may mean a similar split for s_{k+1}

3 splits when adding c: s3 splits from s1, s4 from s2 and s5 from root

Review

Introduce graphical notation for implicit nodes

aba means both suffixes “a” and “aba” are on edge

ababc$ = s1

babc$ = s2

abc$ = s3

bc$ = s4

c$ = s5

$ = s6

Mississippi

mississippi$ = s1

ississippi$ = s2

ssissippi$ = s3

sissippi$ = s4

issippi$ = s5

ssippi$ = s6

sippi$ = s7

ippi$ = s8

ppi$ = s9

pi$ = s10

i$ = s11

$ = s12

(Figure: the tree after reading miss)

s4 is an implicit node (the red s inside the s3 edge)

s4 is the active path

Def: the active path is the first non-leaf suffix remaining

When we add S[5] = i, active path s4 splits

s5 becomes the active path.

(Figure: the tree after reading missi)

(Figures: the tree after reading missi, missis and mississ)

At the end there are 3 non-leaf suffixes (s5, s6, s7)

Add p. We have never seen p, so all 4 (now 5) trailing suffixes split

s10, at root, becomes the active path

(Figures: the tree after reading mississi and after reading mississip)

About to add a second p.

s10 is the active path, and it is at the root

(Figure: the tree after reading mississipp)

Active path is still s10. It is trailing s9

Add i. This forces a split of s10 from s9. The active path is now s11

(Figure: the tree after reading mississippi)

Algorithm

We are building a tree, adding character S[k] to every suffix

We traverse the boundary path - the growing edge of the tree

Boundary path includes

Suffixes that have already become leaves

Suffixes that currently end in implicit interior nodes

We add character S[k] to the end of each suffix

In general we have O(N) suffixes on the boundary path, we add each of N characters to each suffix on the boundary path, and we must navigate from suffix to suffix, which may be O(N) steps apart.

How can we do this in O(N) time?

Ans: We cheat. Here are three big ideas (will explain each in detail)

1) Once a path has split off, updating it is free, so we ignore it

2) Rather than “walk” the boundary edge as we add a new character, we only need to watch one representative: the active path - the longest suffix that is not yet a leaf

3) When we do need to walk the boundary path there is a cheap way to walk from suffix to suffix, by creating suffix links

Leaves are Cheap

1) Once a path has split off, “updating” it is free

We represent a leaf that splits at character S[k] as the string S[k..whatever]

If some later suffix is following our path, it is up to that suffix to find the point of difference

S5 is following S2, but S2 is a leaf and does not care

We don’t even need to know the length of the string (whatever)


Active Path

2) We can focus our attention on the longest suffix that has not yet broken free, called the active path. This represents the rest of the boundary path

Assume the active path is the suffix S_i and we have just added char S[k]

Assume that S_i is a prefix of suffix S_j up to this point

Then S_{i+1} is a prefix of suffix S_{j+1} and so on

Proof: S_{i+1} is just S_i without character S[i]

The converse is not true.

S_i may leave the tree while S_{i+1} remains in the tree

S_i = S[i..k], S_j = S[j..k], S_{i+1} = S[i+1..k], S_{j+1} = S[j+1..k]

This means that we only need to watch S5
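To check the active path on the running example, here is a brute-force sketch. It does not build a tree and is not Ukkonen's algorithm: it relies only on the observation that, after reading S[1..k], suffix s_i is a leaf exactly when the string S[i..k] occurs nowhere earlier in S[1..k]. The class and method names are hypothetical.

```java
final class ActivePathTrace {
    /**
     * For each prefix length k = 1..n, returns the 1-based index i of the active path s_i,
     * i.e. the first suffix that is not yet a leaf. The value k + 1 means the active path
     * is the empty suffix sitting at the root.
     */
    static int[] activeIndices(String s) {
        int[] result = new int[s.length()];
        for (int k = 1; k <= s.length(); k++) {
            String prefix = s.substring(0, k);
            int i = 1;
            // s_i is a leaf while its current text S[i..k] first occurs at its own position
            while (i <= k && prefix.indexOf(prefix.substring(i - 1)) == i - 1) {
                i++;
            }
            result[k - 1] = i;
        }
        return result;
    }
}
```

For mississippi, `activeIndices` returns 2, 3, 4, 4, 5, 5, 5, 5, 10, 10, 11. This agrees with the walkthrough: s4 is active after miss, s5 takes over when the first i after ss is added, the first p frees everything up to s9 and leaves s10 at the root, and the final i makes s11 the active path. Because the index never decreases, watching this one suffix is enough.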

Review example

S5 is a prefix of S2

(Figures: the tree after reading mississi and after reading mississip)

Add p. We have never seen p, so s5, s6, s7, s8 and s9 all split.

s10, which is currently at the root, becomes the new active path

Suffix Links

3) There is a cheap way to walk the boundary path

Once the active path splits, we need to walk the boundary path until splitting stops

To explain the suffix link, return to our view as a trie for ababc

We have inserted s[1] through s[4], about to insert s[5] = c

s1 points to s2, which points to s3, which points to s4, which points to root

I know I will have no problems with leaves s1 and s2 : active path is s3

When I find that s3 needs to split from s1, I need to check s4 as well, and perhaps s5

I follow the suffix pointers from s3
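The chain of suffix pointers in the trie view can be sketched as follows. As an assumption made purely for illustration, each trie node is identified by its path label (a substring of the text) instead of being a linked object, so the suffix link of a node is simply its label with the first character dropped. A real implementation stores one pointer per node; the names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SuffixLinkChain {
    /** Path labels of all nodes in the suffix trie of text; the empty string is the root. */
    static Set<String> trieNodes(String text) {
        Set<String> nodes = new HashSet<>();
        nodes.add("");
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                nodes.add(text.substring(start, end));
            }
        }
        return nodes;
    }

    /** The suffix link of the node labeled x + alpha points to the node labeled alpha. */
    static String suffixLink(String label) {
        return label.substring(1);
    }

    /** Follows suffix links from the node with the given label down to the root. */
    static List<String> chainFrom(String text, String label) {
        Set<String> nodes = trieNodes(text);
        List<String> chain = new ArrayList<>();
        String current = label;
        while (true) {
            if (!nodes.contains(current)) {
                throw new IllegalArgumentException("not a node of the trie: " + current);
            }
            chain.add(current);
            if (current.isEmpty()) {
                return chain; // reached the root
            }
            current = suffixLink(current);
        }
    }
}
```

After inserting the first four characters of ababc, `chainFrom("abab", "abab")` visits abab, bab, ab, b and then the root, which is the chain s1, s2, s3, s4, root. Starting instead at "ab" (the active path s3) visits only ab, b and the root, the suffixes that still need checking when c arrives.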

Accounting

I add one character at a time to one suffix - the active pathThis is clearly linear

When the active path splits, I need to start walking the boundary path from the old active path to the new end point (the point where the splitting stops)

Any individual character may cause lots of splitting, but each suffix only splits once. Amortized cost is linear

To walk the boundary path, I update the suffix links. This can also be amortized.

Building Suffix Links

When we split, we need to add new nodes

These nodes will need new suffix links

We are showing a chain of suffix links

Canonize

We represent a suffix as an explicit node and a (growing) string of characters

Start with (n1 (a))

Add characters bbac to get (n1 (abbac))

We canonize this in a sequence of steps to get a better representation

(n2 (bac))

(n3 (c))

This allows us to use the suffix link at n3 rather than the suffix link at n1

Post mortem

Algorithm to build Suffix Tree is linear in time and space.

We haven’t proved this, but perhaps it is now plausible

But is the algorithm practical?

There are real issues when dealing with long strings

The human genome has about 3 billion base pairs

Keeping the suffix links updated can cause thrashing as we walk all over the suffix tree representing this

The suffix tree is important enough that people are working on the issue

One idea that is easy to describe: merging suffix trees

References

A great reference to the field is Dan Gusfield’s Algorithms on Strings, Trees, and Sequences

P. Weiner (1973). "Linear pattern matching algorithm". 14th Annual IEEE Symposium on Switching and Automata Theory, 1-11.

E. M. McCreight (1976). "A Space-Economical Suffix Tree Construction Algorithm". Journal of the ACM 23 (2): 262-272.

E. Ukkonen (1995). "On-line construction of suffix trees". Algorithmica 14 (3): 249-260.

R. Giegerich and S. Kurtz (1997). "From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction". Algorithmica 19 (3): 331-353.

D. Gusfield (1997, reprinted 1999). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press.
