1
Suffix Trees
© Jeff Parker, 2009
Post on 18-Dec-2015
2
Outline
An introduction to the Suffix Tree
Some sample applications
How to build a Suffix Tree efficiently
3
Problems
We have a corpus of information: genes, proteins
Want to see what two sequences have in common
Want to be able to find matches for a gene or protein
Model this as a search for a pattern in a text
Problem is hard because
    Strings are very long
    The set of possible matches is large
Today we will focus on exact matches
4
Pattern Matching
The basis for the simplest (exact) pattern match follows
Algorithm
Line up text and pattern
Compare the two
If they match
    Report the position of match
Else
    Slide pattern to right and try again
[Figure: the pattern aligned beneath the text]
5
Compare pattern at this position
// Does the pattern match the text at this position?
boolean compare(String text, int pos, String pattern)
{
    for (int i = 0; i < pattern.length(); i++)
        if (text.charAt(pos + i) != pattern.charAt(i))
            return false;
    return true;
}
6
Simple Pattern Match
// Where is pattern pat in string text?
int findMatch(String text, String pat)
{
    int pos = 0;
    while (pos <= text.length() - pat.length())
    {
        if (compare(text, pos, pat))
            return pos;
        pos++;  // Slide pattern right one space
    }
    return -1;  // No match anywhere in text
}
7
Analysis
For a pattern of length N and a text of length M
This algorithm behaves well in practice: O(N + M)
The worst case is bad: O(NM)
We can do better if we preprocess
Preprocess Pattern: Boyer-Moore, Knuth, Morris, Pratt
Preprocess text: Suffix Tree
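To make the gap between the typical and the worst case concrete, here is a minimal sketch that counts the character comparisons made by the slide-and-compare search. The class name NaiveMatchCost and the sample strings are illustrative assumptions, not part of the original slides.

```java
final class NaiveMatchCost {
    /** Counts character comparisons made by the slide-and-compare search. */
    static long comparisons(String text, String pat) {
        long count = 0;
        for (int pos = 0; pos <= text.length() - pat.length(); pos++) {
            int i = 0;
            while (i < pat.length()) {
                count++;                                   // one comparison
                if (text.charAt(pos + i) != pat.charAt(i)) break;
                i++;
            }
            if (i == pat.length()) return count;           // first match found
        }
        return count;
    }

    public static void main(String[] args) {
        // Typical case: most alignments fail on the first character.
        System.out.println(comparisons("the quick brown fox", "fox"));
        // Bad case: every alignment matches N-1 characters before failing.
        System.out.println(comparisons("aaaaaaaaaa", "aaab"));
    }
}
```

In the second call every one of the 7 alignments costs N = 4 comparisons, which is the O(NM) behavior; in the first call the total stays close to M.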
9
Faster Pattern Matching
Is our pattern the prefix of a suffix of the text string S?
Take all suffixes…
14
Suffix Trie
Multi-way tree
Each branch is labeled with char
If the trie is ready, match takes O(|pattern|) time
Example: text S is ababc
s1 = ababc
s2 = babc
s3 = abc
s4 = bc
s5 = c
[Figure: the suffix trie for ababc, one character per branch, with leaves labeled 1 to 5]
15
Suffix Trie
Suffix trie takes O(|S|^2) space
Each step of search for match takes constant time
If no branch matches char, we fail
Leaf holds name of suffix
We may have multiple matches
String ab occurs twice: it is a prefix of s1 and s3
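The trie described above can be sketched in a few lines. This is an illustration under assumptions: the class SuffixTrie and its method find are hypothetical names, and the quadratic construction is the plain one, not an efficient build.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SuffixTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int suffix = 0;        // 1-based name of the suffix ending here, 0 if none
    }

    private final Node root = new Node();

    /** Inserts every suffix of text, one character per branch: O(|S|^2) time and space. */
    SuffixTrie(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++)
                node = node.children.computeIfAbsent(text.charAt(i), c -> new Node());
            node.suffix = start + 1;
        }
    }

    /** Names of all suffixes that have pattern as a prefix, in increasing order. */
    List<Integer> find(String pattern) {
        Node node = root;
        for (int i = 0; i < pattern.length(); i++) {       // O(|pattern|) steps
            node = node.children.get(pattern.charAt(i));
            if (node == null) return new ArrayList<>();    // no branch matches: fail
        }
        List<Integer> hits = new ArrayList<>();
        collect(node, hits);
        Collections.sort(hits);
        return hits;
    }

    private static void collect(Node node, List<Integer> hits) {
        if (node.suffix != 0) hits.add(node.suffix);
        for (Node child : node.children.values()) collect(child, hits);
    }

    public static void main(String[] args) {
        System.out.println(new SuffixTrie("ababc").find("ab"));   // prints [1, 3]
    }
}
```

Walking down to the node for the pattern takes O(|pattern|) steps; listing the matches then costs time proportional to the subtree below that node.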
16
Suffix Tree
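Nodes that mark a split are called essential
Remove non-essential nodes, and label edges with strings
[Figure: the suffix trie for ababc collapsed into a suffix tree. The root has edges ab, b and c; below ab and below b there are edges abc and c; the leaves are 1, 3, 2, 4 and 5]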
17
Properties
Tree has
|S| leaves (one per suffix, when no suffix is a prefix of another)
At most |S|-1 interior nodes
At most 2|S|-1 nodes, and so at most 2|S|-2 edges
Algorithm for search is the same: walk the tree matching edges
While this has fewer nodes, it is not clear that we gain anything:
Any less storage? Sum of lengths of the edge strings can still be O(N^2)
Any speedup building the tree?
Storage is easier to address
18
Worst Case Storage
Here are some strings whose suffixes need O(N^2) storage when stored as a trie
abcdefg
We can get a trie that needs O(N^2) storage with a limited alphabet:
a^n b^n a^n b^n c
[Figure: the tree for abcdefg, with edges abcdefg, bcdefg, cdefg, defg and efg leading to leaves 1 to 5]
19
Efficient Storage
We store the whole string once, and keep pointers to that string in nodes
We have constant space per node and O(|S|) nodes, thus linear space
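[Figure: the suffix tree for ababc with each edge label replaced by a pair of indices into the string a(1) b(2) a(3) b(4) c(5): ab = (1, 2), b = (2, 2), abc = (3, 5), c = (5, 5). Nodes are linked by child and sibling pointers]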
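A minimal sketch of such a node, assuming 1-based inclusive index pairs and child/sibling pointers as in the figure. The class IndexedNode and the hand-built tree for ababc are illustrative assumptions, not code from the slides.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexedNode {
    final int first, last;          // edge label is text[first..last], 1-based inclusive
    IndexedNode child, sibling;     // leftmost child and next sibling

    IndexedNode(int first, int last) { this.first = first; this.last = last; }

    /** Recovers the edge label from the single stored copy of the text. */
    String label(String text) { return text.substring(first - 1, last); }

    /** Chains the given nodes as this node's children, left to right. */
    IndexedNode add(IndexedNode... kids) {
        for (int i = 0; i + 1 < kids.length; i++) kids[i].sibling = kids[i + 1];
        child = kids.length > 0 ? kids[0] : null;
        return this;
    }

    /** The suffix tree for ababc, written out by hand with the index pairs from the figure. */
    static IndexedNode treeForAbabc() {
        IndexedNode ab = new IndexedNode(1, 2).add(new IndexedNode(3, 5), new IndexedNode(5, 5));
        IndexedNode b = new IndexedNode(2, 2).add(new IndexedNode(3, 5), new IndexedNode(5, 5));
        IndexedNode c = new IndexedNode(5, 5);
        return new IndexedNode(1, 0).add(ab, b, c);   // (1, 0) gives the root an empty label
    }

    /** Spells out every root-to-leaf path, left to right. */
    static List<String> suffixes(IndexedNode root, String text) {
        List<String> out = new ArrayList<>();
        walk(root, "", text, out);
        return out;
    }

    private static void walk(IndexedNode node, String prefix, String text, List<String> out) {
        String path = prefix + node.label(text);
        if (node.child == null) out.add(path);
        for (IndexedNode k = node.child; k != null; k = k.sibling) walk(k, path, text, out);
    }

    public static void main(String[] args) {
        System.out.println(suffixes(treeForAbabc(), "ababc"));
    }
}
```

Each node holds two integers and two pointers regardless of how long its edge label is, which is where the constant space per node comes from.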
20
Applications: Longest Repeat
As well as searching for a string, we can answer questions such as
What is the longest string that is duplicated?
What is the longest string that occurs k times?
Internal nodes mark repeating substrings
Keep track of the splits, and remember the deepest.
In our example ababc, s1 and s3 share ab
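The idea of remembering the deepest point shared by several suffixes can be sketched on the uncompressed trie, which keeps the code short at the price of O(N^2) space. The class LongestRepeat and the per-node counter are assumptions made for illustration; a real implementation would walk the compressed suffix tree instead.

```java
import java.util.HashMap;
import java.util.Map;

final class LongestRepeat {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count;                 // number of suffixes that pass through this node
    }

    /** Longest substring of text that occurs at least k times (k = 2 for a plain repeat). */
    static String longestRepeat(String text, int k) {
        Node root = new Node();
        String best = "";
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.computeIfAbsent(text.charAt(i), c -> new Node());
                node.count++;      // one more occurrence of text[start..i]
                if (node.count >= k && i - start + 1 > best.length())
                    best = text.substring(start, i + 1);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestRepeat("ababc", 2));         // prints ab
        System.out.println(longestRepeat("mississippi", 2));   // prints issi
    }
}
```

Occurrences are allowed to overlap, as the two copies of issi in mississippi do.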
21
Longest Common Substring
Given two strings S and T, find the longest common substring
Build the suffix tree for the string S$T
Mark leaves of suffixes that begin in S red
Mark leaves of suffixes that begin in T black
Make a bottom-up traversal, looking for the deepest split that has leaves in both sets
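A minimal sketch of the red/black marking, again on the uncompressed trie of S$T for brevity. The class LongestCommonSubstring is a hypothetical name, the marks are set on the way down rather than in a bottom-up pass, and the code assumes that $ occurs in neither string.

```java
import java.util.HashMap;
import java.util.Map;

final class LongestCommonSubstring {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean red;      // some suffix that begins in S passes through here
        boolean black;    // some suffix that begins in T passes through here
    }

    static String longestCommon(String s, String t) {
        String text = s + "$" + t;                 // assumes '$' is in neither s nor t
        Node root = new Node();
        String best = "";
        for (int start = 0; start < text.length(); start++) {
            if (start == s.length()) continue;     // skip the suffix that begins at '$'
            boolean fromS = start < s.length();
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.computeIfAbsent(text.charAt(i), c -> new Node());
                if (fromS) node.red = true; else node.black = true;
                // A node carrying both marks spells a string found in S and in T.
                if (node.red && node.black && i - start + 1 > best.length())
                    best = text.substring(start, i + 1);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestCommon("ababc", "babca"));   // prints babc
    }
}
```

No suffix that begins in T contains $, so a node with both marks can never spell a string that crosses the separator.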
22
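Applications: Longest Palindrome
Given a string S, find the longest palindrome in S
Build the suffix tree for the string S$S^-1, where S^-1 is S reversed
Mark leaves of suffixes that begin in S red
Mark leaves of suffixes that begin in S^-1 black
Look for the deepest split that has leaves in both sets
Check that the two occurrences are mirror images in position: a substring common to S and S^-1 is not always a palindrome (for S = abxyba the longest common substring is ab)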
25
Linear Time Construction
There is a long history of work
[Figure: the suffixes of mississippi: mississippi, ississippi, ssissippi, sissippi, issippi, ssippi, sippi, ippi, ppi, pi, i]
Weiner 1973
McCreight 1976
Ukkonen 1992
26
McCreight
Add the suffixes from longest to shortest
We add a termination symbol, such as $, that does not appear in text
This forces each addition to split the existing tree
We can split (add a node and two edges) in constant time
Can we find the place to do the splitting in constant time?
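Suffix links give amortized linear time. But first, understand the algorithm.
[Figure: building the tree for ababc. Insert suffix 1 (ababc), then suffix 2 (babc), then suffix 3 (abc), which splits the edge ababc into ab followed by abc and c]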
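The insertion order and the constant-time split can be sketched without suffix links. This is a simplification, not McCreight's algorithm: the class NaiveSuffixTree is a hypothetical name, and each insertion finds its split point by walking down from the root, so the whole build is O(N^2).

```java
import java.util.HashMap;
import java.util.Map;

final class NaiveSuffixTree {
    static final class Node {
        int start, end;              // edge label is text[start, end), 0-based
        int suffix = -1;             // 1-based suffix name for a leaf, -1 otherwise
        final Map<Character, Node> children = new HashMap<>();
        Node(int start, int end) { this.start = start; this.end = end; }
    }

    final String text;
    final Node root = new Node(0, 0);

    NaiveSuffixTree(String s) {
        text = s + "$";              // assumes '$' does not occur in s
        for (int i = 0; i < text.length(); i++) insert(i);   // longest suffix first
    }

    private void insert(int suffixStart) {
        Node node = root;
        int pos = suffixStart;
        while (true) {
            Node child = node.children.get(text.charAt(pos));
            if (child == null) {                       // nothing to share: new leaf here
                node.children.put(text.charAt(pos), leaf(pos, suffixStart));
                return;
            }
            int k = child.start;
            while (k < child.end && text.charAt(k) == text.charAt(pos)) { k++; pos++; }
            if (k == child.end) { node = child; continue; }   // whole edge matched
            // Mismatch inside the edge: add one node and two edges.
            Node mid = new Node(child.start, k);
            node.children.put(text.charAt(child.start), mid);
            child.start = k;
            mid.children.put(text.charAt(k), child);
            mid.children.put(text.charAt(pos), leaf(pos, suffixStart));
            return;
        }
    }

    private Node leaf(int pos, int suffixStart) {
        Node leaf = new Node(pos, text.length());
        leaf.suffix = suffixStart + 1;
        return leaf;
    }

    boolean contains(String pat) {
        Node node = root;
        int i = 0;
        while (i < pat.length()) {
            Node child = node.children.get(pat.charAt(i));
            if (child == null) return false;
            for (int k = child.start; k < child.end && i < pat.length(); k++, i++)
                if (text.charAt(k) != pat.charAt(i)) return false;
            node = child;
        }
        return true;
    }

    public static void main(String[] args) {
        NaiveSuffixTree tree = new NaiveSuffixTree("ababc");
        System.out.println(tree.contains("bab") + " " + tree.contains("abb"));
    }
}
```

Because of the $ terminator, every insertion ends in a mismatch, so each suffix creates exactly one new leaf. The split itself touches a fixed number of nodes; only the walk down from the root is slow, and that is the part suffix links are meant to shorten.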
27
Ukkonen
Online algorithm: we don’t need to know all of the string
Grow all suffixes together. In step k, add S[k] to the end of each suffix
At some point, suffix sk will split from the tree (s2 breaks loose in step 2)
After that, sk will never split again (though something may split from it)
A split for sk may mean a similar split for sk+1
3 splits when adding c: s3 splits from s1, s4 from s2 and s5 from root
[Figure: the tree after each step of reading ababc: a...; ab... and b...; aba.. and ba..; abab. and bab.; and finally the complete tree with edges ab, b, c, abc and c and leaves 1 to 5]
28
Review
Introduce graphical notation for implicit nodes
aba means both suffixes “a” and “aba” are on the edge
[Figure: the step-by-step trees for ababc, redrawn with implicit nodes marked]
ababc$ = s1
babc$ = s2
abc$ = s3
bc$ = s4
c$ = s5
$ = s6
29
Mississippi
mississippi$ = s1
ississippi$ = s2
ssissippi$ = s3
sissippi$ = s4
issippi$ = s5
ssippi$ = s6
sippi$ = s7
ippi$ = s8
ppi$ = s9
pi$ = s10
i$ = s11
$ = s12
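[Figure: the tree after reading m, mi, mis and miss. Edges m...; then mi... and i...; then mis..., is... and s...; then miss..., iss... and ss...; leaves 1, 2 and 3]
s4 is an implicit node
s4 is the active path
Def: the active path is the first suffix that is not yet a leaf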
30
Mississippi
s4 is an implicit node (red s in s3 edge)
s4 is the active path
Def: First non-leaf suffix remaining
When we add S[5] = i, active path s4 splits
s5 becomes the active path.
[Figure: before, edges miss..., iss... and ss... with leaves 1, 2, 3; after, edges missi..., issi... and s, where s has children si... and i..., with leaves 1, 2, 3, 4]
31
Mississippi
s5 is the active path
(First non-leaf suffix remaining)
At end there are 3 non-leaf suffixes (s5, s6, s7)
[Figure: the tree after reading missi, missis and mississ; leaves 1 to 4]
32
Mississippi
Add i
Add p. Have never seen p, so all 4 (now 5) trailing suffixes split
s10, at root, becomes active path
[Figure: the tree for mississi... with leaves 1 to 4, and the tree for mississip... with leaves 1 to 9]
33
Mississippi
Redraw last diagram. About to add a second p.
s10 is active path, and it is at root
[Figure: the tree for mississip, leaves 1 to 9]
34
Mississippi
Active path is still s10. It is trailing s9
[Figure: the tree for mississipp, leaves 1 to 9]
35
Mississippi
Add i. Forces split of s10 from s9. Active path is now s11
[Figure: the tree for mississippi, leaves 1 to 10]
36
Algorithm
We are building a tree, adding character S[k] to every suffix
We traverse the boundary path - the growing edge of the tree
Boundary path includes
    Suffixes that have already become leaves
    Suffixes that currently end in implicit interior nodes
We add character S[k] to the end of each suffix
In general we have O(N) suffixes on the boundary path, we add each of N characters to each suffix on the boundary path, and we must navigate from suffix to suffix, which may be O(N) steps apart.
How can we do this in O(N) time?
37
Algorithm
We have O(N) suffixes on boundary path,
We add each of N characters to each suffix on the boundary path,
We navigate from suffix to suffix, which may be O(N) steps apart.
How can we do this in O(N) time?
Ans: We cheat. Here are three big ideas (will explain each in detail)
1) Once a path has split off, updating it is free, so we ignore it
2) Rather than “walk” the boundary edge as we add a new character, we only need to watch one representative: the active path - the longest suffix that is not yet a leaf
3) When we do need to walk the boundary path there is a cheap way to walk from suffix to suffix, by creating suffix links
38
Leaves are Cheap
1) Once a path has split off, “updating” it is free
We represent a leaf that splits at character S[k] as the string S[k..whatever]
If some later suffix is following our path, it is up to that suffix to find the point of difference
s5 is following s2, but s2 is a leaf and does not care
We don’t even need to know the length of the string (whatever)
[Figure: the tree for mississi..., leaves 1 to 4]
39
Active Path
2) We can focus our attention on the longest suffix that has not yet broken free, called the active path. This represents the rest of the boundary path
Assume the active path is the suffix Si and we have just added char S[k]
Assume that Si is a prefix of suffix Sj up to this point
Then Si+1 is a prefix of suffix Sj+1, and so on
Proof: Si+1 is just Si without character S[i]
The converse is not true.
Si may leave the tree while Si+1 remains in the tree
Si = S[i..k]
Sj = S[j..k]
Si+1 = S[i+1..k]
Sj+1 = S[j+1..k]
This means that we only need to watch s5
[Figure: the tree for mississi..., leaves 1 to 4]
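The active path can be checked by brute force, which makes the mississippi walkthrough easy to reproduce. This sketch relies on one observation: a suffix of the text read so far is a leaf exactly when it occurs nowhere earlier in that text. The class ActivePathDemo is a hypothetical name and the quadratic scan is for illustration only.

```java
final class ActivePathDemo {
    /** 1-based index i of the active path s_i after the given prefix has been read. */
    static int activePath(String prefix) {
        for (int j = 0; j < prefix.length(); j++) {
            String suffix = prefix.substring(j);
            // The suffix also occurs earlier, so it still sits inside the tree: not a leaf.
            if (prefix.indexOf(suffix) < j) return j + 1;
        }
        return prefix.length() + 1;   // every suffix is a leaf: the active path is at the root
    }

    public static void main(String[] args) {
        String s = "mississippi";
        for (int k = 1; k <= s.length(); k++) {
            String prefix = s.substring(0, k);
            System.out.println(prefix + " -> s" + activePath(prefix));
        }
    }
}
```

After miss the active path is s4, after mississi it is s5, after mississip it is s10 at the root, and after the final i it is s11, matching the slides.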
40
Review example
Add p. Have never seen p, so s5, s6, s7, s8 and s9 all split.
s10, which is currently at the root, becomes the new active path
s5 is a prefix of s2
s6 is a prefix of s3
s7 is a prefix of s4
s8 is a prefix of s5
[Figure: the tree for mississi... with leaves 1 to 4, and the tree for mississip... with leaves 1 to 9]
41
Suffix Links
3) There is a cheap way to walk the boundary path
Once the active path splits, we need to walk the boundary path until splitting stops
To explain the suffix link, return to our view as a trie for ababc
We have inserted S[1] through S[4], and are about to insert S[5] = c
s1 points to s2, which points to s3, which points to s4, which points to root
I know I will have no problems with leaves s1 and s2: active path is s3
When I find that s3 needs to split from s1, I need to check s4 as well, and perhaps s5
I follow the suffix pointers from s3
[Figure: the trie for abab, with suffix links from s1 to s2 to s3 to s4 to root]
42
Accounting
I add one character at a time to one suffix - the active path
This is clearly linear
When the active path splits, I need to start walking the boundary path from the old active path to the new end path (the point where the splitting stops)
Any individual character may cause lots of splitting, but each suffix only splits once. Amortized cost is linear
To walk the boundary path, I update the suffix links. This can also be amortized.
[Figure: the trie for abab with its suffix links, before and after adding c]
43
Building Suffix Links
When we split, we need to add new nodes
These nodes will need new suffix links
We are showing a chain of suffix links
51
Canonize
We represent a suffix as an explicit node and a (growing) string of characters
Start with (n1 (a))
Add characters bbac to get (n1 (abbac))
We canonize this in a sequence of steps to get a better representation
(n2 (bac))
(n3 (c))
This allows us to use the suffix link at n3 rather than the suffix link at n1
[Figure: explicit nodes n1, n2, n3 and n4, with edge labels abba, c and a]
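The pieces described so far (open leaf edges, a single active path, suffix links and canonizing) can be put together in a compact sketch. This follows a widely used textbook formulation of Ukkonen's algorithm, not code from these slides. The class UkkonenSketch, its field names and the OPEN sentinel are assumptions for illustration, and the caller is expected to append a terminator such as $.

```java
import java.util.HashMap;
import java.util.Map;

final class UkkonenSketch {
    static final int OPEN = Integer.MAX_VALUE;   // a leaf edge ends at "whatever"

    static final class Node {
        int start, end;                          // edge label is text[start, end)
        Node link;                               // suffix link; null is treated as root
        final Map<Character, Node> children = new HashMap<>();
        Node(int start, int end) { this.start = start; this.end = end; }
    }

    final String text;
    final Node root = new Node(0, 0);
    private Node activeNode = root;              // active path = activeNode plus activeLength chars
    private int activeEdge = 0, activeLength = 0;
    private int remaining = 0;                   // suffixes that are not yet leaves
    private int pos = -1;                        // index of the last character read

    UkkonenSketch(String text) {
        this.text = text;
        for (int i = 0; i < text.length(); i++) extend(i);
    }

    private int edgeLength(Node n) { return Math.min(n.end, pos + 1) - n.start; }

    private void extend(int i) {
        pos = i;                                 // every open leaf grows for free
        remaining++;
        char c = text.charAt(i);
        Node lastSplit = null;                   // node still waiting for its suffix link
        while (remaining > 0) {
            if (activeLength == 0) activeEdge = i;
            Node next = activeNode.children.get(text.charAt(activeEdge));
            if (next == null) {                  // no edge: hang a new leaf on activeNode
                activeNode.children.put(text.charAt(activeEdge), new Node(i, OPEN));
                if (lastSplit != null) { lastSplit.link = activeNode; lastSplit = null; }
            } else {
                int len = edgeLength(next);
                if (activeLength >= len) {       // canonize: move down to a closer node
                    activeEdge += len; activeLength -= len; activeNode = next;
                    continue;
                }
                if (text.charAt(next.start + activeLength) == c) {
                    if (lastSplit != null) lastSplit.link = activeNode;
                    activeLength++;              // c is already there: splitting stops
                    break;
                }
                Node split = new Node(next.start, next.start + activeLength);
                activeNode.children.put(text.charAt(activeEdge), split);
                split.children.put(c, new Node(i, OPEN));
                next.start += activeLength;
                split.children.put(text.charAt(next.start), next);
                if (lastSplit != null) lastSplit.link = split;
                lastSplit = split;
            }
            remaining--;
            if (activeNode == root && activeLength > 0) {
                activeLength--;
                activeEdge = i - remaining + 1;
            } else if (activeNode != root) {     // follow the suffix link to the next suffix
                activeNode = (activeNode.link != null) ? activeNode.link : root;
            }
        }
    }

    boolean contains(String pat) {
        Node node = root;
        int i = 0;
        while (i < pat.length()) {
            Node next = node.children.get(pat.charAt(i));
            if (next == null) return false;
            int stop = Math.min(next.end, text.length());
            for (int k = next.start; k < stop && i < pat.length(); k++, i++)
                if (text.charAt(k) != pat.charAt(i)) return false;
            node = next;
        }
        return true;
    }

    int leaves() { return leaves(root); }

    private static int leaves(Node n) {
        if (n.children.isEmpty()) return 1;
        int total = 0;
        for (Node child : n.children.values()) total += leaves(child);
        return total;
    }
}
```

A call such as new UkkonenSketch("mississippi$") builds the tree in one left-to-right pass; contains then answers substring queries in time proportional to the pattern length.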
52
Post mortem
Algorithm to build Suffix Tree is linear in time and space.
We haven’t proved this, but perhaps it is now plausible
But is the algorithm practical?
There are real issues when dealing with long strings
The human genome has about 3 billion base pairs
Keeping the suffix links updated can cause thrashing as we walk all over the suffix tree representing this
The suffix tree is important enough that people are working on the issue
One idea that is easy to describe: merging suffix trees
53
References
A great reference to the field is Dan Gusfield’s Algorithms on Strings, Trees, and Sequences
P. Weiner (1973). "Linear pattern matching algorithms". 14th Annual IEEE Symposium on Switching and Automata Theory, 1-11.
Edward M. McCreight (1976). "A Space-Economical Suffix Tree Construction Algorithm". Journal of the ACM 23 (2): 262-272.
E. Ukkonen (1995). "On-line construction of suffix trees". Algorithmica 14 (3): 249-260.
R. Giegerich and S. Kurtz (1997). "From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction". Algorithmica 19 (3): 331-353.
Gusfield, Dan [1997] (1999). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press.