CSC 448: Bioinformatics Algorithms
Alex Dekhtyar
Ukkonen’s Algorithm for Generalized Suffix Trees
Example for two DNA sequences: T and T’=reverse(complement(T))
T = AATGTT
T’ = AACATT
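The relationship between the two example strings can be checked with a few lines of Python. This is a minimal sketch; the names `COMPLEMENT` and `reverse_complement` are illustrative and do not come from the slides.

```python
# Watson-Crick base pairing for DNA.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Complement every base of seq, reading it from right to left."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

T = "AATGTT"
T_prime = reverse_complement(T)
print(T_prime)  # AACATT
```

Applying the function twice returns the original string, so T and T’ are each other's reverse complement.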
Steps
1. Create SuffixTree(T$) using Ukkonen’s algorithm. Keep suffix links.
2. Add “T:” to all leaf labels (to designate the current labels).
3. Traverse SuffixTree(T$) using the prefix of T’. The stoppage point is the new active point.
4. Use Ukkonen’s algorithm to insert the remainder of T’.
   4.1. Label leaves “T’: [x, ∞]”.
   4.2. Modification: traverse to existing leaves to leave a label.
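The traversal in step 3 can be illustrated on a naive suffix trie. The sketch below is not Ukkonen’s algorithm: it builds the trie in quadratic time from nested dictionaries, purely to show where the walk along T’ stops. The function names are hypothetical.

```python
def build_suffix_trie(text: str) -> dict:
    """Naive O(n^2) suffix trie as nested dicts; a stand-in for SuffixTree(T$)."""
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root

def longest_matching_prefix(trie: dict, pattern: str) -> str:
    """Walk from the root along pattern until no child matches."""
    node = trie
    matched = []
    for ch in pattern:
        if ch not in node:
            break
        node = node[ch]
        matched.append(ch)
    return "".join(matched)

trie = build_suffix_trie("AATGTT$")
prefix = longest_matching_prefix(trie, "AACATT")
print(prefix)                  # AA
print("AACATT"[len(prefix):])  # CATT
```

For the example strings the walk stops after AA, so the remainder that step 4 has to insert is CATT.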
T = AATGTT, T’ = AACATT
[Figure: two panels, Tree and Trie, both empty. Each contains only the nodes ε and ┴.]
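Step 1: insert first string
T = AATGTT, T’ = AACATT
[Figure: Tree and Trie panels. Besides ε and ┴, the trie contains one node per distinct substring of T: A, AA, AAT, AATG, AATGT, AATGTT, T, AT, ATG, ATGT, ATGTT, TG, TGT, TGTT, TT, G, GT, GTT.]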
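The node list of the trie panel can be reproduced by enumerating the distinct substrings of T, since a suffix trie has one node (other than the root) per distinct substring. A minimal sketch, with `trie_nodes` as an illustrative name:

```python
def trie_nodes(text: str) -> set[str]:
    """Every distinct non-empty substring of text, i.e. every non-root trie node."""
    n = len(text)
    return {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}

nodes = trie_nodes("AATGTT")
print(len(nodes))  # 18
print(sorted(nodes, key=lambda s: (len(s), s)))
```

The count of 18 matches the number of labels listed for the trie of AATGTT.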
Step 1: insert first string
T = AATGTT, T’ = AACATT
[Figure: Tree and Trie panels after inserting T. The last boundary path and the last active point are marked.]
Step 1: insert first string
T = AATGTT, T’ = AACATT
[Figure: Tree and Trie panels. The tree now shows open-ended leaf labels 2,∞; 3,∞; 4,∞; 4,∞; and 6,∞ on edges labeled with characters of T. The last boundary path and the last active point are marked in the trie, and the last active point is marked in the tree.]
Step 1: insert first string
Step 1.5: finish the tree
T = AATGTT$, T’ = AACATT
[Figure: Tree and Trie panels. Appending the terminator $ adds two leaf labels 7,∞ to the tree, on edges labeled $. The existing leaf labels 2,∞; 3,∞; 4,∞; 4,∞; and 6,∞ are unchanged.]
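The leaf labels of the form x,∞ can be read as "from position x to wherever the input currently ends". A minimal sketch of that idea follows; the `OpenEnd` class and `leaf_label` helper are assumptions made for illustration, and the start positions are the ones shown in the tree figure.

```python
class OpenEnd:
    """Shared, growing right endpoint that stands in for the symbol ∞."""
    def __init__(self) -> None:
        self.value = 0

def leaf_label(text: str, start: int, end: OpenEnd) -> str:
    """Resolve a 1-based leaf label [start, ∞] against the text read so far."""
    return text[start - 1:end.value]

T = "AATGTT$"
end = OpenEnd()
leaf_starts = [2, 3, 4, 4, 6]  # leaf labels in the tree for AATGTT, before '$'

end.value = 6                  # all of AATGTT has been read
before = [leaf_label(T, s, end) for s in leaf_starts]
print(before)  # ['ATGTT', 'TGTT', 'GTT', 'GTT', 'T']

end.value = 7                  # reading '$' extends every leaf at once
after = [leaf_label(T, s, end) for s in leaf_starts]
print(after)   # ['ATGTT$', 'TGTT$', 'GTT$', 'GTT$', 'T$']
```

Because every leaf shares the same endpoint object, reading one more character costs a single update rather than one update per leaf.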
Step 2: Traverse the prefix of T’
T = AATGTT$, T’ = AACATT
[Figure: Tree and Trie panels. The point where the traversal of T’ stops is marked as the new active point.]
Step 2: Traverse the prefix of T’
Step 3: Start inserting the rest of T’
T = AATGTT$, T’ = AACATT
[Figure: Tree and Trie panels with the active point marked. The trie gains the nodes AAC, AC, and C.]
Step 2: Traverse the prefix of T’
Step 3: Start inserting the rest of T’
T = AATGTT$, T’ = AACATT
Make leaf nodes “generalized”
[Figure: Tree and Trie panels. Every existing leaf label in the tree gains the prefix “T:”, giving T:2,∞; T:3,∞; T:4,∞; T:4,∞; T:6,∞; T:7,∞; and T:7,∞.]
Step 2: Traverse the prefix of T’
Step 3: Start inserting the rest of T’
T = AATGTT$, T’ = AACATT
[Figure: Tree and Trie panels. The tree gains three new leaves labeled T’:3,∞, each on an edge labeled C, matching the new trie nodes AAC, AC, and C.]
Step 2: Traverse the prefix of T’
Step 3: Start inserting the rest of T’
T = AATGTT$, T’ = AACATT
[Figure: Tree and Trie panels. The trie gains the nodes AACA, ACA, and CA. The end point is marked.]
Nothing to do!
Step 2: Traverse the prefix of T’
Step 3: Start inserting the rest of T’
T = AATGTT$, T’ = AACATT
[Figure: Tree and Trie panels. The trie gains the nodes AACAT, ACAT, and CAT. The end point is marked.]
Nothing to do!
Step 2: Traverse the prefix of T’
Step 3: Start inserting the rest of T’
T = AATGTT$, T’ = AACATT
[Figure: Tree and Trie panels. The trie gains the node ATT, and the tree gains a new leaf labeled T’:6,∞.]
Step 2: Traverse the prefix of T’
Step 3: Start inserting the rest of T’
T = AATGTT$, T’ = AACATT
[Figure: Tree and Trie panels in the same state as the previous slide, with a second label T’:6,∞ shown in the tree.]
Crucial bit coming!
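The trie nodes added while T’ is read can be cross-checked by brute force: for each prefix of T’, list the suffixes of that prefix that are not yet present as substrings. This is only a sketch of the bookkeeping, not of Ukkonen’s algorithm, and the function names are hypothetical.

```python
def substrings(text: str) -> set[str]:
    """Every distinct non-empty substring of text."""
    n = len(text)
    return {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}

def new_nodes_per_phase(first: str, second: str) -> dict[str, list[str]]:
    """For each prefix of second, the suffixes of that prefix not yet in the trie."""
    present = substrings(first)
    added: dict[str, list[str]] = {}
    for end in range(1, len(second) + 1):
        prefix = second[:end]
        fresh = [prefix[i:] for i in range(end) if prefix[i:] not in present]
        present.update(fresh)
        added[prefix] = fresh
    return added

phases = new_nodes_per_phase("AATGTT", "AACATT")
for prefix, fresh in phases.items():
    print(prefix, fresh)
# A []
# AA []
# AAC ['AAC', 'AC', 'C']
# AACA ['AACA', 'ACA', 'CA']
# AACAT ['AACAT', 'ACAT', 'CAT']
# AACATT ['AACATT', 'ACATT', 'CATT', 'ATT']
```

The phases for A and AA add nothing, which corresponds to traversing the prefix of T’ that already exists in the tree. The phases for AACA and AACAT stop at the suffixes A and AT, which are already present, matching the "Nothing to do!" slides.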