Bioinformatics PhD. Course
1. Biological introduction
2. Comparison of short sequences (up to 10.000 bps): dot matrix, pairwise alignment, multiple alignment, hash algorithms
3. Comparison of large sequences (more than 10.000 bps): data structures, suffix trees, MUMs
4. String matching: exact, extended, approximate
5. Sequence assembly
6. Projects: PROMO, MREPATT, …
Comparison of large sequences
First part:
Alignment of large sequences
Dynamic programming
• Short sequences (up to 10.000 bps) can be aligned using dynamic programming.
• Quadratic cost of space and time.
acc.................................agt
|||.................................|xx
acc.................................a--
What about genomes?
accaccacaccacaacgagcata … acctgagcgatat
acc..t
Genomic sequences
In which cases can dynamic programming be applied?
• Genomic sequences have millions of base pairs.
• The sequences are 1.000 times longer.
• The running time is 1.000.000 times higher!
(1 second becomes 11 days; 1 minute becomes 2 years)
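The figures above follow directly from the quadratic cost. A minimal sketch of the arithmetic, assuming the running time grows exactly as the square of the sequence length (the function name `scaled_seconds` is made up for this illustration):

```python
def scaled_seconds(base_seconds, length_factor):
    """Running time after the input grows by length_factor, assuming cost ~ n^2."""
    return base_seconds * length_factor ** 2

DAY = 24 * 60 * 60
YEAR = 365 * DAY

one_second = scaled_seconds(1, 1000)    # sequences 1000 times longer
one_minute = scaled_seconds(60, 1000)

print(f"1 second -> {one_second / DAY:.1f} days")
print(f"1 minute -> {one_minute / YEAR:.1f} years")
```

The two printed values round to the 11 days and 2 years quoted on the slide.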
First assumption
[Figure: Genome A compared with Genome B]
Realistic assumption? Unrealistic assumption!
More realistic assumption
[Figure: Genome A compared with Genome B]
Realistic assumption? But is this now a real case?
Preview in a real case
Chlamydia muridarum: 1.084.689 bps; Chlamydia trachomatis: 1.057.413 bps
Preview in a real case
Pyrococcus abyssi: 1.790.334 bps; Pyrococcus horikoshii: 1.763.341 bps
Methodology of an alignment
1st: Make a preview. (Linear cost)
2nd: Identify the portions that can be aligned. (Linear cost)
3rd: Make the alignment.
Preview revisited
… a a t g … c t g …
… c g t g … c c c …
MUM: Maximal Unique Matching
Connect to MALGEN
Methodology of an alignment
1st: Make a preview. How can MUMs be found? Linear cost with suffix trees.
2nd: Identify the portions that can be aligned. How can these portions be determined?
3rd: Make the alignment. With CLUSTALW, TCOFFEE, …
Comparison of large sequences
M-GCAT
Todd Treangen
Homework
1. Javier
2. Dmitry
3. Ana Iris
4. David
5. Patricia
6. Rogeli
7. Atif
8. Aina
9. Isaac
10. Maria Merce
11. Romina
12. Guillem
13. Raul
14. Alexis
15. Ramon
Second part:
Introducing Suffix trees
Suffix trees
Given string ababaas:
Suffixes:
1: ababaas
2: babaas
3: abaas
4: baas
5: aas
6: as
7: s
Suffix tree (each leaf shows the rest of the suffix and its starting position):
root
  a
    ba
      baas,1
      as,3
    as,5
    s,6
  ba
    baas,2
    as,4
  s,7
What kind of queries?
Applications of Suffix trees
[Figure: suffix tree of ababaas]
1. Exact string matching
• Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
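The question above can be answered by walking down from the root, one character of the pattern at a time. The sketch below is only an illustration: it uses an uncompressed suffix trie built from nested dictionaries (quadratic size) rather than the compressed suffix tree of the slides, and the names `build_suffix_trie` and `contains` are invented for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie made of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """True if pattern can be spelt from the root, i.e. it occurs in the text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("ababaas")
for pattern in ("abab", "aab", "ab"):
    print(pattern, contains(trie, pattern))
```

Each query costs time proportional to the length of the pattern, whatever the length of the indexed sequence.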
Quadratic insertion algorithm
Given the string …
Invariant properties:
P1: the leaves of the suffixes from … have been inserted, and the suffix tree …
Quadratic insertion algorithm
Given the string ababaabbs, the suffixes are inserted one at a time, from suffix 1 (ababaabbs) to suffix 9 (s).
[Figure sequence: the suffix tree after each insertion]
Final suffix tree:
root
  a
    b
      a
        baabbs,1
        abbs,3
      bs,6
    abbs,5
  b
    a
      baabbs,2
      abbs,4
    bs,7
    s,8
  s,9
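The insertion steps shown above can be sketched as follows. This is a minimal illustration, not the course's implementation: every suffix is inserted from the root, an edge is split where the suffix stops matching, and a new leaf is hung there. It assumes the string ends with a character that occurs nowhere else (s in the example), so that no suffix is a prefix of another; the names `Node`, `insert_suffix`, `build_suffix_tree` and `show` are hypothetical.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}   # first character -> [edge label, child Node]
        self.leaf = leaf     # starting position of the suffix (leaves only)

def insert_suffix(root, suffix, position):
    node, rest = root, suffix
    while True:
        first = rest[0]
        if first not in node.children:
            # No edge starts with this character: hang a new leaf here.
            node.children[first] = [rest, Node(leaf=position)]
            return
        label, child = node.children[first]
        k = 0
        while k < len(label) and k < len(rest) and label[k] == rest[k]:
            k += 1
        if k < len(label):
            # Mismatch inside the edge: split the edge at offset k.
            middle = Node()
            middle.children[label[k]] = [label[k:], child]
            node.children[first] = [label[:k], middle]
            child = middle
        node, rest = child, rest[k:]

def build_suffix_tree(text):
    root = Node()
    for start in range(len(text)):
        insert_suffix(root, text[start:], start + 1)   # positions are 1-based
    return root

def show(node, depth=0):
    """Indented listing of the tree, children sorted by first character."""
    lines = []
    for first in sorted(node.children):
        label, child = node.children[first]
        tag = f",{child.leaf}" if child.leaf is not None else ""
        lines.append("  " * depth + label + tag)
        lines.extend(show(child, depth + 1))
    return lines

tree = build_suffix_tree("ababaabbs")
print("\n".join(show(tree)))
```

Each insertion may walk a path as long as the suffix itself, which is where the quadratic cost comes from.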
Generalized suffix tree
The suffix tree of many strings is called the generalized suffix tree, and it is the suffix tree of the concatenation of the strings.
For instance, the generalized suffix tree of ababaabb and aabaa is the suffix tree of ababaabbαaabaaβ.
Generalized suffix tree
Given the suffix tree of ababaabbα:
[Figure: suffix tree of ababaabbα]
Construction of the suffix tree of ababaabbαaabaaβ: the suffixes of aabaaβ are inserted one at a time into the suffix tree of ababaabbα.
[Figure sequence: the tree after each insertion]
Generalized suffix tree of ababaabbαaabaaβ (leaves ending in α belong to the first string, leaves ending in β to the second; positions are counted within each string):
root
  a
    a
      b
        bα,5
        aaβ,1
      β,4
    b
      a
        baabbα,1
        a
          bbα,3
          β,2
      bα,6
    β,5
  b
    a
      baabbα,2
      a
        bbα,4
        β,3
    bα,7
    α,8
  α,9
  β,6
Applications of Suffix trees
[Figure: suffix tree of ababaas]
1. Exact string matching
• Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
Applications of Suffix trees
2. The substring problem for a database of strings DB
• Does the DB contain any occurrence of the patterns abab, aab, and ab?
[Figure: generalized suffix tree of ababaabbαaabaaβ]
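A small sketch of the substring problem, assuming the database holds the two example strings ababaabb and aabaa. For brevity it uses an uncompressed trie of all suffixes in which every node remembers which strings pass through it; a real generalized suffix tree stores the same information in linear space. The names `build_generalized_trie` and `strings_containing` are invented for this illustration.

```python
def build_generalized_trie(strings):
    """Trie of all suffixes of all strings; each node records the ids of the
    strings that contain the substring spelt from the root to that node."""
    root = {"ids": set(), "next": {}}
    for sid, text in enumerate(strings):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["next"].setdefault(ch, {"ids": set(), "next": {}})
                node["ids"].add(sid)
    return root

def strings_containing(trie, pattern):
    """Ids of the database strings in which pattern occurs."""
    node = trie
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return set()
    return set(node["ids"])

db = ["ababaabb", "aabaa"]
trie = build_generalized_trie(db)
for pattern in ("abab", "aab", "ab"):
    print(pattern, sorted(strings_containing(trie, pattern)))
```

Here id 0 stands for ababaabb and id 1 for aabaa.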
Applications of Suffix trees
3. The longest common substring of two strings
[Figure: generalized suffix tree of ababaabbαaabaaβ]
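In the generalized tree, the longest common substring is spelt by the deepest node that has suffixes of both strings below it. A minimal sketch of that idea, again on an uncompressed trie for simplicity (an assumption of this example, as is the function name `longest_common_substring`):

```python
def longest_common_substring(s1, s2):
    # Trie of all suffixes; each node keeps the set of source strings
    # (0 for s1, 1 for s2) whose suffixes pass through it.
    root = {"ids": set(), "next": {}}
    for sid, text in enumerate((s1, s2)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["next"].setdefault(ch, {"ids": set(), "next": {}})
                node["ids"].add(sid)
    # Depth-first search restricted to nodes shared by both strings.
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node["next"].items():
            if child["ids"] == {0, 1}:
                stack.append((child, path + ch))
    return best

print(longest_common_substring("ababaabb", "aabaa"))
```

For the two example strings the only common substring of length 4 is abaa, and no longer one exists.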
Applications of Suffix trees
5. Finding MUMs.
[Figure: generalized suffix tree of ababaabbαaabaaβ]
Third part:
Suffix links
Suffix links
[Figure sequence: suffix links drawn on the suffix tree of ababaabbα]
Traversal using Suffix links
[Figure sequence: traversal of the suffix tree of ababaabbα]
Given S2 = a a b a a b b a
Unique matchings:
aa in S2[1]; aab in S2[1] = S1[5..6-7] in S2[1]
S1[3..6-8] in S2[2]
S1[4..6-8] in S2[3]
S1[5..8] in S2[4]
S1[6..8] in S2[5]
S1[7..8] in S2[6]
From UMs to MUMs
Given S1 = a b a b a a b b α and S2 = a a b a a b b a
Unique matchings: S1[5..8] in S2[4], S1[3..6-8] in S2[2], S1[4..6-8] in S2[3], S1[6..8] in S2[5], S1[7..8] in S2[6]
Array of UMs:
1:
2:
3: 6-8
4: 6-8
5: 8
6: 8
7: 8
8:
9:
MUM: S1[3..6-8] in S2[2]
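The result above can be cross-checked by brute force. The sketch below does not use suffix trees or suffix links: it simply tries every pair of starting positions, keeps the matches that cannot be extended to the left or to the right, and then keeps those that occur exactly once in each string. The function names and the `min_len` parameter are assumptions of this illustration, and the cost is far from linear.

```python
def count_occurrences(text, word):
    """Number of (possibly overlapping) occurrences of word in text."""
    return sum(text.startswith(word, p) for p in range(len(text) - len(word) + 1))

def find_mums(s1, s2, min_len=1):
    """Maximal unique matchings as (start in s1, start in s2, length), 0-based."""
    mums = []
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] != s2[j]:
                continue
            if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
                continue                      # extendable to the left: not maximal
            k = 0
            while i + k < len(s1) and j + k < len(s2) and s1[i + k] == s2[j + k]:
                k += 1                        # extend to the right as far as possible
            word = s1[i:i + k]
            if (k >= min_len
                    and count_occurrences(s1, word) == 1
                    and count_occurrences(s2, word) == 1):
                mums.append((i, j, k))
    return mums

print(find_mums("ababaabb", "aabaabba", min_len=4))
```

With a minimum length of 4, the only matching reported starts at position 3 of S1 and position 2 of S2 (counting from 1) and has length 6, that is S1[3..8], the MUM of the slide.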
Third part:
Linear insertion algorithm
Quadratic insertion algorithm
Given the string …
Invariant properties:
P1: the leaves of the suffixes from … have been inserted, and the suffix tree …
Linear insertion algorithm
Given the string …
Invariant properties:
P1: the leaves of the suffixes from … have been inserted, and the suffix tree …
P2: the string … is the longest string that can be spelt through the tree.
Linear insertion algorithm: example
Given the string ababaababb...
[Figure sequence: starting from the tree that already holds the leaves of suffixes 1 to 5, the leaves b...,6, b...,7, b...,8 and b...,9 are added in turn]
Index
Suffix arrays
"Suffix arrays: a new method for on-line string searches", G. Myers, U. Manber
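Suffix arrays
Given string ababaa#:
Suffixes:
1: ababaa#
2: babaa#
3: abaa#
4: baa#
5: aa#
6: a#
7: #
… but lexicographically sorted (taking # as the smallest character):
array index 1 → 7: #
array index 2 → 6: a#
array index 3 → 5: aa#
array index 4 → 3: abaa#
array index 5 → 1: ababaa#
array index 6 → 4: baa#
array index 7 → 2: babaa#
What is the cost? O(n log(n))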
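A minimal way to obtain the suffix array of the example is to sort the starting positions by the suffix that begins there. This is only a sketch: a plain comparison sort compares whole suffixes, so its worst case is worse than the O(n log(n)) quoted above, which needs a dedicated construction algorithm. It also relies on # sorting before the letters, which holds for Python string comparison. The name `suffix_array` is chosen for this example.

```python
def suffix_array(text):
    """Starting positions (1-based) of the suffixes of text, in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

text = "ababaa#"
for rank, start in enumerate(suffix_array(text), 1):
    print(rank, start, text[start - 1:])
```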
Applications of suffix arrays
1. Exact string matching
• Does the sequence ababaa# contain any occurrence of the patterns abab, aab, and ab?
[Figure: sorted suffix array of ababaa#]
Binary search … what is the cost? O(log(n) |P|)
Can it be improved to O(log(n)+|P|)?
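The plain binary search can be sketched as below: every step compares the pattern with the first |P| characters of one suffix, which gives the O(log(n) |P|) cost. The function names are invented for this illustration, and the suffix array is built by simple sorting.

```python
def suffix_array(text):
    """Starting positions (0-based) of the suffixes, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """1-based starting positions of pattern in text, via two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                       # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(start + 1 for start in sa[first:lo])

text = "ababaa#"
sa = suffix_array(text)
for pattern in ("abab", "aab", "ab"):
    print(pattern, find_occurrences(text, sa, pattern))
```

All suffixes that start with the pattern are adjacent in the array, so the two searches delimit every occurrence at once.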
Fast search with cost O(log(n)+|P|)
Query: …
[Figure: suffix array with positions 1, 2, …, n; α, γ and β mark positions in the array]
Invariant properties:
P1: α < query ≤ β
P2: … matches pref(query)
Algorithm:
If suff(γ) < suff(query) then α = γ else β = γ
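The invariant and the update rule above can be sketched as follows. This is a simplified variant and an assumption of this example, not necessarily the algorithm of the slide: it remembers how many characters of the query are known to match at the borders α and β, and skips that many characters at γ. This avoids many repeated comparisons, but the guaranteed O(log(n)+|P|) bound of Manber and Myers also needs precomputed longest-common-prefix information between suffixes, which is not built here. All names are hypothetical.

```python
def suffix_array(text):
    """Starting positions (0-based) of the suffixes, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def fast_lower_bound(text, sa, query):
    """Index of the first suffix in sa that is >= query (len(sa) if none)."""
    alpha, beta = -1, len(sa)          # virtual borders: alpha < query <= beta
    lcp_alpha = lcp_beta = 0           # query characters known to match at each border
    while beta - alpha > 1:
        gamma = (alpha + beta) // 2
        start = sa[gamma]
        k = min(lcp_alpha, lcp_beta)   # these characters need not be compared again
        while k < len(query) and start + k < len(text) and text[start + k] == query[k]:
            k += 1
        if k < len(query) and (start + k == len(text) or text[start + k] < query[k]):
            alpha, lcp_alpha = gamma, k    # suffix at gamma is smaller than the query
        else:
            beta, lcp_beta = gamma, k      # suffix at gamma is >= the query
    return beta

def occurs(text, sa, query):
    i = fast_lower_bound(text, sa, query)
    return i < len(sa) and text.startswith(query, sa[i])

text = "ababaa#"
sa = suffix_array(text)
for query in ("abab", "aab", "ab"):
    print(query, occurs(text, sa, query))
```

Skipping min(lcp_alpha, lcp_beta) characters is safe because every suffix between the two borders shares at least that many leading characters with the query.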