yutu liu -- cpsc 689 algorithmic techniques for biology spring 2005
A memory-efficient algorithm for multiple sequence alignment with
constraints
Chin Lung Lu and Yen Pin Huang
National Chiao Tung University, Taiwan, Republic of China
Bioinformatics, Vol. 21 no. 1 2005
Yutu Liu -- CPSC 689 Algorithmic Techniques for Biology Spring 2005
2
Motivation
Incorporate known biological structures and consensus sites into sequence alignment
Keep the alignment algorithm memory-efficient
3
Problem Formulation -- Constraints
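What is multiple sequence alignment with constraints?
Sequences:
  S1 = A T C T C G C T
  S2 = T G C A T A T
Constraints: C1 = AT, C2 = T
Two alignments of S1 and S2:
  A T -- C -- T C G C T
  -- T G C A T -- -- A T

  -- -- -- A T C T C G C T
  T G C A T A T -- -- -- --
Constraints are conserved sites of a protein or DNA/RNA family
$\Omega = (C_1, C_2, \ldots, C_\gamma)$, where $C_i = c^i_1 c^i_2 \cdots c^i_{\lambda_i}$
No overlapping between them; they appear in the order $C_1, C_2, \ldots, C_\gamma$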
4
Problem Formulation -- Constraints
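Example: $S_j$ = T G C A T A T and $C_i = c^i_1 c^i_2$ = G A, so $\lambda_i = 2$
For a substring $S'_j$ of $S_j$ of length $\lambda_i$, the Hamming distance is $d_H(S'_j, C_i) = 1$
With $\varepsilon = 0.5$, $\varepsilon \lambda_i = 1$
Since $d_H(S'_j, C_i) \le \varepsilon \lambda_i$, $C_i$ approximately appears in $S_j$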
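The approximate-appearance test on this slide can be sketched in Python as follows. This is a minimal illustration, not code from the paper: the names `hamming` and `approximately_appears` and the parameter `eps` (the error ratio, 0.5 in the example) are assumptions, and the threshold is assumed to be `eps * len(constraint)`.

```python
def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(u, v))


def approximately_appears(seq, constraint, eps):
    """True if some substring of seq, of the same length as constraint,
    is within Hamming distance eps * len(constraint) of it (assumed threshold)."""
    lam = len(constraint)
    for start in range(len(seq) - lam + 1):
        if hamming(seq[start:start + lam], constraint) <= eps * lam:
            return True
    return False


if __name__ == "__main__":
    # The slide's example: "GA" against TGCATAT with error ratio 0.5.
    print(approximately_appears("TGCATAT", "GA", 0.5))
```

With `eps = 0` the test reduces to an exact substring search, which is why the same sequence fails for "GA" in that case.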
5
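Problem Formulation -- Constraints
Given $S = \{s_1, s_2, \ldots, s_x\}$ and a string $T = t_1 t_2 \cdots t_k$, $T$ approximately appears in an alignment $L$ of $S$ if $L$ has a band $L'$ such that $d_H(\mathrm{subseq}(S_i, L'), T) \le \varepsilon \cdot k$ for $1 \le i \le x$
Alignment L:
  A T G C A T C G C T
  -- T G C A T -- -- A T
  T T G C A T C A T C
[Figure: a band L' of L is highlighted, with Subseq(S2, L') marked; the string T G C C C is shown below the alignment.]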
6
Problem Formulation
Constrained Multiple Sequence Alignment (CMSA)
Let $S = \{S_1, S_2, \ldots, S_x\}$ over the alphabet $\Sigma$. Let $\Omega = (C_1, C_2, \ldots, C_\gamma)$ be an ordered set of constraints. Then the CMSA of $S$ w.r.t. $\Omega$ is an alignment $L$ of $S$ over $\Sigma \cup \{-\}$ with the optimal sum-of-pair score (SP score) in which all the constraints of $\Omega$ approximately appear in the order of $C_1 C_2 \ldots C_\gamma$ such that $d_H(\mathrm{subseq}(S_i, L'_j), C_j) \le \varepsilon \lambda_j$ for all $1 \le i \le x$ and $1 \le j \le \gamma$, where $L'_j$ is the band of $L$ whose induced consensus is $C_j$.
[Figure: sequences S1, S2, S3 aligned with constraint bands C1, C2, C3; labels: Optimal Sum-of-Pair Score, CPSA.]
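The sum-of-pair score named in the definition can be illustrated with a short Python sketch. The scoring values (`match=1`, `mismatch=-1`, `gap=-2`, and 0 for a gap paired with a gap) and the function names are assumptions chosen for illustration only; the slides do not specify a scoring scheme.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score of one aligned pair of symbols (assumed scoring scheme)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sp_score(rows, match=1, mismatch=-1, gap=-2):
    """Sum-of-pair score: add pair_score over every column and every
    unordered pair of rows of the alignment."""
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all rows of an alignment must have equal length")
    total = 0
    for col in range(width):
        for r, s in combinations(rows, 2):
            total += pair_score(r[col], s[col], match, mismatch, gap)
    return total


if __name__ == "__main__":
    print(sp_score(["AT-", "ATC", "A-C"]))
```

For two sequences the SP score is just the ordinary pairwise alignment score, which is why the pairwise case (CPSA) can serve as the kernel of the multiple-sequence problem.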
7
CMSA
Pick two sequences
Find the CPSA
Use it as a kernel to progressively align more sequences
[1] Progressive Multiple Alignment with Constraints, Gene Myers et al.
[2] MuSiC: A Tool for Multiple Sequence Alignment with Constraints, Yin Te Tsai, Chin Lung Lu, Ching Ta Yu, Yen Pin Huang
8
Algorithm Overview
Find the recursive relationship between $M(i-1, j-1)$ and $M(i, j)$, where the pair $(a_{i-1}, b_{j-1})$ is followed by the pair $(a_i, b_j)$
Divide-and-Conquer
9
Notations
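$A = a_1 a_2 \cdots a_m$
$B = b_1 b_2 \cdots b_n$
$\mathrm{pref}(A, i) = a_1 a_2 \cdots a_i = A_i$
$\mathrm{suff}(A, i) = a_{i+1} a_{i+2} \cdots a_m = \bar{A}_i$
$A = \underbrace{a_1 a_2 \cdots a_i}_{A_i} \; \underbrace{a_{i+1} a_{i+2} \cdots a_m}_{\bar{A}_i}$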
10
Notations
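$\mathrm{pref}(B, j) = b_1 b_2 \cdots b_j = B_j$
$\mathrm{suff}(B, j) = b_{j+1} b_{j+2} \cdots b_n = \bar{B}_j$
$\mathrm{pref}(\Omega, k) = C_1 C_2 \cdots C_k = \Omega_k$
$\mathrm{suff}(\Omega, k) = C_{k+1} C_{k+2} \cdots C_\gamma = \bar{\Omega}_k$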
11
Notation
[Figure: $A = a_1 a_2 \cdots a_i \, a_{i+1} \cdots a_{m-1} a_m$ split into $A_i$ and $\bar{A}_i$; $B = b_1 b_2 \cdots b_j \, b_{j+1} \cdots b_{n-1} b_n$ split into $B_j$ and $\bar{B}_j$; constraints $C_1 \ldots C_k \ldots C_\gamma$. The scores $M_k(i,j)$, $M^S_k(i,j)$, $M^D_k(i,j)$ and $M^I_k(i,j)$ are marked on each side of the split.]
12
Alignment Score
Let $M_k(i, j)$ be the score of an optimal constrained alignment of $A_i$ and $B_j$ w.r.t. $\Omega_k$.
[Figure: A and B aligned with constraint bands C1, C2, ..., Ck.]
13
Alignment Score - Substitution
Let $M^S_k(i, j)$ be the maximum score of all constrained alignments of $A_i$ and $B_j$ w.r.t. $\Omega_k$ that end with a substitution pair $(a_i, b_j)$.
[Figure: A and B aligned with constraint bands C1, C2, ..., Ck; the last column pairs ai with bj.]
14
Alignment Score -- Deletion
Let $M^D_k(i, j)$ be the maximum score of all constrained alignments of $A_i$ and $B_j$ w.r.t. $\Omega_k$ that end with a deletion pair $(a_i, -)$.
[Figure: A and B aligned with constraint bands C1, C2, ..., Ck; the last column pairs ai with a gap (--).]
15
Alignment Score -- Insertion
Let $M^I_k(i, j)$ be the maximum score of all constrained alignments of $A_i$ and $B_j$ w.r.t. $\Omega_k$ that end with an insertion pair $(-, b_j)$.
[Figure: A and B aligned with constraint bands C1, C2, ..., Ck; the last column pairs a gap (--) with bj.]
16
Semi-Constrained Alignment
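A semi-constrained alignment of $A_i$ and $B_j$ w.r.t. $\Omega_k$ is a constrained alignment of $A_i$ and $B_j$ w.r.t. $\Omega_{k-1}$ that ends with a band which is a prefix of $C_k$.
[Figure: A and B aligned with bands C1, C2, ..., Ck-1, followed by a final band of width h for $\mathrm{pref}(C_k, h)$; the score is labelled $N_k(i, j, h)$.]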
17
Recurrence of Scores
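If $k = 0$, then
$$M_k(i, j) = \max \begin{cases} M_k(i-1, j-1) + \sigma(a_i, b_j) \\ M^D_k(i, j) \\ M^I_k(i, j) \end{cases}$$
where $\sigma(a_i, b_j)$ is the score of the substitution pair $(a_i, b_j)$.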
18
Recurrence of Scores
If $1 \le k \le \gamma$, then
$$M_k(i, j) = \max \begin{cases} M_k(i-1, j-1) + \sigma(a_i, b_j) \\ M^D_k(i, j) \\ M^I_k(i, j) \\ N_k(i, j, \lambda_k) \end{cases}$$
19
Recurrence of Scores
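If $d_H(\mathrm{suff}(A_i, \lambda_k), C_k) \le \varepsilon \lambda_k$ and $d_H(\mathrm{suff}(B_j, \lambda_k), C_k) \le \varepsilon \lambda_k$, then
$$N_k(i, j, \lambda_k) = M_{k-1}(i - \lambda_k, j - \lambda_k) + \sum_{h=0}^{\lambda_k - 1} \sigma(a_{i-h}, b_{j-h})$$
else
$$N_k(i, j, \lambda_k) = -\infty$$
Here $\mathrm{suff}(A_i, \lambda_k)$ and $\mathrm{suff}(B_j, \lambda_k)$ are the last $\lambda_k$ characters of $A_i$ and $B_j$, which form the band for $C_k$, and $\sigma$ is the substitution score:
$a_1 a_2 \cdots a_{i-h} \cdots a_{i-1} a_i$
$b_1 b_2 \cdots b_{j-h} \cdots b_{j-1} b_j$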
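The recurrences on the last three slides can be put together in a small Python sketch that computes only the optimal score. It is a simplified illustration under stated assumptions, not the paper's implementation: it ignores the open-gap penalty (so a gap column always costs `w_e`), it keeps the full table of size (number of constraints + 1) x (m+1) x (n+1), and the names `cpsa_score`, `sigma`, `eps` and `w_e` are hypothetical.

```python
NEG = float("-inf")


def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))


def cpsa_score(a, b, constraints, eps, sigma, w_e):
    """Score of an optimal constrained pairwise alignment, linear gaps only.

    M[k][i][j] is the best score for a[:i], b[:j] using the first k
    constraints; NEG marks prefixes that cannot satisfy them.
    """
    m, n, g = len(a), len(b), len(constraints)
    M = [[[NEG] * (n + 1) for _ in range(m + 1)] for _ in range(g + 1)]
    for k in range(g + 1):
        c = constraints[k - 1] if k else ""
        lam = len(c)
        for i in range(m + 1):
            for j in range(n + 1):
                best = 0 if (k == 0 and i == 0 and j == 0) else NEG
                if i and j:  # substitution pair (a_i, b_j)
                    best = max(best, M[k][i - 1][j - 1] + sigma(a[i - 1], b[j - 1]))
                if i:        # deletion pair (a_i, -)
                    best = max(best, M[k][i - 1][j] - w_e)
                if j:        # insertion pair (-, b_j)
                    best = max(best, M[k][i][j - 1] - w_e)
                if k and i >= lam and j >= lam:
                    # Band for C_k: last lam characters of both prefixes.
                    sa, sb = a[i - lam:i], b[j - lam:j]
                    if hamming(sa, c) <= eps * lam and hamming(sb, c) <= eps * lam:
                        band = sum(sigma(x, y) for x, y in zip(sa, sb))
                        best = max(best, M[k - 1][i - lam][j - lam] + band)
                M[k][i][j] = best
    return M[g][m][n]


def simple_sigma(x, y):
    """Assumed substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


if __name__ == "__main__":
    print(cpsa_score("AT", "TA", ["T"], 0.0, simple_sigma, 1))
```

A result of negative infinity means no alignment satisfies the constraints under the given error ratio. The three-dimensional table is exactly the memory cost that the divide-and-conquer version is designed to avoid.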
20
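How does $M^D_k(i, j)$ follow from $M_k(i-1, j)$?
$a_1 a_2 \cdots a_{i-1} a_i$
$b_1 b_2 \cdots b_{j-1} b_j$
It depends on the last pair of the alignment of $A_{i-1}$ and $B_j$ ($w_o$: open-gap penalty, $w_e$: gap-extension penalty):
Substitution $(a_{i-1}, b_j)$: $M^D_k(i, j) = M^S_k(i-1, j) - w_o - w_e$
Deletion $(a_{i-1}, -)$: $M^D_k(i, j) = M^D_k(i-1, j) - w_e$
Insertion $(-, b_j)$: $M^D_k(i, j) = M^I_k(i-1, j) - w_o - w_e$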
21
Recurrence
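$$M^D_k(i, j) = \max \begin{cases} M^S_k(i-1, j) - w_o - w_e \\ M^I_k(i-1, j) - w_o - w_e \\ M^D_k(i-1, j) - w_e \end{cases}$$
Adding the candidate $M^D_k(i-1, j) - w_o - w_e$ does not change the maximum, because it never exceeds $M^D_k(i-1, j) - w_e$. The candidates with penalty $w_o + w_e$ then merge into $M_k(i-1, j) - w_o - w_e$:
$$M^D_k(i, j) = \max \begin{cases} M_k(i-1, j) - w_o - w_e \\ M^D_k(i-1, j) - w_e \end{cases}$$
By symmetry,
$$M^I_k(i, j) = \max \begin{cases} M_k(i, j-1) - w_o - w_e \\ M^I_k(i, j-1) - w_e \end{cases}$$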
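The affine-gap recurrences for the substitution, deletion and insertion scores can be sketched in Python for the unconstrained case ($k = 0$). This is an illustration under assumptions: penalties `w_o` and `w_e` are taken as non-negative numbers that are subtracted, a gap of length L costs `w_o + L * w_e`, and the function names are hypothetical.

```python
NEG = float("-inf")


def affine_score(a, b, sigma, w_o, w_e):
    """Optimal global alignment score with affine gap penalties.

    S, D, I hold the best scores of alignments ending with a substitution,
    a deletion (a_i, -) or an insertion (-, b_j); M is their maximum.
    """
    m, n = len(a), len(b)
    S = [[NEG] * (n + 1) for _ in range(m + 1)]
    D = [[NEG] * (n + 1) for _ in range(m + 1)]
    I = [[NEG] * (n + 1) for _ in range(m + 1)]
    M = [[NEG] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            if i and j:
                S[i][j] = M[i - 1][j - 1] + sigma(a[i - 1], b[j - 1])
            if i:  # open a new gap after any alignment, or extend a deletion
                D[i][j] = max(M[i - 1][j] - w_o - w_e, D[i - 1][j] - w_e)
            if j:  # open a new gap after any alignment, or extend an insertion
                I[i][j] = max(M[i][j - 1] - w_o - w_e, I[i][j - 1] - w_e)
            M[i][j] = max(S[i][j], D[i][j], I[i][j])
    return M[m][n]


def simple_sigma(x, y):
    """Assumed substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


if __name__ == "__main__":
    print(affine_score("AAA", "A", simple_sigma, 2, 1))
```

Keeping a separate deletion and insertion table is what lets the recurrence tell "extend the current gap" apart from "open a new one"; setting `w_o = 0` collapses it to the linear-gap case.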
22
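[Figure: three-dimensional dynamic programming lattice with axes Sequence A, Sequence B and Constraints, running from (0, 0, 0) to (m, n, γ). The cell (i, j, k) is shown with its neighbours (i, j-1, k), (i-1, j-1, k) and (i-1, j, k).]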
23
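$$N_k(i, j, \lambda_k) = M_{k-1}(i - \lambda_k, j - \lambda_k) + \sum_{h=0}^{\lambda_k - 1} \sigma(a_{i-h}, b_{j-h})$$
[Figure: the term $N_k$ marked in the lattice.]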
24
Assignment
Design an algorithm to find the CPSA using dynamic programming
technique. Analyze the time and space complexity of your algorithm. For
simplicity, you can ignore the open-gap penalty. Prove your algorithm
is consistent with the constraint set $\Omega$, i.e., it will find such a CPSA if
there exists one.
Email: [email protected]
25
Divide-and-Conquer
26
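[Figure: the alignment is cut inside constraint $C_k$ at offset $h$, into pref(Ck, h) and suff(Ck, λk - h), with constraints $C_1 \ldots C_k \ldots C_\gamma$ along the bottom.
Left of the cut: $A_i = a_1 a_2 \cdots a_i$ and $B_j = b_1 b_2 \cdots b_j$ with $\Omega_k(h) = [C_1, C_2, \ldots, C_{k-1}, \mathrm{pref}(C_k, h)]$ and scores $M_k(i,j)$, $M^S_k(i,j)$, $M^D_k(i,j)$, $M^I_k(i,j)$, $N_k(i,j,h)$.
Right of the cut: $\bar{A}_i = a_{i+1} \cdots a_{m-1} a_m$ and $\bar{B}_j = b_{j+1} \cdots b_{n-1} b_n$ with $\bar{\Omega}_k(h) = [\mathrm{suff}(C_k, \lambda_k - h), C_{k+1}, \ldots, C_\gamma]$ and the corresponding scores for the suffixes.]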
27
Divide-and-Conquer
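The alignment is split into $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ and $\bar{L}_{k_{mid}}(\bar{A}_{i_{mid}}, \bar{B}_{j_{mid}})$. Overlined scores $\bar{M}$, $\bar{N}$ are the counterparts of $M$, $N$ computed on the suffixes.
Case 1: if the last pair of $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ is a substitution
A. $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ and $\bar{L}_{k_{mid}}(\bar{A}_{i_{mid}}, \bar{B}_{j_{mid}})$ are optimal constrained:
$$M_\gamma(m, n) = M^S_{k_{mid}}(i_{mid}, j_{mid}) + \bar{M}_{k_{mid}}(i_{mid}, j_{mid})$$
B. $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ and $\bar{L}_{k_{mid}}(\bar{A}_{i_{mid}}, \bar{B}_{j_{mid}})$ are optimal semi-constrained:
$$M_\gamma(m, n) = N_{k_{mid}}(i_{mid}, j_{mid}, h_{mid}) + \bar{N}_{k_{mid}}(i_{mid}, j_{mid}, h_{mid})$$
C. if $h_{mid} = \lambda_{k_{mid}}$:
$$M_\gamma(m, n) = N_{k_{mid}}(i_{mid}, j_{mid}, \lambda_{k_{mid}}) + \bar{M}_{k_{mid}}(i_{mid}, j_{mid})$$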
28
Divide-and-Conquer
The alignment is split into $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ and $\bar{L}_{k_{mid}}(\bar{A}_{i_{mid}}, \bar{B}_{j_{mid}})$.
Case 2: if the last pair of $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ is a deletion
A. If the first pair of $\bar{L}_{k_{mid}}(\bar{A}_{i_{mid}}, \bar{B}_{j_{mid}})$ is not a deletion pair
$$M_\gamma(m, n) = \max\{ M^D_{k_{mid}}(i_{mid}, j_{mid}) + \bar{M}^S_{k_{mid}}(i_{mid}, j_{mid}),\; M^D_{k_{mid}}(i_{mid}, j_{mid}) + \bar{M}^I_{k_{mid}}(i_{mid}, j_{mid}) \}$$
B. If the first pair of $\bar{L}_{k_{mid}}(\bar{A}_{i_{mid}}, \bar{B}_{j_{mid}})$ is a deletion pair
$$M_\gamma(m, n) = M^D_{k_{mid}}(i_{mid}, j_{mid}) + \bar{M}^D_{k_{mid}}(i_{mid}, j_{mid}) + w_o$$
29
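Summary
$$M_\gamma(m, n) = \max \begin{cases}
M^D_{k_{mid}}(i_{mid}, j_{mid}) + \bar{M}^I_{k_{mid}}(i_{mid}, j_{mid}) \\
M^D_{k_{mid}}(i_{mid}, j_{mid}) + \bar{M}^S_{k_{mid}}(i_{mid}, j_{mid}) \\
M^S_{k_{mid}}(i_{mid}, j_{mid}) + \bar{M}_{k_{mid}}(i_{mid}, j_{mid}) \\
M^D_{k_{mid}}(i_{mid}, j_{mid}) + \bar{M}^D_{k_{mid}}(i_{mid}, j_{mid}) + w_o \\
N_{k_{mid}}(i_{mid}, j_{mid}, h_{mid}) + \bar{N}_{k_{mid}}(i_{mid}, j_{mid}, h_{mid}) \\
N_{k_{mid}}(i_{mid}, j_{mid}, \lambda_{k_{mid}}) + \bar{M}_{k_{mid}}(i_{mid}, j_{mid})
\end{cases}$$
Overlined scores are the counterparts computed on the suffixes $\bar{A}_{i_{mid}}$ and $\bar{B}_{j_{mid}}$.
[Figure: the terms $M_{k_{mid}}(i_{mid}, j_{mid})$, $M^D_{k_{mid}}(i_{mid}, j_{mid})$ and $M^D_{k_{mid}}(i_{mid}, j_{mid}) + \bar{M}^D_{k_{mid}}(i_{mid}, j_{mid})$ are called out.]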
30
Summary
With $i_{mid} = m/2$, take $j$, $k$, $h$ as indices $j_{mid}$, $k_{mid}$ and $h_{mid}$, where $1 \le j \le n$, $0 \le k \le \gamma$ and $1 \le h \le \lambda_k$, such that the following value is the maximum.
$$M_\gamma(m, n) = \max \begin{cases}
M^D_k(i_{mid}, j) + \bar{M}_k(i_{mid}, j) \\
M^S_k(i_{mid}, j) + \bar{M}_k(i_{mid}, j) \\
M^D_k(i_{mid}, j) + \bar{M}^D_k(i_{mid}, j) + w_o \\
N_k(i_{mid}, j, h) + \bar{N}_k(i_{mid}, j, h) \\
N_k(i_{mid}, j, \lambda_k) + \bar{M}_k(i_{mid}, j)
\end{cases}$$
$$M_\gamma(m, n) = \max\{ F(i_{mid}, j, k, h) \}$$
31
Implementation -- CPSA-DC()
Algorithm CPSA-DC($i_{start}$, $i_{end}$, $j_{start}$, $j_{end}$, $k_{start}$, $k_{end}$)
{
1. Divide A into two halves at $i_{mid}$, then call BestScore() and BestScoreRev(), where the sizes of B and $\Omega$ are not changed.
2. BestScore() and BestScoreRev() return all the alignment scores of $(A_{i_{mid}}, B_j, \Omega_k)$ and $(\bar{A}_{i_{mid}}, \bar{B}_j, \bar{\Omega}_k)$.
3. Find $\max\{F(i_{mid}, j, k, h)\}$, where the values of $j$, $k$, $h$ will be used as the middle point to divide the alignment for the recursive calls of CPSA-DC().
}
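The split step can be illustrated with a much simpler relative of the algorithm: a linear-space split for plain pairwise alignment with linear gaps and no constraints. The sketch below is not CPSA-DC itself. It drops the constraint index $k$, the band offset $h$ and the open-gap penalty, and the names `best_score`, `best_score_rev` and `split` are only loosely modelled on BestScore() and BestScoreRev().

```python
def best_score(a, b, sigma, w_e):
    """Scores of aligning all of a against every prefix b[:j],
    keeping only two rows of the table in memory."""
    prev = [-j * w_e for j in range(len(b) + 1)]
    for x in a:
        cur = [prev[0] - w_e]
        for j, y in enumerate(b, 1):
            cur.append(max(prev[j - 1] + sigma(x, y),  # substitution
                           prev[j] - w_e,              # deletion
                           cur[j - 1] - w_e))          # insertion
        prev = cur
    return prev


def best_score_rev(a, b, sigma, w_e):
    """Scores of aligning all of a against every suffix b[j:]."""
    return best_score(a[::-1], b[::-1], sigma, w_e)[::-1]


def split(a, b, sigma, w_e):
    """Cut a in the middle and find the column j of b where an optimal
    alignment crosses the cut; returns (i_mid, j_mid, total score)."""
    i_mid = len(a) // 2
    fwd = best_score(a[:i_mid], b, sigma, w_e)
    rev = best_score_rev(a[i_mid:], b, sigma, w_e)
    j_mid = max(range(len(b) + 1), key=lambda j: fwd[j] + rev[j])
    return i_mid, j_mid, fwd[j_mid] + rev[j_mid]


def simple_sigma(x, y):
    """Assumed substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


if __name__ == "__main__":
    print(split("ACGT", "AGT", simple_sigma, 1))
```

Recursing on the two halves `(a[:i_mid], b[:j_mid])` and `(a[i_mid:], b[j_mid:])` would recover the alignment itself. The constrained algorithm additionally maximises over $k$ and $h$ and corrects for a gap that spans the cut.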
32
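[Figure: A is cut at $A_{i_{mid}}$; $A = a_1 a_2 \cdots a_i \, a_{i+1} \cdots a_{m-1} a_m$ and $B = b_1 b_2 \cdots b_j \, b_{j+1} \cdots b_{n-1} b_n$ with $B_j$ and $\bar{B}_j$ on either side of column $j$; constraints $C_1 \ldots C_k \ldots C_\gamma$; the score $M_k(i, j)$ is marked.]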
33
Complexity
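A single matrix E of size $(\gamma + 1)(n + 1)$, with each entry holding 4 values for $M_k(i_{mid}, j)$, $M^S_k(i_{mid}, j)$, $M^D_k(i_{mid}, j)$, $M^I_k(i_{mid}, j)$, and $N_k(i_{mid}, j, h)$
Temporary space: V is the same size as E
Total space: $O(\alpha n)$, where $\alpha = \sum_k \lambda_k$ is the total length of the constraints
Let $T$ be the size of the original problem, which is proportional to $mn$; then the total time complexity of the CPSA-DC algorithm is equal to $T + \frac{T}{2} + \frac{T}{4} + \frac{T}{8} + \cdots = 2T$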
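The time bound is a geometric series. As a short worked step, writing $T$ for the size of the original problem (the symbol is an assumption made here) and assuming that the subproblems at each level of the recursion together cover half the area of the level above:

```latex
\[
  T + \frac{T}{2} + \frac{T}{4} + \frac{T}{8} + \cdots
  \;=\; T \sum_{r=0}^{\infty} \left(\frac{1}{2}\right)^{r}
  \;=\; T \cdot \frac{1}{1 - \tfrac{1}{2}}
  \;=\; 2T .
\]
```

So the divide-and-conquer version does at most about twice the work of a single pass over the full table, while storing only a few rows at a time.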
34
Experimental Results
35
Experimental Results
36
Discussion
No proof is given that the algorithm is consistent with the constraints
An optimal pairwise alignment of subsequences may prevent the overall multiple alignment from being optimal
37
Discussion
http://genome.life.nctu.edu.tw:8080/MUSICME/index.html
38
Assignment
Let $S = \{S_1, S_2\}$ over the alphabet $\Sigma$. Let $\Omega = (C_1, C_2, \ldots, C_\gamma)$ be an ordered set of constraints, where $C_i = c^i_1 c^i_2 \cdots c^i_{\lambda_i}$. Then the Constrained Pair-wise Sequence Alignment (CPSA) of $S$ w.r.t. $\Omega$ is an alignment $L$ of $S$ over $\Sigma \cup \{-\}$ with the optimal sum-of-pair score (SP score) in which all the constraints of $\Omega$ approximately appear in the order of $C_1 C_2 \ldots C_\gamma$ such that the Hamming distance $d_H(\mathrm{subseq}(S_i, L'_j), C_j) \le \varepsilon \lambda_j$ for all $1 \le i \le 2$ and $1 \le j \le \gamma$, where $\varepsilon$ is an error ratio between 0 and 1, and $L'_j$ is the band of $L$ whose induced consensus is $C_j$. A band is a block of consecutive columns in $L$.
Design an efficient algorithm to find the CPSA using dynamic programming technique or whatever method you prefer. For simplicity, you can ignore the open-gap penalty. Analyze the time and space complexity of your algorithm. Prove your algorithm is consistent with the constraint set $\Omega$, i.e., it will find such a CPSA if there exists one.
39
Reference
Efficient Constrained Multiple Sequence Alignment with Performance Guarantee. Francis Y.L. Chin, N.L. Ho, T.W. Lam, Prudence W.H. Wong, M.Y. Chan
Divide-and-conquer multiple alignment with segment-based constraints. Michael Sammeth, Burkhard Morgenstern and Jens Stoye
Multiple sequence alignment with the divide-and-conquer method. Jens Stoye
MuSiC: A Tool for Multiple Sequence Alignment with Constraints. Yin Te Tsai, Chin Lung Lu, Ching Ta Yu, Yen Pin Huang