overview of pairwise sequence alignment
1
Overview of Pairwise Sequence Alignment
• Dynamic Programming
  – Applied to optimization problems
  – Useful when
    • the problem can be recursively divided into sub-problems
    • the sub-problems are not independent
• Needleman-Wunsch is a global alignment technique that uses an iterative algorithm with no gap penalty (it can be extended to a fixed gap penalty).
• Smith-Waterman is a local alignment technique defined by a recurrence. It extends the Longest Common Subsequence (LCS) problem and can be generalized to solve both local and global alignment.
Presenter: 林哲鋒
2
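The Longest Common Subsequence (LCS) Problem
• First, what is a subsequence? A subsequence is the sequence obtained by deleting some (possibly zero) characters from a sequence. For example, pred, sdn, and predent are all subsequences of "president".
• Given two sequences, the LCS problem is to find a subsequence such that (1) it is a subsequence of both sequences and (2) it is as long as possible.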
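The definition of a subsequence can be checked mechanically. The sketch below is only an illustration; the function name `is_subsequence` is an assumption and does not come from the slides.

```python
def is_subsequence(candidate: str, text: str) -> bool:
    """Return True if `candidate` can be obtained from `text` by deleting characters."""
    remaining = iter(text)
    # `ch in remaining` consumes the iterator up to and including the first match,
    # so each later character can only match further to the right.
    return all(ch in remaining for ch in candidate)


for word in ("pred", "sdn", "predent"):
    print(word, is_subsequence(word, "president"))  # each prints True
print(is_subsequence("tne", "president"))  # False: the letters appear in a different order
```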
3
LCS
Example:
Sequence 1: president
Sequence 2: providence
One LCS is priden (PResIDENt, PRovIDENce).
4
LCS
Another example:
Sequence 1: algorithm
Sequence 2: alignment
One LCS is algm (ALGorithM, ALiGnMent); algt is another.
5
How to compute LCS?
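• Given two sequences $A = a_1 a_2 \ldots a_m$ and $B = b_1 b_2 \ldots b_n$, let len(i, j) denote the length of an LCS of $a_1 \ldots a_i$ and $b_1 \ldots b_j$. The following recurrence computes len(i, j):

$$
len(i, j) =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0, \\
len(i-1, j-1) + 1 & \text{if } i, j > 0 \text{ and } a_i = b_j, \\
\max(len(i, j-1),\ len(i-1, j)) & \text{if } i, j > 0 \text{ and } a_i \ne b_j.
\end{cases}
$$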
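The recurrence can be read directly as a top-down procedure. Because the sub-problems overlap, the sketch below caches each len(i, j) so it is computed only once. The name `lcs_length` and the use of `functools.lru_cache` are choices made for this illustration, not part of the slides.

```python
from functools import lru_cache


def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence, computed top-down from the recurrence."""

    @lru_cache(maxsize=None)
    def length(i: int, j: int) -> int:
        if i == 0 or j == 0:
            return 0
        if a[i - 1] == b[j - 1]:  # a_i = b_j (the slides index from 1, Python from 0)
            return length(i - 1, j - 1) + 1
        return max(length(i, j - 1), length(i - 1, j))

    return length(len(a), len(b))


print(lcs_length("president", "providence"))  # 6
print(lcs_length("algorithm", "alignment"))   # 4
```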
6
procedure LCS-Length(A, B)
for i ← 0 to m do len(i, 0) = 0
for j ← 1 to n do len(0, j) = 0
for i ← 1 to m do
    for j ← 1 to n do
        if a_i = b_j
            then len(i, j) = len(i-1, j-1) + 1; prev(i, j) = "↖"
        else if len(i-1, j) ≥ len(i, j-1)
            then len(i, j) = len(i-1, j); prev(i, j) = "↑"
        else len(i, j) = len(i, j-1); prev(i, j) = "←"
return len and prev

The two non-matching branches are the gap cases (insertion and deletion).
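A bottom-up version of LCS-Length might look like the following sketch. The direction labels "diag", "up", and "left" stand in for the arrow symbols, and the function name `lcs_tables` is an assumption made for this example. Ties between the two non-matching branches go to "up".

```python
def lcs_tables(a: str, b: str):
    """Fill the len and prev tables row by row, as in LCS-Length."""
    m, n = len(a), len(b)
    # Row 0 and column 0 stay at 0 (empty prefix).
    length = [[0] * (n + 1) for _ in range(m + 1)]
    prev = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
                prev[i][j] = "diag"
            elif length[i - 1][j] >= length[i][j - 1]:
                length[i][j] = length[i - 1][j]
                prev[i][j] = "up"
            else:
                length[i][j] = length[i][j - 1]
                prev[i][j] = "left"
    return length, prev


length, prev = lcs_tables("president", "providence")
print(length[-1][-1])  # 6
print(length[5])       # row for 'i': [0, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3]
```

The table takes O(mn) time and space to fill.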
7
i  j:     0   1   2   3   4   5   6   7   8   9  10
              p   r   o   v   i   d   e   n   c   e
0         0   0   0   0   0   0   0   0   0   0   0
1  p      0   1   1   1   1   1   1   1   1   1   1
2  r      0   1   2   2   2   2   2   2   2   2   2
3  e      0   1   2   2   2   2   2   3   3   3   3
4  s      0   1   2   2   2   2   2   3   3   3   3
5  i      0   1   2   2   2   3   3   3   3   3   3
6  d      0   1   2   2   2   3   4   4   4   4   4
7  e      0   1   2   2   2   3   4   5   5   5   5
8  n      0   1   2   2   2   3   4   5   6   6   6
9  t      0   1   2   2   2   3   4   5   6   6   6
Figure: Computing the LCS of president and providence with LCS-Length.
8
procedure Output-LCS(A, prev, i, j)
if i = 0 or j = 0 then return
if prev(i, j) = "↖" then
    Output-LCS(A, prev, i-1, j-1)
    print a_i
else if prev(i, j) = "↑" then Output-LCS(A, prev, i-1, j)
else Output-LCS(A, prev, i, j-1)
9
Figure: The traceback route of Output-LCS on the len table for president and providence (the same table as in the LCS-Length figure); the dark-shaded cells (priden) mark the LCS.
Output: priden
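The traceback can be sketched as follows. This is an iterative variant of Output-LCS that returns the string instead of printing it, which is a design choice for the example rather than what the slide shows. The helper `build_prev` and the labels "diag", "up", and "left" are hypothetical names used only here.

```python
def build_prev(a: str, b: str):
    """Fill the direction table as in LCS-Length (ties go to 'up')."""
    m, n = len(a), len(b)
    length = [[0] * (n + 1) for _ in range(m + 1)]
    prev = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                length[i][j], prev[i][j] = length[i - 1][j - 1] + 1, "diag"
            elif length[i - 1][j] >= length[i][j - 1]:
                length[i][j], prev[i][j] = length[i - 1][j], "up"
            else:
                length[i][j], prev[i][j] = length[i][j - 1], "left"
    return prev


def output_lcs(a: str, prev, i: int, j: int) -> str:
    """Follow prev from (i, j) back to a border and collect the matched characters."""
    chars = []
    while i > 0 and j > 0:
        if prev[i][j] == "diag":
            chars.append(a[i - 1])  # a_i is part of the LCS
            i, j = i - 1, j - 1
        elif prev[i][j] == "up":
            i -= 1
        else:
            j -= 1
    # Characters were collected from the end of the LCS to its start.
    return "".join(reversed(chars))


a, b = "president", "providence"
print(output_lcs(a, build_prev(a, b), len(a), len(b)))  # priden
```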
10
Identification of Common Molecular Subsequences
T. F. Smith and M. S. Waterman
J. Mol. Biol. (1981), 147, 195-197
11
ABSTRACT
• The identification of maximally homologous subsequences among sets of long sequences is an important problem.
• The goal is to find a pair of segments, one from each of two long sequences, such that no other pair of segments has greater similarity.
12
Algorithm
• The two molecular sequences are $A = a_1 a_2 \ldots a_n$ and $B = b_1 b_2 \ldots b_m$.
• A similarity s(a, b) is given between sequence elements a and b.
• Deletions of length k are given weight $W_k$.
• Set up a matrix H. First set
$H_{k0} = H_{0l} = 0$ for $0 \le k \le n$ and $0 \le l \le m$.
13
Algorithm cont.
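• $H_{ij}$ is the maximum similarity of two segments ending in $a_i$ and $b_j$.
• These values are obtained from the relationship

$$
H_{ij} = \max\left\{ H_{i-1,j-1} + s(a_i, b_j),\ \max_{k \ge 1}\{H_{i-k,j} - W_k\},\ \max_{l \ge 1}\{H_{i,j-l} - W_l\},\ 0 \right\}, \quad 1 \le i \le n,\ 1 \le j \le m.
$$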
14
The value of $H_{ij}$ follows by considering the possibilities for ending the segments at any $a_i$ and $b_j$:
• (1) If $a_i$ and $b_j$ are associated, the similarity is $H_{i-1,j-1} + s(a_i, b_j)$.
• (2) If $a_i$ is at the end of a deletion of length k, the similarity is $H_{i-k,j} - W_k$.
• (3) If $b_j$ is at the end of a deletion of length l, the similarity is $H_{i,j-l} - W_l$.
• (4) Finally, a zero is included to prevent calculated negative similarity, indicating no similarity up to $a_i$ and $b_j$.
15
• The pair of segments with maximum similarity is found by first locating the maximum element of H.
• The other matrix elements leading to this maximum value are then sequentially determined with a traceback procedure ending with an element of H equal to zero.
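The matrix fill with general deletion weights can be sketched as below. The code only fills H and locates its maximum element, which is where the traceback would start; the traceback itself is omitted. The function names are hypothetical. The match and mismatch values follow the Figure 1 example, while the deletion weight W_k = 1 + k/3 is an assumption made for this illustration, since the slides do not state it.

```python
def smith_waterman_matrix(a, b, s, w):
    """Fill H for sequences a (length n) and b (length m).

    s(x, y) is the similarity of two elements; w(k) is the weight of a
    deletion of length k. Returns H, its maximum value, and that cell.
    """
    n, m = len(a), len(b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]  # H[k][0] = H[0][l] = 0
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [0.0, H[i - 1][j - 1] + s(a[i - 1], b[j - 1])]
            # a_i at the end of a deletion of length k
            candidates += [H[i - k][j] - w(k) for k in range(1, i + 1)]
            # b_j at the end of a deletion of length l
            candidates += [H[i][j - l] - w(l) for l in range(1, j + 1)]
            H[i][j] = max(candidates)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return H, best, best_pos


def similarity(x, y):
    return 1.0 if x == y else -1.0 / 3.0


def gap_weight(k):
    return 1.0 + k / 3.0  # assumed weight, not given on the slides


H, best, end = smith_waterman_matrix("TTACGTT", "GGACGGG", similarity, gap_weight)
print(best, end)  # 3.0 (5, 5): the shared segment ACG ends at a_5 and b_5
```

Scanning every k and l makes each cell cost O(n + m), so this direct form runs in O(nm(n + m)) time.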
16
• In the example of Figure 1, a match ($a_i = b_j$) has $s(a_i, b_j) = 1$ and a mismatch has $s(a_i, b_j) = -1/3$.
17
Local vs. global alignment
18
Global Alignment vs. Local Alignment
• global alignment:
• local alignment:
19
Global Alignment vs. Local Alignment
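Local:

$$
s_{i,j} = \max
\begin{cases}
0 \\
s_{i-1,j} + w(a_i, -) \\
s_{i,j-1} + w(-, b_j) \\
s_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$

Global:

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j} + w(a_i, -) \\
s_{i,j-1} + w(-, b_j) \\
s_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$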
20
Local alignment example
Match: 8
Mismatch: -5
Gap symbol: -3

          C   G   G   A   T   C   A   T
      0   0   0   0   0   0   0   0   0
  C   0   8   5   2   0   0   8   5   2
  T   0   5   3   0   0   8   5   3  13
  T   0   2   0   0   0   8   5   2  11
  A   0   0   0   0   8   5   3  13  10
  A   0   0   0   0   8   5   2  11   8
  C   0   8   5   2   5   3  13  10   7
  T   0   5   3   0   2  13  10   8  18

Best local alignment (score 8 - 3 + 8 - 3 + 8 = 18):
A - C - T
A T C A T
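The local alignment example can be reproduced with a short sketch. It fills the matrix with the local recurrence (including the zero term), starts the traceback at the maximum cell, and stops at the first zero. The function name `local_align` and its parameter names are assumptions for this illustration; the scores are the ones given in the example.

```python
def local_align(a, b, match=8, mismatch=-5, gap=-3):
    """Local alignment with a fixed score per gap symbol."""
    m, n = len(a), len(b)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(0, s[i - 1][j - 1] + w, s[i - 1][j] + gap, s[i][j - 1] + gap)
            if s[i][j] > best:
                best, best_pos = s[i][j], (i, j)

    # Traceback: start at the maximum cell, stop at the first zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and s[i][j] > 0:
        w = match if a[i - 1] == b[j - 1] else mismatch
        if s[i][j] == s[i - 1][j - 1] + w:   # aligned pair
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif s[i][j] == s[i - 1][j] + gap:   # a_i against a gap
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:                                # b_j against a gap
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("CTTAACT", "CGGATCAT"))  # (18, 'A-C-T', 'ATCAT')
```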
21
global alignment
• Needleman-Wunsch (1970)
• Three steps in dynamic programming
  – Initialization
  – Matrix fill (scoring)
  – Traceback (alignment)
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
22
Global alignment example 1

          C   G   G   A   T   C   A   T
      0  -3  -6  -9 -12 -15 -18 -21 -24
  C  -3   8   5   2  -1  -4  -7 -10 -13
  T  -6   5   3   0  -3   7   4   1  -2
  T  -9   2   0  -2  -5   5   2  -1   9
  A -12  -1  -3  -5   6   3   0  10   7
  A -15  -4  -6  -8   3   1  -2   8   5
  C -18  -7  -9 -11   0  -2   9   6   3
  T -21 -10 -12 -14  -3   8   6   4  14

Optimal global alignment (score 8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14):
C T T A A C - T
C G G A T C A T
23
Global alignment example 2

          G   A   A   T   C   T   G   C
      0  -3  -6  -9 -12 -15 -18 -21 -24
  C  -3  -5  -8 -11 -14  -4  -7 -10 -13
  A  -6  -8   3   0  -3  -6  -9 -12 -15
  A  -9 -11   0  11   8   5   2  -1  -4
  T -12 -14  -3   8  19  16  13  10   7
  T -15 -17  -6   5  16  14  24  21  18
  G -18  -7  -9   2  13  11  21  32  29
  A -21 -10   1  -1  10   8  18  29  27

Optimal global alignment (score -5 + 8 + 8 + 8 - 3 + 8 + 8 - 5 = 27):
C A A T - T G A
G A A T C T G C
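The three steps (initialization, matrix fill, traceback) can be sketched as follows, using the scores from the two global examples. The function name `global_align` and its parameters are assumptions for this illustration. When several optimal alignments exist, the traceback here prefers an aligned pair, then a gap in the second sequence.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Global alignment with a fixed score per gap symbol."""
    m, n = len(a), len(b)
    # Initialization: a prefix aligned entirely against gap symbols.
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        s[i][0] = i * gap
    for j in range(1, n + 1):
        s[0][j] = j * gap
    # Matrix fill.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + w, s[i - 1][j] + gap, s[i][j - 1] + gap)
    # Traceback from the bottom-right cell to the top-left cell.
    i, j = m, n
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            w = match if a[i - 1] == b[j - 1] else mismatch
            if s[i][j] == s[i - 1][j - 1] + w:
                top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
                continue
        if i > 0 and s[i][j] == s[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return s[m][n], "".join(reversed(top)), "".join(reversed(bottom))


print(global_align("CTTAACT", "CGGATCAT"))  # (14, 'CTTAAC-T', 'CGGATCAT')
print(global_align("CAATTGA", "GAATCTGC"))  # (27, 'CAAT-TGA', 'GAATCTGC')
```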
24
Affine gap penalties
• A gap of length k is penalized x + k·y, where x is the gap-open penalty and y is the gap-symbol penalty.
Three cases for alignment endings:
1. ...x / ...x   an aligned pair
2. ...x / ...-   a deletion
3. ...- / ...x   an insertion
25
Affine gap penalties
• Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion.
• Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion.
• Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.
26
Affine gap penalties
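$$
D(i, j) = \max
\begin{cases}
D(i-1, j) - y \\
S(i-1, j) - x - y
\end{cases}
$$

$$
I(i, j) = \max
\begin{cases}
I(i, j-1) - y \\
S(i, j-1) - x - y
\end{cases}
$$

$$
S(i, j) = \max
\begin{cases}
S(i-1, j-1) + w(a_i, b_j) \\
D(i, j) \\
I(i, j)
\end{cases}
$$

(A gap of length k is penalized x + k·y.)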
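The three recurrences translate into three tables filled together. The sketch below computes only the optimal global score. The function name `affine_global_score`, the border handling with negative infinity, and the default values (x = 4, y = 3, match +8, mismatch -5) are assumptions chosen for this illustration.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=8, mismatch=-5, gap_open=4, gap_symbol=3):
    """Optimal global score when a gap of length k is penalized x + k*y."""
    m, n = len(a), len(b)
    x, y = gap_open, gap_symbol
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a deletion
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with an insertion
    S[0][0] = 0
    # Borders: a prefix aligned against one long gap.
    for i in range(1, m + 1):
        D[i][0] = max(D[i - 1][0] - y, S[i - 1][0] - x - y)
        S[i][0] = D[i][0]
    for j in range(1, n + 1):
        I[0][j] = max(I[0][j - 1] - y, S[0][j - 1] - x - y)
        S[0][j] = I[0][j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Extend an open gap (-y) or open a new one (-x - y).
            D[i][j] = max(D[i - 1][j] - y, S[i - 1][j] - x - y)
            I[i][j] = max(I[i][j - 1] - y, S[i][j - 1] - x - y)
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + w, D[i][j], I[i][j])
    return S[m][n]


# One match (+8) and one gap of length 2 (-(4 + 2*3) = -10).
print(affine_global_score("AAA", "A"))  # -2
```

Keeping D and I separate from S is what lets the recurrence distinguish extending a gap from opening one while still filling only O(mn) cells.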
27
Affine gap penalties
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
• Each gap is charged an extra gap-open penalty: -4.

C - - - T T A A C T
C G G A T C A - - T

Column scores: +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12
Gap-open penalties: -4 for each of the two gaps
Alignment score: 12 - 4 - 4 = 4
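The arithmetic for a given alignment can be checked with a small scorer. This sketch scores an alignment that is already written out with "-" as the gap symbol; it does not search for the best alignment. The function name `affine_alignment_score` and its parameter names are assumptions for this illustration.

```python
def affine_alignment_score(top, bottom, match=8, mismatch=-5, gap_symbol=-3, gap_open=-4):
    """Score a gapped alignment: per-column scores plus one gap-open charge per gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    prev_gap_row = None  # row holding the gap in the previous column, if any
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gap symbols")
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            score += gap_symbol
            if row != prev_gap_row:  # this column starts a new gap
                score += gap_open
            prev_gap_row = row
        else:
            score += match if x == y else mismatch
            prev_gap_row = None
    return score


print(affine_alignment_score("C---TTAACT", "CGGATCA--T"))  # 4
```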
28
END