fa08cse182 cse 182-l2:blast & variants i dynamic programming vbafna
TRANSCRIPT
![Page 1: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/1.jpg)
CSE182FA08
CSE 182-L2:Blast & variants IDynamic Programming
www.cse.ucsd.edu/classes/fa09/cse182
www.cse.ucsd.edu/~vbafna
![Page 2: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/2.jpg)
CSE182FA08
Notes
• Assignment 1 is online, due next Tuesday.• Discussion section is optional. Use it as a resource.• On the web-site, you’ll find some questions on
lectures. Ideally, you should be able to answer the questions after attending these lectures (Not all of these are trivial, so please study them carefully).
![Page 3: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/3.jpg)
CSE182FA08
Searching Sequence databases
http://www.ncbi.nlm.nih.gov/BLAST/
![Page 4: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/4.jpg)
CSE182FA08
Query:
>gi|26339572|dbj|BAC33457.1| unnamed protein product [Mus musculus]MSSTKLEDSLSRRNWSSASELNETQEPFLNPTDYDDEEFLRYLWREYLHPKEYEWVLIAGYIIVFVVALIGNVLVCVAVWKNHHMRTVTNYFIVNLSLADVLVTITCLPATLVVDITETWFFGQSLCKVIPYLQTVSVSVSVLTLSCIALDRWYAICHPLMFKSTAKRARNSIVVIWIVSCIIMIPQAIVMECSSMLPGLANKTTLFTVCDEHWGGEVYPKMYHICFFLVTYMAPLCLMILAYLQIFRKLWCRQIPGTSSVVQRKWKQQQPVSQPRGSGQQSKARISAVAAEIKQIRARRKTARMLMVVLLVFAICYLPISILNVLKRVFGMFTHTEDRETVYAWFTFSHWLVYANSAANPIIYNFLSGKFREEFKAAFSCCLGVHHRQGDRLARGRTSTESRKSLTTQISNFDNVSKLSEHVVLTSISTLPAANGAGPLQNWYLQQGVPSSLLSTWLEV
• What is the function of this sequence?• Is there a human homolog?• Which cellular organelle does it work in? (Secreted/membrane
bound)• Idea: Search a database of known proteins to see if you can
find similar sequences which have a known function
![Page 5: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/5.jpg)
CSE182FA08
Querying with Blast
![Page 6: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/6.jpg)
CSE182FA08
Blast Output
• The output (Blastp query) is a series of protein sequences, ranked according to similarity with the query
• Each database hit is aligned to a subsequence of the query
![Page 7: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/7.jpg)
CSE182FA08
Blast Output 1
query26
19 405
422
Schematic
db
Q beg
S beg
Q end
S end
S Id
![Page 8: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/8.jpg)
CSE182FA08
Blast Output 2 (drosophila)
Q beg
S beg
Q end
S end
S Id
![Page 9: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/9.jpg)
CSE182FA08
The technological question
• How do we measure similarity between sequences?
• Percent identity?
• Number of sequence edit operations?• Implies a notion of alignment that includes indels
• Technology question: Given two sequences, how do we compute a ‘good’ alignment? What is ‘good’?
A T C A A C GT C A A T G G T
A T C A A - C G -- T C A A T G G T
![Page 10: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/10.jpg)
CSE182FA08
The biology question• How do we interpret these results?
– Similar sequence in the 3 species implies that the common ancestor of the 3 had an ancestral form of that sequence.
– The sequence accumulates mutations over time. These mutations may be indels, or substitutions.
• A ‘good’ alignment might be one in which many residues are identical. However, – Hum and mus diverged more recently and so the
sequences are more likely to be similar.– Paralogs can create big problems
hum
mus
dros
hummus?
?
![Page 11: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/11.jpg)
CSE182FA08
Computing alignments
• What is an alignment?• 2Xm table. • Each sequence is a row, with interspersed gaps• Columns describe the edit operations
• What is the score of an alignment?• Score every column, and sum up the scores. Let C be the score
function for the column• How do we compute the alignment with the best score?
A A - T C G G A
A C T C G - A
![Page 12: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/12.jpg)
CSE182FA08
Optimum scoring alignments, and score of optimum alignment
• Instead of computing an optimum scoring alignment, we attempt to compute the score of an optimal alignment.
• Later, we will show that the two are equivalent
![Page 13: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/13.jpg)
CSE182FA08
Computing Optimal Alignment score
• Observations: The optimum alignment has nice recursive properties:– The alignment score is the sum of the scores of columns. – If we break off at cell k, the left part and right part must be
optimal sub-alignments.– The left part contains prefixes s[1..i], and t[1..j] for some i
and some j (we don’t know the values of i and j).
1 21
k
ts
![Page 14: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/14.jpg)
CSE182FA08
Optimum prefix alignments
• Consider an optimum alignment of the prefix s[1..i], and t[1..j]
• Look at the last cell, indexed by k. It can only have 3 possibilities.
1 kst
![Page 15: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/15.jpg)
CSE182FA08
3 possibilities for rightmost cell
1. s[i] is aligned to t[j]
2. s[i] is aligned to ‘-’
3. t[j] is aligned to ‘-’
s[i]
t[j]
s[i]t[j]
Optimum alignment of s[1..i-1], and t[1..j-1]
Optimum alignment of s[1..i-1], and t[1..j]
Optimum alignment of s[1..i], and t[1..j-1]
![Page 16: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/16.jpg)
CSE182FA08
Optimal score of an alignment
• Let S[i,j] be the score of an optimal alignment of the prefix s[1..i], and t[1..j]. It must be one of 3 possibilities.
s[i]t[j]
Optimum alignment of s[1..i-1], and t[1..j-1]
s[i]
-
Optimum alignment of s[1..i-1], and t[1..j]
-
Optimum alignment of s[1..i], and t[1..j-1]
t[j]
S[i,j] = C(si,tj)+S(i-1,j-1)
S[i,j] = C(si,-)+S(i-1,j)
S[i,j] = C(-,tj)+S(i,j-1)
![Page 17: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/17.jpg)
CSE182FA08
Optimal alignment score
• Which prefix pairs (i,j) should we use? For now, simply use all.
• If the strings are of length m, and n, respectively, what is the score of the optimal alignment?
€
S[i, j] = maxS[i −1, j −1] +C(si, t j )S[i −1, j] +C(si,−)S[i, j −1] +C(−, t j )
⎧ ⎨ ⎪
⎩ ⎪
![Page 18: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/18.jpg)
CSE182FA08
Sequence Alignment
• Recall: Instead of computing the optimum alignment, we are computing the score of the optimum alignment
• Let S[i,j] denote the score of the optimum alignment of the prefix s[1..i] and t [1..j]
![Page 19: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/19.jpg)
CSE182FA08
An O(nm) algorithm for score computation
• The iteration ensures that all values on the right are computed in earlier steps.€
S[i, j] = maxS[i −1, j −1] +C(si, t j )S[i −1, j] +C(si,−)S[i, j −1] +C(−, t j )
⎧ ⎨ ⎪
⎩ ⎪
For i = 1 to n For j = 1 to m
![Page 20: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/20.jpg)
CSE182FA08
Base case (Initialization)
€
S[0,0] = 0
S[i,0] =C(si,−) + S[i −1,0] ∀ i
S[0, j] =C(−,s j ) + S[0, j −1] ∀ j
![Page 21: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/21.jpg)
CSE182FA08
A tableaux approach
s
n
1
i
1 j n
€
MS[i,j-1] S[i,j]
S[i-1,j]S[i-1,j-1]
t
Cell (i,j) contains the score S[i,j]. Each cell only looks at 3 neighboring cells
€
S[i, j] = maxS[i −1, j −1] +C(si, t j )S[i −1, j] +C(si,−)S[i, j −1] +C(−, t j )
⎧ ⎨ ⎪
⎩ ⎪
![Page 22: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/22.jpg)
CSE182FA08
An Example
• Align s=TCAT with t=TGCAA• Match Score = 1• Mismatch score = -1, Indel Score = -1• Score A1?, Score A2?
T C A T -T G C A A
T C A TT G C A A
A1 A2
![Page 23: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/23.jpg)
CSE182FA08
0 -1 -2 -3 -4 -5
-1 1 0 -1 -2 -3
-2 0 0 1 0 -1
-3 -1 -1 0 2 1
-4 -2 -2 -1 1 1
T G C A A
T
C
A
T
Alignment Table
![Page 24: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/24.jpg)
CSE182FA08
1 1 -1 -2 -2 -4
1 2 0 -1 -1 -3
-1 0 1 0 0 -2
-3 -2 -1 0 1 -1
-5 -4 -3 -2 -1 0
T G C A A
T
C
A
T
Alignment Table
• S[4,5] = 1 is the score of an optimum alignment
• Therefore, A2 is an optimum alignment
• We know how to obtain the optimum Score. How do we get the best alignment?
![Page 25: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/25.jpg)
CSE182FA07
Computing Optimum Alignment
• At each cell, we have 3 choices• We maintain additional information to record the choice
at each step.
For i = 1 to n For j = 1 to m
€
S[i, j] = maxS[i −1, j −1] +C(si, t j )S[i −1, j] +C(si,−)S[i, j −1] +C(−, t j )
⎧ ⎨ ⎪
⎩ ⎪
If (S[i,j]= S[i-1,j-1] + C(si,tj)) M[i,j] =
If (S[i,j]= S[i-1,j] + C(si,-)) M[i,j] =
If (S[i,j]= S[i,j-1] + C(-,tj) ) M[i,j] =
j-1
i-1
j
i
![Page 26: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/26.jpg)
CSE182FA07
T G C A A
T
C
A
T 1 1 -1 -2 -2 -4
1 2 0 -1 -1 -3
-1 0 1 0 0 -2
-3 -2 -1 0 1 -1
-5 -4 -3 -2 -1 0
Computing Optimal Alignments
![Page 27: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/27.jpg)
CSE182FA07
Retrieving Opt.Alignment
1 1 -1 -2 -2 -4
1 2 0 -1 -1 -3
-1 0 1 0 0 -2
-3 -2 -1 0 1 -1
-5 -4 -3 -2 -1 0
T G C A A
T
C
A
T
• M[4,5]= Implies that
S[4,5]=S[3,4]+C(A,T) or
A
T
M[3,4]= Implies that
S[3,4]=S[2,3] +C(A,A) or
A
T
A
A
1 2 3 4 5
1
3
2
4
![Page 28: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/28.jpg)
CSE182FA08
Retrieving Opt.Alignment
1 1 -1 -2 -2 -4
1 2 0 -1 -1 -3
-1 0 1 0 0 -2
-3 -2 -1 0 1 -1
-5 -4 -3 -2 -1 0
T G C A A
T
C
A
T
• M[2,3]= Implies that
S[2,3]=S[1,2]+C(C,C) or
A
T
M[1,2]= Implies that
S[1,2]=S[1,1] +C(-,G) or
A
T
A
A
A
A
C
C
C
C
-
GT
T
1 2 3 4 5
1
3
2
4
![Page 29: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/29.jpg)
CSE182FA08
Algorithm to retrieve optimal alignment
RetrieveAl(i,j)if (M[i,j] == `\’)
return (RetrieveAl (i-1,j-1) . ) else if (M[i,j] == `|’)
return (RetrieveAl (i-1,j) . )
si
tj
si
-
-
tj
else if (M[i,j] == `--’) return (RetrieveAl (i,j-1) . )
![Page 30: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/30.jpg)
CSE182FA08
Summary
• An optimal alignment of strings of length n and m can be computed in O(nm) time
• There is a tight connection between computation of optimal score, and computation of opt. Alignment– True for all DP based solutions
![Page 31: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/31.jpg)
CSE182FA08
Global versus Local Alignment
Consider s = ACCACCCCTT
t = ATCCCCACAT
ACCACCCCTT
A TCCCCACATATCCCCACAT
ACCACCCCT T
Sometimes, this is preferable
![Page 32: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/32.jpg)
CSE182FA08
Blast Outputs Local Alignment
query26
19 405
422
Schematic
db
![Page 33: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/33.jpg)
CSE182FA08
Local Alignment
• Compute maximum scoring interval over all sub-intervals (a,b), and (a’,b’)
• How can we compute this efficiently?
a
b
a’ b’
![Page 34: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/34.jpg)
CSE182FA08
Local Alignment
• Recall that in global alignment, we compute the best score for all prefix pairs s(1..i) and t(1..j).
• Instead, compute the best score for all sub-alignments that end in s(i) and t(j).
• What changes in the recurrence?
a
i
a’ j
![Page 35: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/35.jpg)
CSE182FA08
Local Alignment
• The original recurrence still works, except when the optimum score S[i,j] is negative
• When S[i,j] <0, it means that the optimum local alignment cannot include the point (i,j).
• So, we must reset the score to 0.
ii-1
jj-1
si
tj
![Page 36: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/36.jpg)
CSE182FA08
Local Alignment Trick (Smith-Waterman algorithm)
€
S[i, j] = max
0S[i −1, j −1] +C(si, t j )S[i −1, j] +C(si,−)S[i, j −1] +C(−, t j )
⎧
⎨ ⎪
⎩ ⎪
How can we compute the local alignment itself?
![Page 37: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/37.jpg)
CSE182FA07
Generalizing Gap Cost
• It is more likely for gaps to be contiguous
• The penalty for a gap of length l should be
€
go+ ge∗l
![Page 38: FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming vbafna](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c785503460f9492db10/html5/thumbnails/38.jpg)
CSE182
End of Lecture 2
FA07