sequence alignment algorithms morten nielsen department of systems biology, dtu
TRANSCRIPT
![Page 1: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/1.jpg)
SequenceAlignment Algorithms
Morten NielsenDepartment of systems biology,
DTU
![Page 2: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/2.jpg)
Why learn about alignment algorithms?• All publicly available alignment
programs do the same thing– Sequence alignment using amino acids
substitution matrices and affine gap penalties
• This is fast but not optimal– Protein alignment is done much more
accurate using sequence profiles and position specific gap penalties (price for gaps depends on the structure)
• Must implement your own alignment algorithm to do this
![Page 3: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/3.jpg)
Outline
• What you have been told is not entirely true :-)– Alignment algorithms are more complex
• The true sequence alignment algorithm story– The slow algorithm (O3)– The fast algorithm (O2)
![Page 4: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/4.jpg)
Sequence alignmentThe old Story
![Page 5: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/5.jpg)
Pairwise alignment: the solution
”Dynamic programming” (the Needleman-Wunsch algorithm)
Match score = 1Mismatch score -1Gap penalty = -2
![Page 6: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/6.jpg)
Alignment depicted as path in matrix
T C G C A
T
C
C
A
T C G C A
T
C
C
A
TCGCATC-CA
TCGCAT-CCA
Score=2
Score=0
Match score = 1Mismatch score = -1
Gap penalty = -2
![Page 7: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/7.jpg)
Alignment depicted as path in matrix
T C G C A
T
C
C
A
x
Meaning of point in matrix: all residues up to this point have been aligned (but there are many different possible paths).
Position labeled “x”: TC aligned with TC
--TC -TC TCTC-- T-C TC
![Page 8: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/8.jpg)
Dynamic programming: computation of scores
T C G C A
T
C
C
A
x
Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”).
=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
![Page 9: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/9.jpg)
Dynamic programming: computation of scores
T C G C A
T
C
C
A
x
Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”).
=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
score(x,y) = max
score(x,y-1) - gap-penalty
![Page 10: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/10.jpg)
Dynamic programming: computation of scores
T C G C A
T
C
C
A
x
Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”).
=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
score(x,y) = max
score(x,y-1) - gap-penalty
score(x-1,y-1) + substitution-score(x,y)
![Page 11: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/11.jpg)
Dynamic programming: computation of scores
T C G C A
T
C
C
A
x
Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”).
=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
score(x,y) = max
score(x,y-1) - gap-penalty
score(x-1,y-1) + substitution-score(x,y)
score(x-1,y) - gap-penalty
![Page 12: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/12.jpg)
Dynamic programming: computation of scores
T C G C A
T
C
C
A
x
Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”).
=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
Each new score is found by choosing the maximum of three possibilities. For each square in matrix: keep track of where best score came from.
Fill in scores one row at a time, starting in upper left corner of matrix, ending in lower right corner.
score(x,y) = max
score(x,y-1) - gap-penalty
score(x-1,y-1) + substitution-score(x,y)
score(x-1,y) - gap-penalty
![Page 13: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/13.jpg)
Dynamic programming: example
A C G TA 1 -1 -1 -1C -1 1 -1 -1G -1 -1 1 -1T -1 -1 -1 1
Gaps: -2
![Page 14: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/14.jpg)
Dynamic programming: example
A C G TA 1 -1 -1 -1C -1 1 -1 -1G -1 -1 1 -1T -1 -1 -1 1
Gaps: -2
![Page 15: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/15.jpg)
Dynamic programming: example
A C G TA 1 -1 -1 -1C -1 1 -1 -1G -1 -1 1 -1T -1 -1 -1 1
Gaps: -2
![Page 16: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/16.jpg)
Dynamic programming: example
A C G TA 1 -1 -1 -1C -1 1 -1 -1G -1 -1 1 -1T -1 -1 -1 1
Gaps: -2
![Page 17: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/17.jpg)
Dynamic programming: example
T C G C A: : : :T C - C A1+1-2+1+1 = 2
![Page 18: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/18.jpg)
A now the truth
![Page 19: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/19.jpg)
Dynamic programming: computation of scores
T C G C A
T
C
C
A
x
Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”).
=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
![Page 20: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/20.jpg)
Dynamic programming: example
What about j-2 and a gap extension?
![Page 21: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/21.jpg)
And now the true algorithm
V L I L P
V
L
L
P
n
m
Start from here
![Page 22: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/22.jpg)
How does it work (score matrix d)?
5
V L I L P
V
L
L
P
1 4 1 -3
1 5 2 5 -4
1 5 2 5 -4
-3 -4 -3 -4 10
Blosum 50 matrixd matrix
![Page 23: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/23.jpg)
How does it work? (The slow way O3)
V L I L P
V
L
L
P
9 3
17 10 4
10 15 5
4 5 10
0
0
0
0
000000
Start from here!
Gap-open W1 = -5Gap-extension U = -1
d matrixD matrix
![Page 24: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/24.jpg)
How does it work? (The slow way O3)
V L I L P
V
L
L
P
17 10 4
10 15 5
4 5 10
0
0
0
0
000000
• Check all positions in (green) row and column to check score for gap extension.
• CPU intensive (O3)
Gap-open W1 = -5Gap-extension U = -1
d matrixD matrix
![Page 25: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/25.jpg)
And now you!
![Page 26: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/26.jpg)
How does it work? Fill out the D matrix
20
V L I L P
V
L
L
P
14 3
11 17 10 4
9 10 15 5
2 4 5 10
0
0
0
0
000000
• Check all positions in (green) row and column to check score for gap extension.
• CPU intensive (O3)
Gap-open W1 = -5Gap-extension U = -1
d matrixD matrix
![Page 27: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/27.jpg)
How does it work (The slow way O3)
20
V L I L P
V
L
L
P
18 14 9 3
11 15 17 10 4
8 9 10 15 5
2 3 4 5 10
0
0
0
0
000000
• Check all positions in (green) row and column to check score for gap extension.
• CPU intensive (O3)
Gap-open W1 = -5Gap-extension U = -1
d matrixD matrix
![Page 28: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/28.jpg)
And now the fast algorithm (O2)
Database (m)
Qu
ery
(n)
Open a gap Extending a gap
P
Q
Affine gap penalties
![Page 29: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/29.jpg)
And now the fast algorithm (O2)
Database (m)
Qu
ery
(n)
Open a gap Extending a gap
P
Q
![Page 30: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/30.jpg)
And now the true algorithm (cont.)
Database (m)
Qu
ery
(n)
P
Q
![Page 31: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/31.jpg)
How does it work (D,Q, and P-matrices)D
P
m
n
V L I L P
V
L
L
P
-1
5 -1
0
0
0
0
000000
V L I L P
V
L
L
P
5
2 3 4 5 10
0
0
0
0
000000
W1 = -5U = -1
![Page 32: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/32.jpg)
How does it work (D,Q,P, and E-matrices)D
P
m
n
V L I L P
V
L
L
P
0 -1
5 -1
0
0
0
0
000000
V L I L P
V
L
L
P
5
2 3 4 5 10
0
0
0
0
000000
W1 = -5U = -1
![Page 33: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/33.jpg)
How does it work (D,Q, and P-matrices)D Q
m
n
V L I L P
V
L
L
P
5
-1 -1 -1 -1 -1
0
0
000000
V L I L P
V
L
L
P
5
2 3 4 5 10
0
0
0
0
000000
W1 = -5U = -1
![Page 34: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/34.jpg)
How does it work (D,Q,P, and E-matrices)D Q
m
n
V L I L P
V
L
L
P
0 5
-1 -1 -1 -1 -1
0
0
000000
V L I L P
V
L
L
P
5
2 3 4 5 10
0
0
0
0
000000
W1 = -5U = -1
![Page 35: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/35.jpg)
How does it work (D, Q, and P-matrices)D Q
m
n
V L I L P
V
L
L
P
0 5
-1 -1 -1 -1
0
0
000000
V L I L P
V
L
L
P
5
2 3 4 5 10
0
0
0
0
000000
PV L I L P
V
L
L
P
0 -1
5 -1
0
0
0
0
000000
-1
![Page 36: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/36.jpg)
How does it work (D, Q, and P-matrices)D Q
m
n
V L I L P
V
L
L
P
0 5
-1 -1 -1 -1 -1
0
0
000000
V L I L P
V
L
L
P
15 5
2 3 4 5 10
0
0
0
0
000000
PV L I L P
V
L
L
P
0 -1
5 -1
0
0
0
0
000000
![Page 37: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/37.jpg)
How does it work. Eij-matrix.(Keeping track on the path)
eij = 1 matcheij = 2 gap-opening database (move vertical) eij = 3 gap-extension database(move vertical) eij = 4 gap-opening query (move horizontal) eij = 5 gap-extension query(move horizontal)
V L I L P
V
L
L
P
2
4 1
0
0
0
0
000000
![Page 38: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/38.jpg)
The fast algorithm (cont.)
Database (m)
Qu
ery
(n)
P
Q
eij = 1 matcheij = 2 gap-opening database eij = 3 gap-extension database eij = 4 gap-opening query eij = 5 gap-extension query
E14
5
2
3
![Page 39: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/39.jpg)
How does it work (D,Q,P, and E-matrices)D Q
m
n
V L I L P
V
L
L
P
0 5
-1 -1 -1 -1 -1
0
0
000000
V L I L P
V
L
L
P
15 5
2 3 4 5 10
0
0
0
0
000000
PV L I L P
V
L
L
P
0 -1
5 -1
0
0
0
0
000000
E
V L I L P
V
L
L
P
1 2
4 1
0
0
0
0
000000
![Page 40: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/40.jpg)
How does it work (D,Q,P, and E-matrices)D Q
P
m
n
E
6
V L I L P
V
L
L
P
10 12 9 3
3 4 5 10 4
-2 -2 -1 0 5
-1 -1 -1 -1 -1
0
0
0
0
000000
1
V L I L P
V
L
L
P
1 1 3 3
5 1 1 1 3
5 1 4 1 2
5 5 5 4 1
0
0
0
0
000000
13
V L I L P
V
L
L
P
9 4 -2 -1
11 12 5 -1 -1
8 9 10 0 -1
2 3 4 5 -1
0
0
0
0
000000
20
V L I L P
V
L
L
P
18 14 9 3
11 15 17 10 4
8 9 10 15 5
2 3 4 5 10
0
0
0
0
000000
![Page 41: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/41.jpg)
And the alignment
Dm
n
20
V L I L P
V
L
L
P
18 14 9 3
11 15 17 10 4
8 9 10 15 5
2 3 4 5 10
0
0
0
0
000000
E1
V L I L P
V
L
L
P
1 1 3 3
5 1 1 1 3
5 1 4 1 2
5 5 5 4 1
0
0
0
0
000000
VLILPVL-LP
![Page 42: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/42.jpg)
And the alignment. Gap extensions
D-matrix m
n
19
V L I PL
V
L
L
P
13 17 9 3
9 14 12 10 4
7 8 9 15 5
1 2 3 5 10
0
0
0
0
000000
E1
V L I L P
V
L
L
P
1 1 3 3
1 1 1 1 3
5 1 5 1 2
5 5 5 4 1
0
0
0
0
000000
A
10
13
10
4
0
A
1
1
4
5
0
E-matrix
![Page 43: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/43.jpg)
And the alignment
Dm
n
19
V L I PL
V
L
L
P
13 17 9 3
9 14 12 10 4
7 8 9 15 5
1 2 3 5 10
0
0
0
0
000000
E1
V L I L P
V
L
L
P
1 1 3 3
1 1 1 1 3
5 1 5 1 2
5 5 5 4 1
0
0
0
0
000000
A
10
13
10
4
0
A
1
1
4
5
0
15 -5 -1 = 9?
![Page 44: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/44.jpg)
And the alignment
Dm
n
19
V L I PL
V
L
L
P
13 17 9 3
9 14 12 10 4
7 8 9 15 5
1 2 3 5 10
0
0
0
0
000000
E1
V L I L P
V
L
L
P
1 1 3 3
1 1 1 1 3
5 1 5 1 2
5 5 5 4 1
0
0
0
0
000000
A
10
13
10
4
0
A
1
1
4
5
0
5 -5 -1 = 9?
![Page 45: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/45.jpg)
And the alignment
Dm
n
19
V L I PL
V
L
L
P
13 17 9 3
9 14 12 10 4
7 8 9 15 5
1 2 3 5 10
0
0
0
0
000000
E1
V L I L P
V
L
L
P
1 1 3 3
1 1 1 1 3
5 1 5 1 2
5 5 5 4 1
0
0
0
0
000000
A
10
13
10
4
0
A
1
1
4
5
0
VLIALPVL--LP
15 -5 -1 = 9? ***
5 -5 -1 = 9?
![Page 46: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/46.jpg)
And now you!
![Page 47: Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU](https://reader035.vdocuments.us/reader035/viewer/2022062423/56649de55503460f94add50c/html5/thumbnails/47.jpg)
Summary
• Alignment is more complicated than what you have been told.
• Simple algorithmic tricks allow for alignment in O2 time
• More heuristics to improve speed– Limit gap length– Look for high scoring
regions