bio medical tics - sequence analysis- alignment- 2011
Post on 07-Apr-2018
214 views
TRANSCRIPT
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
1/96
1
D N A S E Q U E N C I N G
& S E Q U E N C E A L I G N M E N T
Sequence Analysis
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
2/96
2
Sequence Alignment
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
3/96
3
Study Goal
What is a sequence alignment?
The difference between a global and local alignment and what theuses of each are.
How to use the dot matrix methods to analyze genes andchromosomes
The steps performed by the Needleman-Wunsh and Smith-Waterman Algorithms to produce a sequence alignment
How to use scoring matrix values and gap penalties to produce asequence alignment
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
4/96
4
Sequence Alignment
Definition: the comparison of two or more sequences bysearching for a series of individual characters or characterpatterns that are in the same order in the sequences
Sequence alignment score:
Sum of the individual log odds scores for each pair of alignedsequence characters in an alignment less a penalty for each gap ofone more position
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
5/96
5
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
6/96
6
Pair-wise Sequence Alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
DefinitionGiven two strings x = x1x2...xM, y = y1y2yN,
an alignment is an assignment of gaps to positions0,, N in x, and 0,, N in y, so as to line up eachletter in one sequence with either a letter, or a gapin the other sequence
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
7/96
7
What is a good alignment?
AGGCTAGTT, AGCGAAGTTT
AGGCTAGTT- 6 matches, 3 m ismatches, 1 gap
AGCGAAGTTT
AGGCTA-GTT- 7 m atches, 1 m isma tch, 3 gaps
AG-CGAAGTTT
AGGC-TA-GTT- 7 m atches, 0 m isma tches, 5 gapsAG-CG-AAGTTT
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
8/96
8
Evolution at the DNA level
SEQUENCE EDITSACGGTGCAGTTACCA
AC----CAGTCCACCA
MutationDeletion
REARRANGEMENTS
InversionTranslocationDuplication
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
9/96
9
Evolutionary
Still OK?
OK
OK
OK
X
X
next generation
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
10/96
10
Scor ing Function
Sequence edits: AGGCCTC
Mutations AGGACTC
Insertions AGGGCCTC Deletions AGG . CTC
Scor ing Function: Match: +m
Mismatch: -s
Gap: -d
Score F = (# matches) m - (# mismatches) s(#gaps) d
Alternative definition:
m in im a l ed i t d i st a n ce
Given two strings x, y,find minimum # of edits
(insertions, deletions, mutations)
to transform one string to theother
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
11/96
11
Pair-wise Alignment
The alignment of two sequences (DNA or protein) is
a relatively straightforward computational problem. There are lots of possible alignments.
Two sequences can always be aligned.
Sequence alignments have to be scored. Often there is m ore than one solution with the
same score.
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
12/96
12
How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACC
CTGACCCTGGGTCACAAAACTC
Too many possible
alignments:
>> 2N
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
13/96
13
Local alignment vs. Global Alignment
Use to align the entire sequence
Best for same length sequence
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
14/96
14
Local alignment vs. Global Alignment
Use to align the similar sequence along certain length
Best for sequence sharing a conserved region or domain
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
15/96
15
What do we want alignment for?
Xenologous transferred genes
Orthologous
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
16/96
16
Matches to similar sequences
Sequence conservation Structure conservation
function conservation What is conserved in a gene [protein] family is functionally important
Due to purifying selection driven by functional constraints observable in abckground described by the theory of neutral evolution
Fast enough that pseudogenes rapidly deteriorate over evolutionarytimescale
In any prokaryotic genome, homologs from more than one distantly relatedspecies are detectable for 70 80 % of proteins
Application: Comparison of sequence/structures can identifyhomologous relationships, allowing inference of function based on thatrelationship
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
17/96
17
Methods of Alignment
By hand - slide sequences on two lines of a word
processor Dot plot
with windows
Rigorous mathematical approach Dynamic programming (slow, optimal)
Heuristic methods (fast, approximate)
BLAST and FASTAWord matching and hash tables0
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
18/96
18
Align by Hand
GATCGCCTA_TTACGTCCTGGAC AGGCATACGTA_GCCCTTTCGC
You still need some kind of scoring system to find thebest alignment
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
19/96
Dotplot19
A
T
T
C
A C
A
T
A
T A C A T T A C G T A C
Sequence 1
Sequence 2
A dotplot gives an overview of all possible alignments
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
20/96
Dotplot20
A
T
T
C
A
C A
T
A
T A C A T T A C G T A C
T A C A T T A C G T A C
A T A C A C T T A
Sequence 1
Sequence 2
One possible alignment:
In a dotplot each diagonal corresponds to a possible(ungapped) alignment
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
21/96
21Insertions / Deletions in a DotplotT
A
CT
G
T
C
A
T
T A C T G T T C A TSequence 1
Sequence 2
T A C T G - T C A T| | | | | | | | |
T A C T G T T C A T
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
22/96
22
Hemoglobin -chain
Hemoglobin
-chain
Dotplot(Window = 130 / Stringency = 9)
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
23/96
23Word Size Algorithm
TA C G G T A T G
A C AG T A T C
T A C G G T A T G
A C A G T A T C
T A C G G T A T G
A CA G T A T C
T A C G G T AT G
A C A G T AT C
C
T
A
T G
A
C
A
T A C G G T A T G
Word Size = 3
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
24/96
24
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
Score = 7
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
Score = 11
Matrix: PAM250
Window = 12
Stringency = 9
Scoring Matrix Filtering
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
Score = 11
Window / Stringency
Typical window size for DNA sequences is 15 bases, and stringency or matchrequirement in this window is 10, meaning if there are 10-15 matches within thewindow, and a dot is printed at the first base in the windows
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
25/96
25Dotplot(Window = 18 / Stringency = 10)
Hemoglobin
-chain
Hemoglobin -chain
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
26/96
Considerations
The window/stringency method is more sensitive than thewordsize method (ambiguities are permitted).
The smaller the window, the larger the weight of statistical(unspecific) matches.
With large windows the sensitivity for short sequences isreduced.
Insertions/deletions are not treated explicitly.
26
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
27/96
27
Dot Matrix
Unless two sequences are known to be very much alike, thedot matrix method should be considered as a first choice for
pair-wise sequence alignment Readily reveal the presence of insertions/deletions and
direct and inverted repeats
DNA sequence dot matrix comparison: long windows andhigh stringencies (7/11, 11/15)
Protein sequences: short windows, stringencies (1/1) exceptfor short domain of partial similaritity in not similar
sequences (15/5)
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
28/96
Programs:
DOTTER -
http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html
DOTPLOT - http://helix.nih.gov/docs/gcg/dotplot.html
PLALIGN in FASTA EMBOSS http://emboss.sourceforge.net/download/
28
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
29/96
29
Methods of Alignment
By hand - slide sequences on two lines of a wordprocessor
Dot plot
with windows
Rigor ous m athem atical appro ach Dynam ic progr am m ing (slow , optim al)
Heuristic methods (fast, approximate)
BLAST and FASTAWord matching and hash tables0
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
30/96
30
Alignment is additive
Observation:
The score of aligning x1xMy1yN
is additive
Say that x1xi xi+1xMaligns to y1yj yj+1yN
The two scores add up:
F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
31/96
31
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
32/96
Basic pr inciples of dynam ic program m ing32
- Creation of an alignment path matrix
- Stepwise calculation of score values
- Backtracking (evaluation of the optimal path)
It is applicable when a large search space can be structuredinto a succession of stages, such that the initial stage contains
trivial solutions to sub-problems, each partial solution in a later
stage can be calculated by recurring a fixed number of partial
solutions in an earlier stage, the final stage contains the overallsolution - Mount
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
33/96
33Creation of an alignment path matrix
Idea:
Build up an optimal alignment using previous solutions foroptimal alignments of smaller subsequences
Construct matrix Findexed by iandj(one index for each sequence)
F(i,j) is the score of the best alignment between the initial
segment x1...iof xup to xi and the initial segment y1...j of yup to
yj
Build F(i,j) recursively beginning with F(0,0) = 0
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
34/96
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
35/96
35
If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can
calculate F(i,j) Three possibilities:
xi and yj are aligned, F(i,j) = F(i-1,j-1) + s(xi ,yj)
xi is aligned to a gap, F(i,j) = F(i-1,j) - d yj is aligned to a gap, F(i,j) = F(i,j-1) - d
The best score up to (i,j) will be the largest of the threeoptions
Creation of an alignment path matrix
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
36/96
36
Dynam ic Program m ing
There are only a polynomial number of subproblems
Align x1
xi
to y1
yj
Original problem is one of the subproblems
Align x1xM to y1yN Each subproblem is easily solved from smaller
subproblems
???
Then, we can applyDynamic Programming!!!
Let F(i,j) = optimal score of aligning
x1xiy1yj
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
37/96
37
Example of Dynamic programming algorithm
Sequences a1, a2, a3, a4 and b1, b2, b3, b4
Align sequences in a global alignment
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
38/96
38
Example of Dynamic programming algorithm
Sequences a1, a2, a3, a4 and b1, b2, b3, b4
Align sequences in a global alignment
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
39/96
39
Example of Dynamic programming algorithm
Sequences a1, a2, a3, a4 and b1, b2, b3, b4
Align sequences in a global alignment
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
40/96
40
Example of Dynamic programming algorithm
Sequences a1, a2, a3, a4 and b1, b2, b3, b4
Align sequences in a global alignment
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
41/96
41
Example of Dynamic programming algorithm
Sequences a1, a2, a3, a4 and b1, b2, b3, b4
Align sequences in a global alignment
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
42/96
42
Example of Dynamic programming algorithm
Sequences a1, a2, a3, a4 and b1, b2, b3, b4
Align sequences in a global alignment
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
43/96
43
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
44/96
Types of Scores44
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
45/96
Scoring Matr ix / Substitution Matr ix45
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
46/96
Scoring Matrix: Example
A R N K
A 5 -2 -1 -1
R - 7 -1 3
N - - 7 0
K - - - 6
Notice that although R andK are different amino acids,
they have a positive score.
Why? They are bothpositively charged amino
acidswill not greatlychange function of protein.
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
47/96
Conservation
Amino acid changes that tend to preserve thephysicochemical properties of the original residue
Polar to polar
aspartate glutamate
Nonpolar to nonpolaralanine valine
Similarly behaving residues
leucine to isoleucine
l
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
48/96
Scoring Examples48
i i
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
49/96
49
Dynam ic Program m ing
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
50/96
50
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
51/96
51
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
52/96
52
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
53/96
53
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
54/96
54
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
55/96
55
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
56/96
56
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
57/96
57
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
58/96
58
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
59/96
Scoring systems
59
DNA Scor ing System s -very sim ple
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
60/96
DNA Scor ing System s -very sim ple60
actaccagttcatttgatacttctcaaataccattaccgtgttaactgaaaggacttaaagact
Sequence 1Sequence 2
A G C T
A 1 0 0 0
G 0 1 0 0
C 0 0 1 0
T 0 0 0 1
Match: 1Mismatch: 0
Score = 5
Protein Scoring System s
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
61/96
Protein Scoring System s61
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
Sequence 1
Sequence 2
Scoringmatrix
T:G = -2T:T = 5Score = 48
C S T P A G N D . .
C 9
S -1 4
T -1 1 5
P -3 -1 -1 7
A 0 1 0 -1 4
G -3 0 -2 -2 0 6
N -3 1 0 -2 -2 0 5
D -3 0 -1 -1 -2 -1 1 6
.
.
C S T P A G N D . .
C 9
S -1 4
T -1 1 5
P -3 -1 -1 7
A 0 1 0 -1 4
G -3 0 -2 -2 0 6
N -3 1 0 -2 -2 0 5
D -3 0 -1 -1 -2 -1 1 6
.
.
Protein Scoring System s
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
62/96
Protein Scoring System s62
Amino acids have different biochemical and physical propertiesthat influence their relative replaceability in evolution.
CP
GGA
VI
L
MF
Y
W H
K
R
EQ
DNS
TCSH
S+S
positive
charged
polar
aliphatic
aromatic
small
tiny
hydrophobic
Protein Scoring System s
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
63/96
Protein Scoring System s
Scoring m atr ices r eflect:
# of mu tations to convert one to another
chem ical similarity
observed m utation frequen cies
the pr obability of occur rence of each am ino acid
W idely used scoring m atrices:
PAM
BLOSUM
63
PAM 250
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
64/96
64
A R N D C Q E G H I L K M F P S T W Y V B Z
A 2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -3 1 1 1 -6 -3 0 2 1
R -2 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 1 2
N 0 0 2 2 -4 1 1 0 2 -2 -3 1 -2 -3 0 1 0 -4 -2 -2 4 3
D 0 -1 2 4 -5 2 3 1 1 -2 -4 0 -3 -6 -1 0 0 -7 -4 -2 5 4C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3 0 -2 -8 0 -2 -3 -4
Q 0 1 1 2 -5 4 2 -1 3 -2 -2 1 -1 -5 0 -1 -1 -5 -4 -2 3 5
E 0 -1 1 3 -5 2 4 0 1 -2 -3 0 -2 -5 -1 0 0 -7 -4 -2 4 5
G 1 -3 0 1 -3 -1 0 5 -2 -3 -4 -2 -3 -5 0 1 0 -7 -5 -1 2 1
H -1 2 2 1 -3 3 1 -2 6 -2 -2 0 -2 -2 0 -1 -1 -3 0 -2 3 3
I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 2 -2 2 1 -2 -1 0 -5 -1 4 -1 -1
L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 -3 4 2 -3 -3 -2 -2 -1 2 -2 -1K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 0 -5 -1 0 0 -3 -4 -2 2 2
M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 0 -2 -2 -1 -4 -2 2 -1 0
F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 -5 -3 -3 0 7 -1 -3 -4
P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 1 0 -6 -5 -1 1 1
S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 1 -2 -3 -1 2 1
T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 -5 -3 0 2 1
W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 0 -6 -4 -4
Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 -2 -2 -3
V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 0 0
B 2 1 4 5 -3 3 4 2 3 -1 -2 2 -1 -3 1 2 2 -4 -2 0 6 5
Z 1 2 3 4 -4 5 5 1 3 -1 -1 2 0 -4 1 1 1 -4 -3 0 5 6
PAM 250
C
-8 17
W
W
PAM
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
65/96
65
Dayhoff Matr ix
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
66/96
y66
Dayhoff Matr ix
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
67/96
y67
Dayhoff Matr ix
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
68/96
y68
BLOSUM Matrix
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
69/96
69
BLOSUM Matrix
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
70/96
70
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
Conserved blocks in alignments
BLOSUM Matrix
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
71/96
71
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
Conserved blocks in alignments
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
72/96
72
TIPS on choosing a scoring matrix
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
73/96
73
S o c oos g a sco g at
Generally, BLOSUM matrices perform better than PAM matricesfor local similarity searches (Henikoff & Henikoff, 1993).
When comparing closely related proteins one should use lowerPAM or higher BLOSUM matrices, for distantly related proteinshigher PAM or lower BLOSUM matrices.
For database searching the commonly used matrix is BLOSUM62.
Significance of alignm ent
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
74/96
74
Significance of alignm ent
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
75/96
75
Database Sear ching
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
76/96
76
Database Sear ching
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
77/96
77
Local vs. Global Alignm ent
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
78/96
Global Alignm ent
Local Alignm entbetter alignm ent to find
conserved segment
--T-CC-C-AGT-TATGT-CAGGGGACACGA-GCATGCAGA-GAC
| || | || | | | ||| || | | | | |||| |AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-CAGAT--C
tccCAGTTATGTCAGgggacacgagcatgcagagac
||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
Global vs. Local Alignm ent
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
79/96
79
Global Alignment Needleman and Wunsch (1970)
Local Alignment Smith and Waterman (1981a)
Global Alignment
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
80/96
needle (Needleman & Wunsch) creates an end-to-end
alignment.
Two closely related sequences:
Global Alignment
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
81/96
Two sequences sharing several regions of local similarity:
1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG.... 67
1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG 70
|||||||||||||| | | | |||| || | | | ||
The Needlem an-W unsch Matr ix
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
82/96
82
x1 xMy1
yN
Every non-decreasing
path
from (0,0) to (M, N)
corresponds toan alignment
of the two sequences
An optimal alignment is composed ofoptimal subalignments
The Needlem an-W unsch Algor ithm
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
83/96
831. Initialization.
F(0, 0) = 0 , F(0, j) = - j d, F(i, 0)= - i d
2. Main Iteration. Filling-in partial alignmentsa. For each i = 1M
For each j = 1N
F(i-1,j-1) + s(xi, yj ) [case 1]
F(i, j) = max F(i-1, j) d [case 2]
F(i, j-1) d [case 3]
DIAG, if [case 1]
Ptr(i,j) = LEFT, if [case 2]
UP, if [case 3]
3. Termination. F(M, N) is the optimal score, and
from Ptr(M, N) can trace back optimal alignment
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
84/96
Local Alignment (Smith-Waterman)
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
85/96
Local alignment
Identify the most similar sub-region shared between two
sequences Smith-Waterman
The local alignm ent problem
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
86/96
86
Given two strings x = x1xM,
y = y1yN
Find substrings x, ywhose similarity
(optimal global alignment value)
is maximum
x = aaaacccccggggtta
y = ttcccgggaaccaacc
87
W hy local alignm ent examples
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
87/96
87
Genes areshuffled between
genomes
Portions ofproteins(domains) areoften conserved
88
Cross-species genom e sim ilar ity
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
88/96
88
98% of genes are conserved between any two mammals
>70% average similarity in protein sequence
hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001
mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001
rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938
fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174
hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001
rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938
fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174
hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001
mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001
rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938
fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174
hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001
mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001
rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938
fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174
atoh enhancer inhuman, mouse, rat,fugu fish
89
The Sm ith-W aterm an algor ithm
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
89/96
89
Termination:
1. If we want the best local alignment
FOPT = maxi,j F(i, j)Find FOPT and trace back
2. If we want all local alignments scoring > t
For all i, j find F(i, j) > t, and trace backComplicated by overlapping local alignments
(WatermanEggert 87: find all non-overlapping local alignments with minimal
recalculation of the DP matrix)
90
Sm ith-w aterm an algor ithm
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
90/96
90
Basic Local Alignm ent Search Tool
91
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
91/96
91
BLAST Algor ithm
92
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
92/96
92
BLAST Algor ithm
93
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
93/96
93
BLAST Algor ithm
94
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
94/96
94
BLAST Algor ithm
95
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
95/96
95
Basic BLAST Algorithms
96
-
8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011
96/96