sequence alignment - diunitobotta/didattica/sequencealignments.pdf · gap penalty models • convex...
Post on 19-Oct-2020
7 Views
Preview:
TRANSCRIPT
1
1
Sequence Alignment
Marco BottaDipartimento di Informatica
Università di Torinobotta@di.unito.it
www.di.unito.it/~botta/didattica/
2Università di Torino
Sequence Comparison
Much of bioinformatics involves sequences• DNA sequences• RNA sequences• Protein sequencesWe can think of these sequences as strings of
letters• DNA & RNA: alphabet of 4 letters• Protein: alphabet of 20 letters
3Università di Torino
Sequence Comparison (cont)
• Finding similarity between sequences isimportant for many biological questions
For example:• Find genes/proteins with common origin
– Allows to predict function & structure
• Locate common subsequences in genes/proteins– Identify common “motifs”
• Locate sequences that might overlap– Help in sequence assembly
2
4Università di Torino
Sequence Alignment
Input: two sequences over the same alphabetOutput: an alignment of the two sequencesExample:• GCGCATGGATTGAGCGA
• TGCGCCATTGATGACCA
A possible alignment:-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
5Università di Torino
Alignments
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements:• Perfect matches• Mismatches• Insertions & deletions (indel)
6Università di Torino
Choosing Alignments
There are many possible alignmentsFor example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Which one is better?
3
7Università di Torino
Scoring Alignments
Rough intuition:• Similar sequences evolved from a common ancestor• Evolution changed the sequences from this
ancestral sequence by mutations:– Replacements: one letter replaced by another– Deletion: deletion of a letter– Insertion: insertion of a letter
• Scoring of sequence similarity should examine howmany operations took place
8Università di Torino
Simple Scoring Rule
Score each position independently:• Match: +1• Mismatch: -1• Indel -2Score of an alignment is sum of positional
scores
9Università di Torino
Example
Example:-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1x13) + (-1x2) + (-2x4) = 3
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score: (+1x5) + (-1x6) + (-2x12) = -25
4
10Università di Torino
More General Scores
• The choice of +1,-1, and -2 scores was quitearbitrary
• Depending on the context, some changesare more plausible than others– Exchange of an amino-acid by one with similar
properties (size, charge, etc.)vs.– Exchange of an amino-acid by one with
opposite properties
11Università di Torino
• We define a scoring function by specifyinga function
– σ(x,y) is the score of replacing x by y– σ(x,-) is the score of deleting x– σ(-,x) is the score of inserting x
• The score of an alignment is the sum ofposition scores
Additive Scoring Rules
ℜ−∪Σ×−∪Σ �}){(}){(:σ
12Università di Torino
Edit Distance
• The edit distance between two sequences isthe “cost” of the “cheapest” set of editoperations needed to transform onesequence into the other
• Computing edit distance between twosequences almost equivalent to finding thealignment that minimizes the distance
nment)score(aligmax),d( & of alignment 21 ss21 ss =
5
13Università di Torino
Computing Edit Distance
• How can we compute the edit distance??– If |s| = n and |t| = m, there are more than
alignments
• The additive form of the score allows toperform dynamic programming tocompute edit distance efficiently
+m
nm
14Università di Torino
Recursive Argument
• Suppose we have two sequences:s[1..n+1] and t[1..m+1]
The best alignment must be in one of three cases:1. Last position is (s[n+1],t[m +1] )2. Last position is (s[n +1],-)3. Last position is (-, t[m +1] )
])[],[(])..[],..,[(])..[],..[(1mt1ns
m1tn1sd1m1t1n1sd++
+=++σ
15Università di Torino
Recursive Argument
• Suppose we have two sequences:s[1..n+1] and t[1..m+1]
The best alignment must be in one of three cases:1. Last position is (s[n+1],t[m +1] )2. Last position is (s[n +1],-)3. Last position is (-, t[m +1] )
)],[(])..[],..,[(])..[],..[(
−+++=++
1ns1m1tn1sd1m1t1n1sd
σ
6
16Università di Torino
Recursive Argument
• Suppose we have two sequences:s[1..n+1] and t[1..m+1]
The best alignment must be in one of three cases:1. Last position is (s[n+1],t[m +1] )2. Last position is (s[n +1],-)3. Last position is (-, t[m +1] )
])1[,(])..1[],1..,1[(])1..1[],1..1[(
+−++=++
mtmtnsdmtnsd
σ
17Università di Torino
Recursive Argument
Define the notation:
• Using the recursive argument, we get thefollowing recurrence for V:
+−++−++++++
=++])[,(],[)],[(],[
])[],[(],[max],[
1jtj1iV1is1jiV
1jt1isjiV1j1iV
σσ
σ
])..[],..[(],[ j1ti1sdjiV =
18Università di Torino
Recursive Argument
• Of course, we also need to handle the basecases in the recursion:
])[,(],[],[)],[(],[],[
],[
1jtj0V1j0V1is0iV01iV
000V
+−+=+−++=+
=
σσ
7
19Università di Torino
Dynamic ProgrammingAlgorithm
We fill the matrix using the recurrence rule
0A1
G2
C3
0
A 1
A 2
A 3
C 4
20Università di Torino
0A1
G2
C3
0 0 -2 -4 -6
A 1 -2 1 -1 -3
A 2 -4 -1 0 -2
A 3 -6 -3 -2 -1
C 4 -8 -5 -4 -1
0A1
G2
C3
0 0 -2 -4 -6
A 1 -2 1 -1 -3
A 2 -4 -1 0 -2
A 3 -6 -3 -2 -1
C 4 -8 -5 -4 -1
Dynamic ProgrammingAlgorithm
Conclusion: d(AAAC,AGC) = -1
21Università di Torino
Reconstructing the BestAlignment
• To reconstruct thebest alignment, werecord which case inthe recursive rulemaximized the score
0A1
G2
C3
0 0 -2 -4 -6
A 1 -2 1 -1 -3
A 2 -4 -1 0 -2
A 3 -6 -3 -2 -1
C 4 -8 -5 -4 -1
8
22Università di Torino
Reconstructing the BestAlignment
• We now trace back thepath that corresponds tothe best alignment
0A1
G2
C3
0 0 -2 -4 -6
A 1 -2 1 -1 -3
A 2 -4 -1 0 -2
A 3 -6 -3 -2 -1
C 4 -8 -5 -4 -1
AAACAG-C
23Università di Torino
Reconstructing the BestAlignment
• Sometimes, more thanone alignment has thebest score
0A1
G2
C3
0 0 -2 -4 -6
A 1 -2 1 -1 -3
A 2 -4 -1 0 -2
A 3 -6 -3 -2 -1
C 4 -8 -5 -4 -1
AAACA-GC
24Università di Torino
Complexity
Space: O(mn)Time: O(mn)• Filling the matrix O(mn)• Backtrace O(m+n)
9
25Università di Torino
Local Alignment
Consider now a different question:• Can we find similar substring of s and t ?• Formally, given s[1..n] and t[1..m] find
i,j,k, and l such that d(s[i..j],t[k..l]) ismaximal
26Università di Torino
Local Alignment
• As before, we use dynamic programming• We now want to setV[i,j] to record the best
alignment of a suffix of s[1..i] and a suffixof t[1..j]
• How should we change the recurrence rule?
27Università di Torino
Local Alignment
New option:• We can start a new match instead of extend
previous alignment
+−++−++++++
=++
01jtj1iV
1is1jiV1jt1isjiV
1j1iV])[,(],[)],[(],[
])[],[(],[max],[
σσ
σ
Alignment of empty suffixes
10
28Università di Torino
]))[,(],[,max(],[))],[(],[,max(],[
],[
1jtj0V01j0V1is0iV001iV
000V
+−+=+−++=+
=
σσ
Local Alignment
• Again, we also need to handle the basecases in the recursion:
29Università di Torino
Local Alignment Example
0A1
T2
C3
T4
A5
A6
0
T 1
A 2
A 3
T 4
A 5
s = TAATAt = ATCTAA
30Università di Torino
Local Alignment Example
0T1
A2
C3
T4
A5
A6
0 0 0 0 0 0 0 0
T 1 0 1 0 0 1 0 0
A 2 0 0 2 0 0 2 1
A 3 0 0 1 1 0 1 3
T 4 0 0 0 0 2 0 1
A 5 0 0 1 0 0 3 1
s = TAATAt = TACTAA
11
31Università di Torino
Local Alignment Example
0T1
A2
C3
T4
A5
A6
0 0 0 0 0 0 0 0
T 1 0 1 0 0 1 0 0
A 2 0 0 2 0 0 2 1
A 3 0 0 1 1 0 1 3
T 4 0 0 0 0 2 0 1
A 5 0 0 1 0 0 3 1
s = TAATAt = TACTAA
32Università di Torino
Local Alignment Example
0T1
A2
C3
T4
A5
A6
0 0 0 0 0 0 0 0
T 1 0 1 0 0 1 0 0
A 2 0 0 2 0 0 2 1
A 3 0 0 1 1 0 1 3
T 4 0 0 0 0 2 0 1
A 5 0 0 1 0 0 3 1
s = TAATAt = TACTAA
33Università di Torino
Sequence Alignment
We have seen two variants of sequence alignment:• Global alignment• Local alignmentOther variants:• Finding best overlap (exercise)
All are based on the same basic idea of dynamicprogramming
12
34Università di Torino
Alignment with GapsAAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA-
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA
I
II
35Università di Torino
Gaps
• Both alignments have the same number ofmatches and spaces but…alignment IIseems better.
• Definition: A gap is any maximal,consecutive run of spaces in a single string.
• The length of the gap will be the number ofspaces in it.
• Example I has 11 gaps while example II hasonly 2 gaps.
36Università di Torino
Biological Motivation
• Number of mutational events– A single gap - due to single event that removed
a number of residues.– Each separate gap - due to distinct independent
events.• Protein structure
– Protein secondary structure consists of alphahelixes, beta sheets and loops
– Loops of varying size can lead to very similarstructure.
13
37Università di Torino
Biological Motivation
38Università di Torino
cDNA matching
• cDNA - is the sequence after splicing and editing,after the introns have been removed.
• We expect regions of high similarity separated bylong gaps.
• These gaps correspond to the introns removed bysplicing
39Università di Torino
Gap Penalty Models• Constant Model
– Gives each gap a constant weight, spaces are free– Maximize:– Time– Works well for cDNA matching
• Affine Model– There is a penalty for starting a gap and a penalty for
each space extending it.– A single gap contributes– Maximize:– Time– Most widely used
gapss WTS gii #),( '' ×+∑)(nmO
spacesgapss WWTS sgii ##),( '' ×+×+∑)(nmO
WW sg q+
14
40Università di Torino
Gap Penalty Models
• Convex model– Each extra space contributes less penalty– Gap function is convex in length– Example– Time– Better model of biology
• General model– The weight of a gap is some arbitrary– Time
qW g log+
)log( mnmO
)(qw
)( 22 mnnmO +
41Università di Torino
Example RevisedAAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA-
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCA
I
II
42Università di Torino
Indel modelAAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA-I
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCAII
Scoring Parameters:Match: +1indel: -2
-6
-6
15
43Università di Torino
Constant modelAAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA-I
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCAII
Scoring Parameters:Match: +1open gap: -2
-6
12
44Università di Torino
Affine modelAAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA-I
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCAII
Scoring Parameters:Match: +1Open Gap: -2Extend Gap:-1
-17
1
45Università di Torino
Convex modelAAC-AATTAAG-ACTAC-GTTCATGAC
A-CGA-TTA-GCAC-ACTG-T-C-GA-I
AACAATTAAGACTACGTTCATGAC---
AACAATT--------GTTCATGACGCAII
Scoring Parameters:Match: +1Open Gap: -2 Gap Length: -logn
-6
~7
16
46Università di Torino
Affine Weight Model
• We divide the possible alignments of theprefixes S1..i and T1..j into 3 types:
S_______iT_______j
S______i------T____________j
S____________iT_______j-----
A(i,j)
B(i,j)
C(i,j)
47Università di Torino
Affine Weight Model
Recurrence relations
+−−+−−+−−
=),()1,1(),()1,1(),()1,1(
max),(jisjiCjisjiBjisjiA
jiA
+−−
++−−=
s
sg
WjiBWWjiA
jiB)1,1(
)1,1(max),(
+−−
++−−=
s
sg
WjiCWWjiA
jiC)1,1(
)1,1(max),(
48Università di Torino
Affine Weight Model
Initial Conditions:
sg
sg
jWWiCjAiWWiBiA
+==
+==
)0,(),0(
)0,()0,(
Complexity•Time: O(nm) we compute 3 matrices.•Space: O(nm)
Optimal alignment :)},(),,(),,(max{),( mnCmnBmnAmnV =
17
49Università di Torino
Affine Weight Model
This model has a natural explanation as afinite state automata.
A
B
C
S(i,j)
Ws
Ws
Wg+Ws
S(i,j)
S(i,j)
Wg+Ws
50Università di Torino
Alignment in Real Life
• One of the major uses of alignments is tofind sequences in a “database”
• Such collections contain massive number ofsequences (order of 106)
• Finding homologies in these databases withdynamic programming can take too long
51Università di Torino
Heuristic Search
• Instead, most searches relay on heuristicprocedures
• These are not guaranteed to find the best match• Sometimes, they will completely miss a high-
scoring match
We now describe the main ideas used by some ofthese procedures– Actual implementations often contain additional tricks and hacks
18
52Università di Torino
Basic Intuition
• Almost all heuristic search procedure arebased on the observation that real-lifematches often contain long strings with gap-less matches
• These heuristic try to find significant gap-less matches and then extend them
53Università di Torino
Banded DP
• Suppose that we have two strings s[1..n] andt[1..m] such that n≈m
• If the optimal alignment of s and t has few gaps,then path of the alignment will be close todiagonal s
t
54Università di Torino
Banded DP
• To find such a path, it suffices tosearch in a diagonal region ofthe matrix
• If the diagonal band has width k,then the dynamic programmingstep takes O(kn)
• Much faster than O(n2) ofstandard DP
s
t k
19
55Università di Torino
Banded DP
Problem:• If we know that t[i..j] matches the query s,
then we can use banded DP to evaluatequality of the match
• However, we do not know this apriori!
How do we select which sequences to alignusing banded DP?
56Università di Torino
FASTA Overview
Main idea:• Find potential diagonals & evaluate them• Suppose that we have a relatively long gap-
less matchAGCGCCATGGATTGAGCGA
TGCGACATTGATCGACCTA
• Can we find “clues” that will let us find itquickly?
57Università di Torino
Signature of a Match
Assumption: good matches contain several“patches” of perfect matches
AGCGCCATGGATTGAGCGA
TGCGACATTGATCGACCTA
Since this is a gap-less alignment,all perfect match regionsshould be on one diagonal
s
t
20
58Università di Torino
FASTA
• Given s and t, and a parameter k• Find all pairs (i,j) such that s[i..i+k]=t[j..j+k]• Locate sets of pairs that are on the same diagonal
– By sorting according to i-j
• Compute score for the diagonal that containall of these pairs
s
t
59Università di Torino
FASTA
Postprocessing steps:– Find highest scoring diagonal matches– Combine these to potential gapped matches– Run banded DP on the region containing these
combinations
• Most applications of FASTA use very small k(2 for proteins, and 4-6 for DNA)
60Università di Torino
FASTA Output
SCORES Init1: 1201 Initn: 1844 Opt: 1915
Smith-Waterman score: 1915; 59.3% identity in 496 aa overlap
10 20 30 40
A41264 MADKKKITASLIYAVSVAAIGSLQFGYNTGVINAPEKIIQAFYNRTL
::::|::|: || |::|||||||| ||||||:|:|: ||:|
A49158 MPSGFQQIGSEDGEPPQQRVTGTLVLAVFSAVLGSLQFGYNIGVINAPQKVIEQSYNETW
10 20 30 40 50 60
50 60 70 80 90 100
A41264 SQRSG----ETISPELLTSLWSLSVAIFSVGGMIGSFSVSLFVNRFGRRNSMLLVNVLAF
|:| :| | ||:||:||||||||||||:|| :::: : :||: :||: ||||
A49158 LGRQGPEGPSSIPPGTLTTLWALSVAIFSVGGMISSFLIGIISQWLGRKRAMLVNNVLAV
70 80 90 100 110 120
. . . . . .
21
61Università di Torino
BLAST Overview
• BLAST uses similar intuition• It relies on high scoring matches rather
than exact matches• It is designed to find alignments of a target
string s against large databases
62Università di Torino
High-Scoring Pair
• Given parameters: length k, and thresholdT• Two strings s and t of length k are a high scoring
pair (HSP) if d(s,t) > T
• Given a query s[1..n], BLAST construct all wordsw, such that w is an HSP with a k-substring of s– Note that not all substrings of s are HSPs!
• These words serve as seeds for finding longermatches
63Università di Torino
Finding Potential Matches
We can locate seed words in a large databasein a single pass
• Construct a FSA that recognizes seed words• Using hashing techniques to locate
matching words
22
64Università di Torino
Extending Potential Matches
• Once a seed is found, BLAST attempts tofind a local alignment that extends the seed
• Seeds on the same diagonalare combined (as in FASTA)
s
t
65Università di Torino
BLAST programs
• BLASTN - Nucleotide query searching a nucleotidedatabase.
• BLASTP - Protein query searching a protein database.• BLASTX - Translated nucleotide query sequence (6
frames) searching a protein database.• TBLASTN - Protein query searching a translated
nucleotide (6 frames) database.• TBLASTX - Translated nucleotide query (6 frames)
searching a translated nucleotide (6 frames) database
66Università di Torino
BLAST Search
23
67Università di Torino
BLAST Output
• List of hits– Database accession codes, name, description.– Score in bits (Usually >30 bits is significant )– Expectation value E()
• For each hit– A header including hit name, description, length– Each hit may contain several HSPs– Score and expectation value
– how many identical residues– how many residues contributing positively to the score
• The local alignment itself
68Università di Torino
BLAST Output
69Università di Torino
BLAST Output
24
70Università di Torino
BLAST Output
71Università di Torino
What do we use
• Originally Blast did not allow gaps.– Now people use gapped-Blast– Gapped blast joins different diagonals.
• For proteins Blast is superior• For nucleotides Fasta is better.
84Università di Torino
Matrici di Punteggio
• L’allineamento dipende dal punteggioassegnato alle coppie di amminoacidi(matrici di punteggio) e dalla funzione dipenalizzazione degli indels
25
85Università di Torino
Matrici di punteggio
• Matrici di sostituzione amminoacidicaovvero Tavole di confronto tra simboli
• Esistono tavole per il confronto tra proteinee tra acidi nucleici
• Sono state pubblicate moltissime matricibasate su modelli diversi
86Università di Torino
Matrici di M. Dayhoff (1978)
• Percentuale di mutazioni accettate (PAM)• Questa famiglia di matrici è basata sulla
probabilità di sostituzione, durantel’evoluzione, di un amminoacido con unaltro in sequenze proteiche omologhe
• Ciascuna matrice misura i cambiamentiattesi per un dato periodo evolutivo
87Università di Torino
Matrici PAM
• Secondo questo modello, le sostituzioniamminoacidiche osservate lungo un certoperiodo di tempo, possono essereestrapolate per periodi più lunghi
• Nel calcolo delle matrici PAM, si assumeche il cambiamento ad un certo sito siaindipendente da precedenti eventimutazionali nello stesso sito
26
88Università di Torino
Calcolo di matrici PAM
• Basato su 1572 mutazioni in 71 gruppi disequenze simili almeno all’85%
• Le mutazioni non alteranosignificativamente la funzione delleproteine (mutazioni accettate)
• Le sequenze simili vengono organizzate inalberi filogenetici dai quali vengono desuntele mutazioni
89Università di Torino
Specie A A W T V A S A V R T S I
Specie B A Y T V A A A V R T S I
Specie C A W T V A A A V L T S I
A B C
L→ R
W→ Y
90Università di Torino
Calcolo di matrici PAM
• pi = ai/atot frequenza dell’amminoacido i
• fij = n(ai→aj) numero di mutazioni ai→aj
• fi = ∑j fij numero di mutazioni di ai
• f = ∑i fi numero totale di mutazioni
27
91Università di Torino
Calcolo di matrici PAM
• Si definisca la scala come il tempo evolutivonecessario per incorporare 1 amminoacido mutatosu 100:
1 PAM
• La mutabilità relativa di ai è:
mi = fi /100·f pi
92Università di Torino
Calcolo di matrici PAM
• Se mi è la probabilità di mutazione di ai , allora
Mii = 1 – miè la probabilità di conservazione di ai
• La probabilità della mutazione ai→aj è
Mij = (fij / fi) mi
93Università di Torino
Calcolo di matrici PAM
• La matrice Mij ottenuta è una matrice ditransizione
• In generale, per avere le probabilità per kintervalli evolutivi:
Mijk
• una delle matrici più utilizzate è PAM250
28
94Università di Torino
95Università di Torino
Matrice BLOSUM (Henikoff &Henikoff, 1992)
• Blocks Amino Acid Substitution Matrices= BLOSUM
• Basata sulle sostituzioni amminoacidicheosservate in ~2000 blocchi conservati disequenze.
• Questi blocchi sono stati estratti da unabanca dati di 500 famiglie di proteine
• Sono contati gli scambi amminoacidiciosservati in ciascuna colonna
96Università di Torino
Esempio di calcolo di matriceBLOSUM
...A...
...A...
...A...
...A...
...A...
...S...
...A...
...A...
...A...
...A... • 9 A e 1 S• 36 A→A (fAA) e 9 A→S
(fAS)• 210 possibili coppie di
amminoacidi• La frequenza di A→A è
qAA = fAA/(fAA + fAS) =0.8
• La frequenza di A→S èqAS = fAS/(fAA + fAS) =0.2
29
97Università di Torino
Esempio di calcolo di matriceBLOSUM
• La frequenza attesa che A sia coinvolta inuna coppia di mutazioni è pA = (qAA +qAS/2) = 0.9
• La frequenza attesa che S sia coinvolta inuna coppia di mutazioni è pS = (qAS/2) = 0.1
• La frequenza attesa di una coppia AA è eAA= pA
2 = 0.81• La frequenza attesa di una coppia AS è eAS
= 2 pA pS = 0.18
98Università di Torino
Esempio di calcolo di matriceBLOSUM
• Il valore per la coppia AA nella matrice èqAA/eAA = 0,99 e per AS è qAS/eAS = 1.11
• I valori sono convertiti in bits:sAA = log2(qAA/eAA) = -0.04sAS = log2(qAS/eAS) = 0,30
99Università di Torino
Calcolo di matrice BLOSUM
• Per bilanciare il sovracampionamento diresidui provenienti da sequenze moltosimili, le sequenze più simili di una certasoglia (per esempio 60% identità) sonoraggruppate e gli scambi amminoacidiciinterni al gruppo vengono mediati. Lamatrice risultante si chiama BLOSUM60
• La matrice più utilizzata è la BLOSUM62
30
100Università di Torino
101Università di Torino
The Hazard of Large Databases
• Define• This is the probability that two unrelated
sequences will match with score > εεεε bychance
• Assuming that they are independent of eachother, and all are unrelated to s, we have
)|),(( UtsdPp εε >=
εεε NpN
t eptsdP −−≈−−=> 1)1(1)),((max
102Università di Torino
Local Matching
• Question: Which local alignment query is expectedto give a higher score:– To a short sequence– To a long sequence?
• A local match can begin at any of the nm entries inthe DP matrix.
• The score is the optimal of all these starting points.• If all starting points were independent we would need
to calculate the probability of attaining such a scorein nm trails.
31
103Università di Torino
Score Significance-Fasta
• How meaningful is a score?• Calculate distribution of scores and related scores
• Under reasonable assumptions the scores for un-gapped alignment behave according to theExtreme Value Distribution
104Università di Torino
Extreme Value Distribution(BLAST)
• We ask the following questions: Given a databaseof size m and a sequence of size n
• What is the expected number of hits with score atleast S? This number is called an E-score
• Notice this is a Poisson distribution.• K corrects for the dependencies• λ depends on the scoring matrix• Doubling length of sequence doubles expectation• Doubling score causes E to decrease exponentially
SKmneSE λ−=)(
105Università di Torino
Blast P-value
• Recall Poisson distribution:– Probability of finding no hits with a score >= S
– Therefore probability of finding at least one hitwith score >= S is
– This is called the P-value.
Ee−
Ee−−1
32
106Università di Torino
Protein Families
• Consider Zinc Fingers:• All have the same function:
– Bind to DNA• All have similar structure• They constitute a Protein Family• In a protein family some parts of the
sequence (the functional parts) are moreconserved than others.
107Università di Torino
Multiple Alignment
• Proteins can be classified into families:– Common structure.– Common function.– Common evolutionary origin.
• For a set of sequences belonging to some family– Each pair has some differences– But, there are some common motifs in almost all sequences
of the family• A multiple alignment carries more information than
pairwise alignment
108Università di Torino
Definition
A multiple alignment of stringsS1,S2,…,Sk is a series of strings withblanks S’1,S’2,…,S’k such that:– |S’1|=|S’2|=…=|S’k|– S’j is an extension of Sj obtained by
insertion of blanks.
33
109Università di Torino
Example
AGT..CTT.ACGCG
AGTAGCTT...GCG
..TAGC.T..GGCG
.CTA.C.TAACCCG
ACTA...TAAC...
110Università di Torino
Example
111Università di Torino
Sum of Pairs
• The sum of pairwise distances between allpairs of sequences for some scoring matrix
• Not only assumes that alignment of eachcolumn is independent, but also each pair ofsequences.– Each sequence is scored as if descended from k-
1 sequences instead of one common ancestor.
),()( li
lk
kii mmsmS ∑
<
=
34
112Università di Torino
Calculation of MultipleAlignment
• The optimal alignment can be calculatedexactly using r-dimensional dynamicprogramming.– Space complexity O(nk)– Time complexity O(2knk)
• A Heuristic Program called ClustalWquickly finds a good multiple alignment.
113Università di Torino
The ClustalW Algorithm
• Three steps:– 1 Compare all pairs of sequences to obtain a
similarity matrix– 2 Based on the similarity matrix, make a guide
tree relating all the sequences– 3 Perform progressive alignment where the
order of the alignments is determined by theguide tree
114Università di Torino
ClustalW - Score of aligning twoalignment columns
• sum the score matrix entry for all pairs ofresidues
• weight each pair by the sequences’ weights
1: peeksavtal
2: geekaavlal
3: egewglvlhv
4: aaektkirsa
Score:M( t, v)+ M( t, i)+M( l, v)+ M( l, i)
35
115Università di Torino
ClustalW - Weighting sequences
• each sequence is given a weight• groups of related sequences receive lower
weight
1: peeksavtal
2: geekaavlal
3: egewglvlhv
4: aaektkirsa
Weighted score:w1* w3* M( t, v)+w1* s4* M( t, i)+
w2* w3* M( l, v)+w2* w4* M( l, i)
116Università di Torino
ClustalW - Similarity matrix
• Distance between sequences - measure fromthe guide tree - determines which matrix touse– 80- 100% seq- id -> use Blosum80– 60- 80% seq- id -> Blosum60– 30- 60% seq- id -> Blosum45– 0- 30% seq- id -> Blosum30
117Università di Torino
ClustalW - Gap penalties
• Initial gap penalty– GOP
• Gap extension penalty– GEP
GTEAKLIVLMANE
GA---------KL
Penalty: GOP+ 8* GEP
36
118Università di Torino
ClustalWModifications of gap penalty
• Position specific penalty– gap at position
• yes -> lower GOP– no, but gap within 8 residues -> increase GOP– hydrophilic residues
• lower GOP
119Università di Torino
ClustalW - summary
• Does not use a score for the final alignment• Each pairwise alignment is done using
dynamic programming• Heuristics (e. g., gap- penalty modifications)
are used - tailored to globular proteins• Graphical version: ClustalX
120Università di Torino
Creating a PSSM
• After aligning the sequences we see thatthere are some conserved regions.
• We use the multiple alignment of Blastresults to create a Position Specific ScoringMatrix.
• This matrix represents information from awhole family, it is more strict in highlyconserved regions.
37
121Università di Torino
PSI- BLAST (Position SpecificIterated)
• BLAST provides a new automatic “profile like” search.• Iterative procedure:
– Perform BLAST on database.– Use Significant alignments to construct a “position specific”
score matrix.– This matrix replaces the query sequence in the next round of
database searching.• The program may be iterated until no new significant
alignments are found.• Most commonly used search method today.
top related