Class 5: Multiple Sequence Alignment
Multiple sequence alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--
Homologous residues are aligned together in columns.
Homologous: in the structural and evolutionary sense.
Ideally, the residues in an aligned column occupy similar 3D structural positions.
Multiple alignment – why?
Identify sequences that belong to a family.
Family: a collection of homologous sequences with similar sequence, 3D structure, function, or evolutionary history.
Find features that are conserved across the whole family: highly conserved regions, core structural elements.
The relation between the divergence of sequence and structure
[Durbin p. 137, redrawn from data in Chothia and Lesk (1986)]
Scoring a multiple alignment (1)
Important features of multiple alignment:
Some positions are more conserved than others: position-specific scoring.
Sequences are not independent (related by phylogenetic tree)
Ideally, specify a complete model of molecular sequence evolution
Scoring a multiple alignment (2)
Unfortunately, not enough data …
Assumption (1): Columns of the alignment are statistically independent.
$$S(m) = G + \sum_i S(m_i)$$
$m_i = (m_i^1, m_i^2, \ldots, m_i^N)$: column $i$ of alignment $m$
$S(m_i)$: score for column $i$
$G$: gap scoring function
Minimum entropy
Assumption (2): Symbols within columns are independent.
$c_{ia} = \sum_j \delta(m_i^j = a)$: observed count of symbol $a$ in column $i$
$p_{ia}$: the probability of symbol $a$ in column $i$
$$P(m_i) = \prod_a p_{ia}^{c_{ia}}$$
$$S(m_i) = -\sum_a c_{ia} \log p_{ia}$$
Entropy measure
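As a small illustration of the entropy measure, the sketch below scores a column as minus the sum over symbols of (count times log probability). It assumes the probabilities are estimated from the column itself as count divided by column height, and that a gap character is counted like any other symbol; both choices, and the function names, are assumptions for this example rather than something the slides specify.

```python
import math
from collections import Counter


def column_entropy_score(column):
    """Entropy-style score -sum_a c_a * log(p_a) for one alignment column.

    Assumption: p_a is the maximum-likelihood estimate c_a / N taken from
    the column itself, and '-' is treated as an ordinary symbol.
    """
    counts = Counter(column)
    n = len(column)
    return -sum(c * math.log(c / n) for c in counts.values())


def alignment_entropy_score(rows):
    """Sum the column scores of an alignment given as equal-length strings."""
    return sum(column_entropy_score(col) for col in zip(*rows))


if __name__ == "__main__":
    # A fully conserved column scores 0; more variable columns score higher.
    print(column_entropy_score("AAAA"))
    print(column_entropy_score("AACC"))
```

Lower scores mean more conserved columns, which is why the method is described as minimum entropy.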
Sum of pairs (SP)
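Columns are scored by a “sum of pairs” function, using a substitution scoring matrix:
$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$
Note: the SP score is not a proper log-odds score. For three sequences,
$$\log\frac{p_{abc}}{q_a q_b q_c} \neq \log\frac{p_{ab}}{q_a q_b} + \log\frac{p_{ac}}{q_a q_c} + \log\frac{p_{bc}}{q_b q_c}$$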
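A minimal sketch of the sum-of-pairs column score follows. The substitution function here is a toy match/mismatch scheme standing in for a real substitution matrix, and the gap handling (a fixed penalty for residue-gap pairs, zero for gap-gap pairs) is an assumption made for the example.

```python
from itertools import combinations

GAP = "-"


def toy_subst(a, b):
    """Toy substitution score (assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def sp_column_score(column, subst=toy_subst, gap_penalty=-1):
    """Sum of pairwise scores over all pairs of rows in one column."""
    total = 0
    for a, b in combinations(column, 2):
        if a == GAP and b == GAP:
            continue  # assumption: a gap-gap pair contributes nothing
        if a == GAP or b == GAP:
            total += gap_penalty
        else:
            total += subst(a, b)
    return total


def sp_alignment_score(rows, subst=toy_subst, gap_penalty=-1):
    """SP score of an alignment given as equal-length strings."""
    return sum(sp_column_score(col, subst, gap_penalty) for col in zip(*rows))
```

For example, `sp_column_score("AAC")` sums one matching pair and two mismatching pairs.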
Multidimensional DP
$$S(m) = \sum_i S(m_i)$$
Multidimensional DP
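Let $\alpha_{i_1, i_2, \ldots, i_N}$ be the maximum score of an alignment of the prefixes ending at $x^1_{i_1}, x^2_{i_2}, \ldots, x^N_{i_N}$. Then
$$\alpha_{i_1, i_2, \ldots, i_N} = \max \begin{cases}
\alpha_{i_1-1,\, i_2-1,\, \ldots,\, i_N-1} + S(x^1_{i_1}, x^2_{i_2}, \ldots, x^N_{i_N}) \\
\alpha_{i_1,\, i_2-1,\, \ldots,\, i_N-1} + S(-, x^2_{i_2}, \ldots, x^N_{i_N}) \\
\alpha_{i_1-1,\, i_2,\, \ldots,\, i_N-1} + S(x^1_{i_1}, -, \ldots, x^N_{i_N}) \\
\quad\vdots \\
\alpha_{i_1-1,\, i_2-1,\, \ldots,\, i_N} + S(x^1_{i_1}, x^2_{i_2}, \ldots, -) \\
\alpha_{i_1,\, i_2,\, i_3-1,\, \ldots,\, i_N-1} + S(-, -, x^3_{i_3}, \ldots, x^N_{i_N}) \\
\quad\vdots
\end{cases}$$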
Multidimensional DP
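$\Delta_i \in \{0, 1\}$, with
$$\Delta_i \cdot x = \begin{cases} x & \text{if } \Delta_i = 1 \\ - & \text{if } \Delta_i = 0 \end{cases}$$
$$\alpha_{i_1, \ldots, i_N} = \max_{\Delta_1 + \cdots + \Delta_N > 0} \left\{ \alpha_{i_1 - \Delta_1, \ldots, i_N - \Delta_N} + S(\Delta_1 \cdot x^1_{i_1}, \ldots, \Delta_N \cdot x^N_{i_N}) \right\}$$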
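The compact recursion over all non-zero 0/1 move vectors can be sketched directly in Python. This is an illustrative, unoptimized version that assumes a toy sum-of-pairs column score (match +1, mismatch -1, residue-gap pair -1, gap-gap pair 0); the names `msa_dp` and `sp_column` are hypothetical, and the approach is only practical for a handful of short sequences.

```python
from itertools import combinations, product

GAP = "-"


def sp_column(col, match=1, mismatch=-1, gap_penalty=-1):
    """Toy sum-of-pairs score for one column (scoring values are assumptions)."""
    score = 0
    for a, b in combinations(col, 2):
        if a == GAP and b == GAP:
            continue
        if a == GAP or b == GAP:
            score += gap_penalty
        else:
            score += match if a == b else mismatch
    return score


def msa_dp(seqs):
    """Exact multiple alignment by N-dimensional dynamic programming."""
    n = len(seqs)
    lengths = tuple(len(s) for s in seqs)
    # Every move consumes a residue from at least one sequence.
    deltas = [d for d in product((0, 1), repeat=n) if any(d)]
    alpha = {(0,) * n: 0}
    back = {}
    # product() yields cells in lexicographic order, so every predecessor
    # of a cell has already been filled in when the cell is visited.
    for cell in product(*(range(length + 1) for length in lengths)):
        if not any(cell):
            continue
        best = None
        for d in deltas:
            prev = tuple(i - di for i, di in zip(cell, d))
            if min(prev) < 0:
                continue
            col = tuple(seqs[k][cell[k] - 1] if d[k] else GAP for k in range(n))
            cand = alpha[prev] + sp_column(col)
            if best is None or cand > best:
                best = cand
                back[cell] = (prev, col)
        alpha[cell] = best
    # Trace back from the last cell to recover the aligned columns.
    cell, cols = lengths, []
    while any(cell):
        cell, col = back[cell]
        cols.append(col)
    cols.reverse()
    rows = ["".join(col[k] for col in cols) for k in range(n)]
    return alpha[lengths], rows


if __name__ == "__main__":
    score, rows = msa_dp(["ACG", "AG", "ACG"])
    print(score)
    print("\n".join(rows))
```

Storing the table in a dictionary keyed by index tuples keeps the code independent of the number of sequences, at the cost of speed.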
Complexity
Space: $O\left(\prod_{i=1}^{N} L_i\right)$
Time: $O\left(2^N \prod_{i=1}^{N} L_i\right)$
where $L_i$ is the length of sequence $i$.
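To get a feel for how quickly these bounds grow, the short sketch below counts the DP cells (the product of the sequence lengths plus one, for the boundary) and the number of cell-by-move evaluations (each cell considers up to 2^N - 1 predecessor moves). The function name `dp_cost` is illustrative only.

```python
from math import prod


def dp_cost(lengths):
    """Return (number of DP cells, number of cell-move evaluations)."""
    n = len(lengths)
    cells = prod(length + 1 for length in lengths)  # +1 for the boundary row
    moves = 2 ** n - 1                              # non-zero 0/1 move vectors
    return cells, cells * moves


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, dp_cost([100] * n))
```

Even for sequences of length 100, each extra sequence multiplies the work by roughly two hundred, which motivates the pruning and heuristic methods that follow.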
Pairwise projections of MA
MSA (i)
[Carrillo and Lipman, 1988]
$a^{kl}$: pairwise alignment between sequences $k$, $l$ (the projection of the multiple alignment $a$)
$\hat{a}^{kl}$: optimal pairwise alignment of $k$, $l$
$$S(a^{kl}) \le S(\hat{a}^{kl})$$
$\ell(a)$: lower bound on the optimal multiple alignment score
$$\ell(a) \le S(a)$$
MSA (ii)
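Since the SP score is the sum of its pairwise projections, $S(a) = \sum_{k'<l'} S(a^{k'l'})$, and each $S(a^{k'l'}) \le S(\hat{a}^{k'l'})$:
$$\ell(a) \le S(a^{kl}) - S(\hat{a}^{kl}) + \sum_{k'<l'} S(\hat{a}^{k'l'})$$
$$S(a^{kl}) \ge \beta^{kl}, \qquad \beta^{kl} = \ell(a) + S(\hat{a}^{kl}) - \sum_{k'<l'} S(\hat{a}^{k'l'})$$
$$B_{kl} = \{(i_k, i_l) \text{ s.t. the best alignment of } x^k, x^l \text{ through } (i_k, i_l) \text{ scores} \ge \beta^{kl}\}$$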
MSA (iii)
Algorithm sketch
1. Calculate $\hat{a}^{kl}$ for all pairs $k < l$.
2. Find $\ell(a)$, $\beta^{kl}$, and $B_{kl}$.
3. Use multidimensional DP to evaluate only cells $(i_1, i_2, \ldots, i_N)$ for which $(i_k, i_l) \in B_{kl}$ for all $k < l$.
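The first two steps, plus the cell test used in the third, might be sketched as follows. The sketch assumes a toy pairwise scheme (match +1, mismatch -1, linear gap -1) and takes the lower bound on the multiple alignment score as an argument supplied by the caller, for instance from a heuristic alignment. The best pairwise alignment through a cell is computed as a forward score plus a backward score; all function names are hypothetical.

```python
from itertools import combinations


def forward(x, y, gap=-1, match=1, mismatch=-1):
    """F[i][j] = best global alignment score of x[:i] with y[:j]."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        F[i][0] = i * gap
    for j in range(1, len(y) + 1):
        F[0][j] = j * gap
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F


def backward(x, y, gap=-1):
    """R[i][j] = best score of x[i:] with y[j:], via the reversed sequences."""
    rev = forward(x[::-1], y[::-1], gap)
    n, m = len(x), len(y)
    return [[rev[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]


def carrillo_lipman_sets(seqs, lower_bound, gap=-1):
    """Return optimal pairwise scores and the allowed cell set for each pair."""
    pairs = list(combinations(range(len(seqs)), 2))
    F = {p: forward(seqs[p[0]], seqs[p[1]], gap) for p in pairs}
    R = {p: backward(seqs[p[0]], seqs[p[1]], gap) for p in pairs}
    opt = {p: F[p][-1][-1] for p in pairs}          # step 1
    total = sum(opt.values())
    allowed = {}
    for (k, l) in pairs:                            # step 2
        beta = lower_bound + opt[(k, l)] - total
        allowed[(k, l)] = {
            (i, j)
            for i in range(len(seqs[k]) + 1)
            for j in range(len(seqs[l]) + 1)
            if F[(k, l)][i][j] + R[(k, l)][i][j] >= beta
        }
    return opt, allowed


def cell_allowed(cell, allowed):
    """Step 3 test: every pairwise projection of the cell must be allowed."""
    return all((cell[k], cell[l]) in cells for (k, l), cells in allowed.items())
```

A tighter lower bound shrinks the allowed sets, so fewer cells of the N-dimensional table need to be evaluated.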
Progressive alignment methods (i)
Basic idea: construct a succession of pairwise (PW) alignments.
Variations:
PW alignment order
One growing alignment or subfamilies
Alignment and scoring procedure
Progressive alignment methods (ii)
Most important heuristic: align the most similar pairs first.
Many algorithms build a “guide tree”:
Leaves: sequences
Interior nodes: alignments
Root: complete multiple alignment
Feng-Doolittle (1987)
Calculate all pairwise distances using alignment scores:
$$D = -\log S_{\mathrm{eff}} = -\log \frac{S_{\mathrm{obs}} - S_{\mathrm{rand}}}{S_{\mathrm{max}} - S_{\mathrm{rand}}}$$
Construct a guide tree using hierarchical clustering.
The highest-scoring pairwise alignment determines the sequence-to-group alignment.
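The distance used here, minus the log of the normalized score (observed minus random, divided by maximum minus random), is easy to compute once the three scores are known. The sketch below takes them as plain numbers; how the maximum and random scores are obtained (for example from self-alignments and shuffled sequences) is left outside the example, and the function names are assumptions.

```python
import math


def feng_doolittle_distance(s_obs, s_max, s_rand):
    """D = -log((S_obs - S_rand) / (S_max - S_rand))."""
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        raise ValueError("observed score is not above the random baseline")
    return -math.log(s_eff)


def closest_pair(distances):
    """Pick the pair with the smallest distance, i.e. the pair to align first.

    `distances` maps (name_a, name_b) tuples to distance values.
    """
    return min(distances, key=distances.get)


if __name__ == "__main__":
    d = {
        ("s1", "s2"): feng_doolittle_distance(90, 100, 20),
        ("s1", "s3"): feng_doolittle_distance(60, 100, 20),
        ("s2", "s3"): feng_doolittle_distance(55, 100, 20),
    }
    print(closest_pair(d))
```

Identical sequences give a distance of zero, and the distance grows without bound as the observed score approaches the random baseline.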
Profile alignment
Use profiles for group-to-sequence and group-to-group alignments.
CLUSTALW (Thompson et al., 1994):
Similar to Feng-Doolittle, but uses profile alignment methods
Numerous heuristics
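A group-to-sequence alignment can be sketched by treating each column of the existing alignment as a frequency profile and scoring a residue against it by the frequency-weighted average of pairwise scores. The code below is a simplified illustration with a toy scoring scheme (match +1, mismatch -1, flat gap -1) and none of the CLUSTALW heuristics; the function names are hypothetical.

```python
from collections import Counter

GAP = "-"


def build_profile(rows):
    """Per-column symbol frequencies (gaps included) of an alignment."""
    n = len(rows)
    return [{a: c / n for a, c in Counter(col).items()} for col in zip(*rows)]


def column_score(freqs, b, match=1, mismatch=-1, gap=-1):
    """Frequency-weighted score of residue b against one profile column."""
    return sum(
        f * (gap if a == GAP else (match if a == b else mismatch))
        for a, f in freqs.items()
    )


def align_seq_to_profile(rows, seq, gap=-1.0):
    """Align seq to the group; gaps added to the group go into every row."""
    prof = build_profile(rows)
    n, m = len(prof), len(seq)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + column_score(prof[i - 1], seq[j - 1]),
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )
    # Traceback, building the old group columns and the new row in reverse.
    i, j, group_cols, new_row = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + column_score(prof[i - 1], seq[j - 1]):
            group_cols.append([r[i - 1] for r in rows])
            new_row.append(seq[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            group_cols.append([r[i - 1] for r in rows])
            new_row.append(GAP)
            i -= 1
        else:
            group_cols.append([GAP] * len(rows))
            new_row.append(seq[j - 1])
            j -= 1
    group_cols.reverse()
    new_row.reverse()
    out = ["".join(col[k] for col in group_cols) for k in range(len(rows))]
    return out + ["".join(new_row)]


if __name__ == "__main__":
    print(align_seq_to_profile(["ACGT", "ACGT"], "AGT"))
```

Because a gap opened in the group is written into every existing row, earlier alignment decisions are never revisited, which is the behavior iterative refinement sets out to relax.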
Iterative Refinement
Addresses the “frozen” sub-alignment problem.
Iteratively realign sequences or groups to a profile of the rest.
Barton and Sternberg (1987):
1. Align the two most similar sequences.
2. Align the current profile to the most similar remaining sequence.
3. Remove each sequence and realign it to the profile of the rest.