Tutorial 4: Comparing Protein Sequences
Intro to Bioinformatics
Amino acids were not born equally
Comparing Protein Sequences
Substitution matrices
• PAM - Point Accepted Mutations
• BLOSUM - Blocks Substitution Matrix
Advanced comparison tools
• PSI-BLAST
• PHI-BLAST
Substitution Matrix
Scoring matrix $S_{20 \times 20}$ for protein (amino acid) alignment
• $S_{i,j}$ represents the gain or penalty for substituting amino acid $j$ by amino acid $i$ ($i$: row, $j$: column)
• Based on the likelihood that this substitution is found in nature
• Computed differently in PAM and BLOSUM
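A minimal sketch of how a likelihood-based score can be computed, assuming the commonly used log-odds form: the score compares how often a pair is observed in trusted alignments with how often it would occur by chance. The function name and all frequencies below are hypothetical and made up for illustration; they are not taken from any real matrix.

```python
import math

def log_odds_score(q_ij, p_i, p_j):
    """Log-odds score (in bits) for aligning residue i with residue j.

    q_ij     : observed frequency of the pair (i, j) in trusted alignments
    p_i, p_j : background frequencies of residues i and j
    """
    return math.log2(q_ij / (p_i * p_j))

# Made-up numbers: a pair seen twice as often as chance scores about +1 bit,
# a pair seen half as often as chance scores about -1 bit.
frequent_pair = log_odds_score(q_ij=0.02, p_i=0.1, p_j=0.1)
rare_pair = log_odds_score(q_ij=0.005, p_i=0.1, p_j=0.1)
print(round(frequent_pair, 3), round(rare_pair, 3))
```

Positive scores therefore mark substitutions that are seen more often than expected by chance, and negative scores mark substitutions that are seen less often.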
Computing the Probability of Mutation ($M_{i,j}$)
PAM - Point Accepted Mutations
• Based on closely related proteins (X% divergence)
• Matrices for comparing divergent proteins are computed from these
BLOSUM - Blocks Substitution Matrix
• Based on conserved blocks bounded in similarity (at least X% identical)
• Matrices for divergent proteins are derived using the appropriate X%
PAM-1
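Captures mutation rates between closely related proteins
• 1% divergence
• $M_{i,j} = \#(A \to B) / \#A$: the number of observed A-to-B substitutions divided by the number of occurrences of A
Problematic when comparing distant proteins
• The 1% divergence does not capture rarer, more sporadic mutations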
PAM250 is theoretical (extrapolation based)
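The extrapolation idea can be sketched as repeated multiplication of the PAM-1 mutation probability matrix by itself. This is an illustrative sketch only: the 3-letter alphabet, the probabilities, and the helper names are hypothetical, and the row convention (each row sums to 1) is an assumption.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Toy "PAM-1" over a made-up 3-letter alphabet: entry [a][b] is the
# probability that residue a is replaced by residue b after 1 PAM unit.
# Each residue is unchanged with probability 0.99 (about 1% divergence).
toy_pam1 = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

# Extrapolated "PAM-250": apply the 1-PAM step 250 times.
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(x, 3) for x in row])
```

Because every higher PAM matrix is derived from the same 1% data, substitutions that are absent or rare in the PAM-1 counts stay underrepresented however many times the matrix is multiplied.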
PAM-1
Captures mutation rates between divergent proteins
Why is BLOSUM62 called BLOSUM62? Because, within each block, sequences that shared at least 62% identity with any other member of their group were clustered together and counted as a single sequence when the substitution frequencies were tallied.
BLOSUM62
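The clustering step can be sketched as follows. This is a simplified illustration rather than the original BLOSUM procedure: the block rows are made-up ungapped strings, the function names are hypothetical, and giving each sequence a weight of 1/(cluster size) is one simple way to make a cluster count as a single sequence.

```python
def percent_identity(a, b):
    """Percentage of identical positions in two equal-length block rows."""
    if len(a) != len(b):
        raise ValueError("block rows must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def cluster_block(sequences, threshold=62.0):
    """Group rows so that a row joins a cluster when it is at least
    `threshold` percent identical to ANY member of that cluster."""
    clusters = []
    for seq in sequences:
        linked = [c for c in clusters
                  if any(percent_identity(seq, m) >= threshold for m in c)]
        merged = [seq]
        for c in linked:
            merged.extend(c)
        # Keep the clusters that were not linked, then add the merged one.
        clusters = [c for c in clusters if not any(c is l for l in linked)]
        clusters.append(merged)
    return clusters

# Made-up block of four ungapped rows.
block = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHAAA", "WWWWWWWWWW"]
clusters = cluster_block(block, threshold=62.0)

# Each cluster counts as one sequence: every member gets weight 1/size.
weights = {seq: 1.0 / len(c) for c in clusters for seq in c}
print(clusters)
print(weights)
```

A lower threshold merges more sequences into each cluster, so the counts are dominated by more divergent pairs; this is why a lower BLOSUM number suits more distant proteins.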
BLOSUM62
BLOSUM matrices aim to give a better measure of the differences between two proteins, specifically for more distantly related proteins.
Similar amino acids receive high scores
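As a small illustration of this point, the sketch below looks up a handful of entries from the standard BLOSUM62 table. Only a few pairs are included, the dictionary layout and function name are hypothetical, and the values should be checked against a full matrix file before being relied on.

```python
# A few BLOSUM62 entries (the matrix is symmetric, so pairs are unordered).
BLOSUM62_EXCERPT = {
    frozenset(("I", "V")): 3,   # both aliphatic, hydrophobic
    frozenset(("I", "L")): 2,   # both aliphatic, hydrophobic
    frozenset(("D", "E")): 2,   # both acidic
    frozenset(("K", "R")): 2,   # both basic
    frozenset(("F", "Y")): 3,   # both aromatic
    frozenset(("D", "L")): -4,  # acidic vs. hydrophobic
    frozenset(("G", "I")): -4,  # small vs. bulky hydrophobic
}

def excerpt_score(a, b):
    """Score for substituting a by b, using only the excerpt above."""
    return BLOSUM62_EXCERPT[frozenset((a, b))]

print(excerpt_score("I", "V"), excerpt_score("D", "L"))
```

Chemically similar residues score positive, while pairs with very different properties score negative.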
PAM & BLOSUM
PAM | BLOSUM
Based on global alignments of closely related proteins. | Based on local alignments.
PAM1 is calculated from comparisons of sequences with no more than 1% divergence. | BLOSUM62 is calculated from comparisons of sequences with at least 62% identity in the blocks.
Other PAM matrices are extrapolated from PAM1. | All BLOSUM matrices are based on observed alignments. They are not extrapolated from comparisons of closely related proteins.
Use Recommendations
Approximate equivalences, from closely related to highly divergent:
PAM100 ~ BLOSUM90 (closely related)
PAM120 ~ BLOSUM80
PAM160 ~ BLOSUM60
PAM200 ~ BLOSUM52
PAM250 ~ BLOSUM45 (highly divergent)
Query length | Matrix | Gap costs (existence, extension)
<35 | PAM30 | 9,1
35-50 | PAM70 | 10,1
50-85 | BLOSUM80 | 10,1
>85 | BLOSUM62 | 11,1
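The query-length recommendations can be written as a small lookup. This is a sketch with a hypothetical function name; because the ranges in the table share their end points (50 and 85), the code assumes that a boundary length falls into the shorter-query row.

```python
def recommend_scoring(query_length):
    """Return (matrix name, (gap existence, gap extension)) for a query length.

    Assumption: lengths exactly 50 or 85 are assigned to the shorter range.
    """
    if query_length < 35:
        return "PAM30", (9, 1)
    if query_length <= 50:
        return "PAM70", (10, 1)
    if query_length <= 85:
        return "BLOSUM80", (10, 1)
    return "BLOSUM62", (11, 1)

for length in (30, 40, 60, 400):
    print(length, recommend_scoring(length))
```

Short queries get matrices tuned for closely related sequences, since a short alignment needs strong per-position matches to stand out from chance.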
Example
Query: >ADRM1_HUMAN (Proteasomal ubiquitin receptor)
Database: nr (human)
BLAST program: BLASTP
Matrices: PAM30, BLOSUM45
What difference do we observe?
PAM30 vs. BLOSUM45
• With BLOSUM45 we found both related and divergent sequences.
• With PAM30 we found only related sequences.
PAM 30
BLOSUM45
With BLOSUM45 we can discover interesting relations between proteins
...
Mucin-13: a glycosylated membrane protein that protects the cell by binding to pathogens
Using different scoring matrices can produce slightly different alignments:
With PAM30
With BLOSUM45
The same pair of sequences can be aligned in several ways, especially when using a matrix for highly divergent sequences (BLOSUM45):
PSI-BLAST
Position Specific Iterative BLAST
We will analyze the following uncharacterized archaeal protein:
>gi|2501594|sp|Q57997|Y577_METJA PROTEIN MJ0577
MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVAGLNKSVEEFENELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGVDIIIMGSHGKTNLKEILLGSVTENVIKKSNKPVLVVKRKNS
E-value threshold for the initial BLAST search (default: 10)
E-value threshold for inclusion in PSI-BLAST iterations (default: 0.005)
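The role of the two thresholds can be sketched as a simple filter: hits under the first threshold are reported, and only the reported hits under the second, stricter threshold feed the profile for the next iteration. The `Hit` type, the hit names, and the E-values below are hypothetical, and treating a hit exactly at the threshold as passing is an assumption.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    subject_id: str
    e_value: float

def split_hits(hits, report_threshold=10.0, inclusion_threshold=0.005):
    """Return (reported hits, hits included in the next PSI-BLAST iteration)."""
    reported = [h for h in hits if h.e_value <= report_threshold]
    included = [h for h in reported if h.e_value <= inclusion_threshold]
    return reported, included

# Made-up hits for illustration.
hits = [
    Hit("hit_A", 1e-40),  # strong hit: reported and included
    Hit("hit_B", 0.002),  # below 0.005: reported and included
    Hit("hit_C", 0.5),    # reported, but too weak to build the profile
    Hit("hit_D", 25.0),   # above 10: not reported at all
]
reported, included = split_hits(hits)
print([h.subject_id for h in reported])
print([h.subject_id for h in included])
```

Keeping the inclusion threshold strict matters because a single unrelated sequence admitted into the profile can pull further unrelated hits into later iterations.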
The query itself
Orthologous sequences in two other archaeal species
Other homologous sequences
...
...
...
Is MJ0577 a filament protein?
Is MJ0577 a cationic amino acid transporter?
Is MJ0577 a universal stress protein?