recent advances in protein sequence analysis


Upload: wylie

Post on 12-Jan-2016

Page 2: Recent Advances in Protein Sequence Analysis

Recent Advances in Protein Sequence Analysis

Nick V. Grishin

Howard Hughes Medical Institute, Department of Biochemistry,

University of Texas Southwestern Medical Center at Dallas

Page 3: Recent Advances in Protein Sequence Analysis

Assembling

a toolbox for analysis of

protein molecules

Page 5: Recent Advances in Protein Sequence Analysis

History tour.

How did it all start?

Question 1: Why is it that educated people (=experts) can understand biological phenomena so much better than computers?

Question 2: Why is it that those experts are so-o-o slow at what they do best?

Question 3: Why can’t these experts teach computers to do the job right?

Maybe they don’t

Lazy?

Snobbish?

Page 7: Recent Advances in Protein Sequence Analysis

We think we are experts.

We are trying to teach computers to give correct answers – and it is hard!

Answers to what questions?

YOU KNOW …

- we have a protein sequence – what is its 3D structure?
- we have a protein 3D structure – where is the functional site?
- we have 2 sequences – what is their alignment?
- we have many related sequences – what is the tree?
etc. etc. etc.

Page 8: Recent Advances in Protein Sequence Analysis

Universal law of science: cost for an increment in improvement increases exponentially with the improvement

[Plot: Improvement (0 to 5; 5 = close to perfect) vs. Cost ($0 to $120,000)]

Page 9: Recent Advances in Protein Sequence Analysis

Universal law of science: cost for an increment in improvement increases exponentially with the improvement

[Plot: Improvement (0 to 5; 5 = close to perfect) vs. Cost ($0 to $120,000)]

first 50% with <1% cost

That’s where many researchers stop

e.g. BLAST

Page 10: Recent Advances in Protein Sequence Analysis

Universal law of science: cost for an increment in improvement increases exponentially with the improvement

[Plot: Improvement (0 to 5; 5 = close to perfect) vs. Cost ($0 to $120,000)]

first 80% with <20% cost

That’s where most researchers stop

e.g. PSI-BLAST

Page 11: Recent Advances in Protein Sequence Analysis

Our Zone

[Plot: Improvement (0 to 5; 5 = close to perfect) vs. Cost ($0 to $120,000)]

Get it close to right!

Why? Things in the most difficult zone are most interesting and they are most unexpected.

Science to us is finding the unexpected

Page 13: Recent Advances in Protein Sequence Analysis

Why do we need many tools?

Evolution

Sequence Structure

Function

Structure prediction

Evolutionary tree reconstruction

Function prediction

Page 14: Recent Advances in Protein Sequence Analysis

Main tools in the toolbox

Function prediction tools:
- Prediction of functional sites
    - universally important sites;
    - functional specificity sites.
- Evolutionary tree and ancestral sequence reconstruction.

Structure analysis tools:
- Secondary structure delineation;
- Pattern-matching structure similarity search;
- Structure alignment.

Sequence analysis tools:
- Alignment of alignments and alignment similarity search;
    - plain sequence alignments;
    - alignments with predicted sec.str.
- Multiple sequence alignment;
- Sequence space visualization.

Page 15: Recent Advances in Protein Sequence Analysis

Today’s agenda

1. COMPASS: Search for similarity between families

2. PROMALS: Multiple sequence alignment

Page 16: Recent Advances in Protein Sequence Analysis

1. COMPASS: Search for similarity between families

Ruslan Sadreyev

Page 17: Recent Advances in Protein Sequence Analysis

Comparison of multiple alignments improves similarity detection

Sequence-sequence (e.g. BLAST)

RVAGMKPRFVRSVKIVHR
vs.
QGVEGPKPAIKLRA

Alignment-sequence (e.g. PSI-BLAST)

RVAGMKPRFVRSVKIVHR
vs.
QGVEGPKPAIKLRA
EGLEGPASRFRVTV
KKVDGPPV-SRMTT

Alignment-alignment (e.g. COMPASS)

RVAGMKPRFVRSVKIVHR
IIRASKPKFTRSVTI-HR
QLVGSKPKFTRTLVT-HR
vs.
QGVEGPKPAIKLRA
EGLEGPASRFRVTV
KKVDGPPV-SRMTT

Page 18: Recent Advances in Protein Sequence Analysis

COMPASS web server: http://prodata.swmed.edu/compass

COMPASS: a method for COmparison of Multiple Protein Alignments with assessment of Statistical Significance

Sadreyev and Grishin (2003) JMB, 326: 317

Page 19: Recent Advances in Protein Sequence Analysis

Recent changes: 2007

1. New random model for profiles

2. New distribution to describe scores

Page 20: Recent Advances in Protein Sequence Analysis

Estimates of statistical significance are based on a random model of alignment comparison

[Diagram: random model -> random decoy profiles -> comparison scores S1, S2, S3, ... -> score distribution (frequency vs. score) -> E-value for an observed score S]
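The chain from decoy scores to an E-value can be sketched as follows. This is only an illustration and not the COMPASS procedure: it estimates the tail probability directly from a list of decoy scores instead of fitting an analytical distribution, and the function name, the add-one correction and the example numbers are all assumptions made for the sketch.

```python
def empirical_evalue(score, decoy_scores, db_size):
    """Estimate an E-value for `score` from scores of random (decoy) comparisons.

    P(random score >= score) is estimated from the decoy sample, with an
    add-one correction (an assumption of this sketch) so that the estimate
    is never exactly zero. The E-value is that probability multiplied by
    the number of comparisons made in the search (`db_size`).
    """
    exceed = sum(1 for s in decoy_scores if s >= score)
    p_value = (exceed + 1) / (len(decoy_scores) + 1)
    return db_size * p_value


# Hypothetical decoy scores and database size, for illustration only.
decoy_scores = [10, 12, 15, 20, 30]
evalue = empirical_evalue(18, decoy_scores, db_size=1000)
print(evalue)
```

An empirical tail like this cannot resolve probabilities much smaller than one over the number of decoys, which is one reason to describe the score distribution with a continuous density function instead.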

Page 21: Recent Advances in Protein Sequence Analysis

Old random model

Independent positions: shuffling positions makes decoy alignments

This model works very well in BLAST and PSI-BLAST,

however, maybe more realistic models work better
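A minimal sketch of the independent-positions model: a decoy alignment is made by permuting the columns of a real alignment, so each column keeps its residue content while its neighbors become random. The function name and the toy alignment are illustrative assumptions.

```python
import random


def shuffle_columns(msa, rng=None):
    """Return a decoy MSA whose columns are a random permutation of the input columns."""
    rng = rng or random.Random()
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned rows must have the same length")
    order = list(range(length))
    rng.shuffle(order)
    # Apply the same column order to every row so that columns stay intact.
    return ["".join(row[i] for i in order) for row in msa]


msa = ["RVAGMKPRFV", "IIRASKPKFT", "QLVGSKPKFT"]
decoy = shuffle_columns(msa, random.Random(0))
for row in decoy:
    print(row)
```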

Page 22: Recent Advances in Protein Sequence Analysis

Reproducing protein features:

Real secondary structure elements are used as building blocks for decoy MSA

Real MSAs -> MSA fragments -> Decoy MSAs
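The fragment-based model can be sketched as below, under loud assumptions: each fragment stands for the alignment columns of one real secondary structure element, fragments are drawn uniformly at random, and the last fragment is truncated to reach the target length. How COMPASS actually selects, weights and joins fragments is not stated on the slide, so all names and choices here are hypothetical.

```python
import random


def build_decoy(fragments, target_len, rng=None):
    """Concatenate randomly chosen fragments until `target_len` columns are collected.

    `fragments` is a list of fragments; each fragment is a list of alignment
    columns. Columns inside a fragment keep their native order, so local
    secondary-structure patterns survive in the decoy.
    """
    rng = rng or random.Random()
    decoy = []
    while len(decoy) < target_len:
        decoy.extend(rng.choice(fragments))
    return decoy[:target_len]  # the last fragment may be cut short


# Toy fragments: each column is written as a string, one residue per sequence.
fragments = [
    ["AV", "LI", "KR"],        # e.g. columns taken from a helix
    ["GG", "PA"],              # e.g. columns taken from a loop
    ["EE", "DQ", "ST", "NN"],  # e.g. columns taken from a strand
]
decoy = build_decoy(fragments, 8, random.Random(1))
print(decoy)
```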

Page 23: Recent Advances in Protein Sequence Analysis

Estimates of statistical significance are based on a random model of alignment comparison

[Diagram: random model -> random decoy profiles built from SS elements -> comparison scores S1, S2, S3, ... -> score distribution (frequency vs. score) -> E-value for an observed score S]

Page 24: Recent Advances in Protein Sequence Analysis

[Histogram: distribution of scores for random MSA comparison (frequency vs. score)]

Describe empirical distribution with a continuous density function

Page 25: Recent Advances in Protein Sequence Analysis

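Gumbel Extreme Value Distribution (EVD) is traditionally used to describe similarity scores

EVD pdf: $f(x) = C_1 \exp\left(-\frac{x-m}{s} - e^{-\frac{x-m}{s}}\right)$

m: location parameter

s: scale parameter

[Plot: frequency vs. score x; the position of the peak is set by m (~m) and its width by s (~s)]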
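For reference, the Gumbel EVD density and its upper tail can be written in a few lines. This sketch uses the standard normalized form, in which the constant in front of the exponent equals 1/s; the function names are illustrative.

```python
import math


def gumbel_pdf(x, m, s):
    """Gumbel (EVD) density with location m and scale s."""
    z = (x - m) / s
    return math.exp(-z - math.exp(-z)) / s


def gumbel_sf(x, m, s):
    """Upper-tail probability P(X >= x) = 1 - exp(-exp(-z)) for the same distribution."""
    z = (x - m) / s
    return -math.expm1(-math.exp(-z))


print(gumbel_pdf(0.0, 0.0, 1.0))  # density at the mode, exp(-1)
print(gumbel_sf(0.0, 0.0, 1.0))   # tail probability at the mode, 1 - exp(-1)
```

The tail probability is the quantity that turns a score into a P-value, which is why a poor fit of the density translates directly into unrealistic E-values.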

Page 26: Recent Advances in Protein Sequence Analysis

EVD does not fit empirical score distributions

[Plot: empirical score distribution (frequency vs. score) with the fitted EVD, $f(x) = C_1 \exp\left(-\frac{x-m}{s} - e^{-\frac{x-m}{s}}\right)$]

Page 29: Recent Advances in Protein Sequence Analysis

For data generated from the same distribution, fitting P-values are distributed uniformly

[Left panel: EVD pdf vs. score. Right panel: histogram of P-values for EVD fits (P-value 0.0 to 1.0, frequency 0.00 to 0.12)]

Page 30: Recent Advances in Protein Sequence Analysis

Scores generated by the SS-based model do not obey other standard statistical distributions

Distributions of Pearson system

Distributions of Johnson system

Inverse Gaussian (Wald) distribution

Burr

Weibull

Tukey (lambda)

Non-central chi square

Non-central t

$\chi^2$ goodness-of-fit does not pass: P-values <~ $10^{-5}$

Page 31: Recent Advances in Protein Sequence Analysis

We had to invent a new distribution

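EVD pdf: $f(x) = C_1 \exp\left(-\frac{x-m}{s} - e^{-\frac{x-m}{s}}\right)$

How?

Modify EVD!

Power EVD pdf: $f(x) = C_2 \exp(\ldots)$ (the EVD exponent in $(x-m)/s$, modified; the modified terms did not survive in this transcript)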

WOW!

Page 32: Recent Advances in Protein Sequence Analysis

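A new distribution, power EVD (PEVD), is created by modification of EVD

EVD pdf: $f(x) = C_1 \exp\left(-\frac{x-m}{s} - e^{-\frac{x-m}{s}}\right)$

PEVD pdf: $f(x) = C_2 \exp(\ldots)$ (the EVD exponent in $(x-m)/s$, modified by the shape parameters; the modified terms did not survive in this transcript)

m: location parameter

s: scale parameter

α, β: shape parameters

[Plot: frequency vs. score x; position set by m (~m), width by s (~s), shape by α, β (~α,β)]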

Page 33: Recent Advances in Protein Sequence Analysis

Power EVD precisely fits empirical score distributions

[Plot: empirical score distribution (frequency vs. score) with the fitted power EVD, $f(x) = C_2 \exp(\ldots)$]

Page 34: Recent Advances in Protein Sequence Analysis

The new random model + new distribution improve homology detection

[Diagram: a query and its database hits ordered from more to less significant E-value, each hit marked as a True Positive or a False Positive]

Page 35: Recent Advances in Protein Sequence Analysis

Benchmark: 2900 PSI-BLAST alignments for SCOP domain representatives with known relationships

[Diagram: a query and its database hits ordered from more to less significant E-value, each hit marked as a True Positive or a False Positive, with the resulting ROC curve]

The new random model + new distribution improve homology detection

Page 36: Recent Advances in Protein Sequence Analysis

Summary

• We developed a realistic random model that simulates random MSA comparison by mimicking native protein secondary structure

• We developed a precise analytical approximation of the simulated score distributions, based on a new distribution function, PEVD

• Applied to protein similarity searches, the new model produces more realistic E-values and (unexpectedly) improves homology detection

Page 37: Recent Advances in Protein Sequence Analysis

2. Towards accurate multiple sequence alignments of distantly related proteins

Jimin Pei

Page 39: Recent Advances in Protein Sequence Analysis

Multiple sequence alignment

BSUB00 RMAHYDSLTDLPNRRHAISHLTKVLNREHSLHYNTVVFFLDLNRFKVINDAL
ECU738 VMSTRDGMTGVYNRRHWETMLRNEFDNCRRHNRDATLLIIDIDHFKSINDTW
D90790 HEVGMDVLTKLLNRRFLPTIFKREIAHANRTGTPLSVLIIDVDKFKEINDTW
SYCSLL QISSLDALTQVGNRYLFDSTLEREWQRLQRIREPLALLLCDVDFFKGFNDNY
ECAE00 NIAHRDPLTNIFNRNYFFNEL--TVQSASAQKTPYCVMIMDIDHFKKVNDTW
AF0348 QAANVDSLTGLANRAAYNAHM-ERLTAADAPS--IGLLLIDVDRLKQVNDIL
D90796 IRSNMDVLTGLPGRRVLDESFDHQLRNAEPLN--LYLMLLDIDRFKLVNDTY
Y4LL_R HMARHDALTGLPNRQFLREEF-ERLSDHIAPSTRLAILCLDLDGFKAINDAY
Y07I_M YLADHDDLTGLHNRRALLQHLDQRLAPGQPGP--VAALFLDLDRLKAINDYL
……

Family A

Family B

Family C

Active site prediction experimental design

Phylogenetic analysis

Protein similarity search and classification

Structure modeling

Page 40: Recent Advances in Protein Sequence Analysis

Unaligned sequences

SKVIGWRPGE
KVIGWTGD
KICGWGVK
ARIVAYPGGT
RLISYPRTGK

Aligned sequences

SKVIGWR-PGE
-KVIGWT--GD
-KICGWG--VK
ARIVAYP-GGT
-RLISYPRTGK

Meaning of alignments

Position in an alignment:

• Homologous
• Structurally equivalent
• Similar function
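To make "position in an alignment" concrete, the sketch below reads the five aligned rows from this slide column by column; every column is one alignment position. The helper name is illustrative.

```python
def alignment_columns(rows):
    """Return the columns of an alignment, one string per position."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("aligned rows must have equal length")
    return ["".join(row[i] for row in rows) for i in range(width)]


rows = [
    "SKVIGWR-PGE",
    "-KVIGWT--GD",
    "-KICGWG--VK",
    "ARIVAYP-GGT",
    "-RLISYPRTGK",
]
columns = alignment_columns(rows)
# Removing the gap characters recovers the unaligned input sequences.
unaligned = [row.replace("-", "") for row in rows]

print(columns[5])   # the residues that occupy alignment position 6
print(unaligned)
```

Residues that share a column are the ones the alignment claims to be homologous, structurally equivalent and functionally similar.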

Page 41: Recent Advances in Protein Sequence Analysis

How is the alignment made?

ClustalW – the most widely used alignment program

Page 42: Recent Advances in Protein Sequence Analysis

ClustalW – the most widely used program

Thompson et al. (1994). http://www.ch.embnet.org/software/ClustalW.html

Page 44: Recent Advances in Protein Sequence Analysis

How accurate are these alignments?

[Bar chart: ClustalW accuracy vs. PROMALS accuracy; accuracy axis 0 to 0.4]

About 3 times better than ClustalW

Page 45: Recent Advances in Protein Sequence Analysis

PROMALS: (PROfile Multiple Alignment with predicted Local Structure)

Page 46: Recent Advances in Protein Sequence Analysis

http://prodata.swmed.edu/promals

Page 47: Recent Advances in Protein Sequence Analysis

What did we do to achieve this?

[Bar chart: ClustalW accuracy vs. PROMALS accuracy; accuracy axis 0 to 0.4]

About 3 times better than ClustalW

Page 48: Recent Advances in Protein Sequence Analysis

First of all,

ClustalW is not that bad …

Page 49: Recent Advances in Protein Sequence Analysis

ClustalW good alignment

Q2BMK3 MLAKTVREQIQRPADLVARYGGEEFIVVLPDTDEEGAMAVAGQICVAVAS
Q3A8D4 QVARMLQSVVARPGDLVARYGGEEFALILPQTD-HGAKFLGESCRAAVAG
Q36PG9 ALASILSDEVQRSGDLVARYGGEEFAILLPTTDVAGAQQVAERMRLSVAR
Q2BQL8 TVAQTIKHSIQRAQDMVCRYGGEEFVVILPETDLDGAQMIAERIRKAIAK
Q3XUK3 ALAHTISL-HLRPGDIAARYGGEEFAVVLPDTDAVSGRMIAERLRTAVEA
Q9HXT9 QVAGAIREGCSRSSDLAARYGGEEFAMVLPGTSPGGARLLAEKVRRTVES
P73713 TIGRILQSNIRGS-DIACRYGGEEMTIVLPQTSLEDTLVKAESLRQAIAS
Q36SI5 MVGDVLATCFRGS-DTVCRYGGEEFSVLMPGASLDEARQRAEQLRAAISA
Q747B7 EAAAVFRGCIRTS-DIAARYGGEEFVVIMPETTRELALLAAEKLRRAVEE
Q2DK38 KTADIIKASLRDM-DIVARYGGEEFCAILPGTSKKESIVVAERIRVGIEK

… for similar sequences

Page 51: Recent Advances in Protein Sequence Analysis

Here are distantly related sequences: diguanylate cyclase and adenylate cyclase

ClustalW alignment

1w25 ------NRRYMTGQLDSLVKRATLGGDPVSALL-------------IDIDFFKKINDTFGHDIGDEV-------LREFALRLAS
1wc4 -PEPRLITILFSDIVGFTRMSNALQSQGVAELLNEYLGEMTRAVFENQGTVDKFVGDAIMALYGAPEEMSPSEQVRRAIATARQ

1w25 NVRAI-DLPCRYGGEE-----------FVVIMPDTALADALRI-AERIRMHVSGSPFTVAHGREML--NVTISIGVSATAGEGD
1wc4 MLVALEKLNQGWQERGLVGRNEVPPVRFRCGIHQGMAVVGLFGSQERSDFTAIGPSVNIAARLQEATAPNSIMVSAMVAQYVPD

1w25 TPEALLKRADEGVYQAKASGRNAVVGKAA--
1wc4 E-----EIIKREFLELKGIDEPVMTCVINPN

sequence identity = 12%

DALI alignment based on structural comparison

1w25 NRRYMTGQLDSLVKRATLGGDPVSALLIDIDFFKKINDTFGHDIGDEVLREFALRLASNVRA-IDLP-CRYGGEEFVVIMPDT-
1wc4 -------------PEPR----LITILFSDIVGFTRMSNALQSQGVAELLNEYLGEMTRAVFENQGTVDKFVG-DAIMALYGAPE

1w25 ------ALADALRIAERIRMHVSG-SPFTVAHGREMLN------VTISIGVSATAGEGDT----------PEALLKRADEGVYQ
1wc4 EMSPSEQVRRAIATARQMLVALEKLNQGW-QERGLVGRNEVPPVRFRCGIHQGMAVVGLFGSQERSDFTAIGPSVNIAARLQEA

1w25 AKASGRNAVVGKAA---------------------------------
1wc4 TA---PNSIMVSAMVAQYVPDEEIIKREFLELKGIDEPVMTCVINPN

sequence identity = 12%

1. Pei and Grishin 2001
2. Steegborn et al. 2005
3. Holm and Sander 1998
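Sequence identity for an aligned pair can be computed as sketched below. The slide does not say which denominator it uses; this sketch assumes identical residue pairs divided by the number of columns in which both sequences have a residue, and the short example rows are made up.

```python
def sequence_identity(row_a, row_b, gap="-"):
    """Fraction of identical pairs among columns where both rows have a residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    aligned = [(x, y) for x, y in zip(row_a, row_b) if x != gap and y != gap]
    if not aligned:
        return 0.0
    return sum(x == y for x, y in aligned) / len(aligned)


# Five columns hold a residue in both rows; four of them are identical.
identity = sequence_identity("ACDEF-G", "ACE-FHG")
print(identity)
```

Other conventions divide by the alignment length or by the length of the shorter sequence, so identity values are only comparable when the denominator is the same.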

Page 52: Recent Advances in Protein Sequence Analysis

1. ClustalW alignment

1w25 ------NRRYMTGQLDSLVKRATLGGDPVSALL-------------IDIDFFKKINDTFGHDIGDEV-------LREFALRLAS
1wc4 -PEPRLITILFSDIVGFTRMSNALQSQGVAELLNEYLGEMTRAVFENQGTVDKFVGDAIMALYGAPEEMSPSEQVRRAIATARQ

1w25 NVRAI-DLPCRYGGEE-----------FVVIMPDTALADALRI-AERIRMHVSGSPFTVAHGREML--NVTISIGVSATAGEGD
1wc4 MLVALEKLNQGWQERGLVGRNEVPPVRFRCGIHQGMAVVGLFGSQERSDFTAIGPSVNIAARLQEATAPNSIMVSAMVAQYVPD

1w25 TPEALLKRADEGVYQAKASGRNAVVGKAA--
1wc4 E-----EIIKREFLELKGIDEPVMTCVINPN

2. DALI alignment based on structural comparison

1w25 NRRYMTGQLDSLVKRATLGGDPVSALLIDIDFFKKINDTFGHDIGDEVLREFALRLASNVRA-IDLP-CRYGGEEFVVIMPDT-
1wc4 -------------PEPR----LITILFSDIVGFTRMSNALQSQGVAELLNEYLGEMTRAVFENQGTVDKFVG-DAIMALYGAPE

1w25 ------ALADALRIAERIRMHVSG-SPFTVAHGREMLN------VTISIGVSATAGEGDT----------PEALLKRADEGVYQ
1wc4 EMSPSEQVRRAIATARQMLVALEKLNQGW-QERGLVGRNEVPPVRFRCGIHQGMAVVGLFGSQERSDFTAIGPSVNIAARLQEA

1w25 AKASGRNAVVGKAA---------------------------------
1wc4 TA---PNSIMVSAMVAQYVPDEEIIKREFLELKGIDEPVMTCVINPN

Red: alpha-helix; blue: beta-strand

α-helix aligned to β-strand!

Accuracy of the above ClustalW alignment: 0%
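One common way to score a sequence-based alignment against a structure-based reference is the fraction of reference residue pairs that the test alignment reproduces. The slides do not define their accuracy measure, so treating it this way is an assumption, and the toy rows below are made up.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Set of (i, j) residue-index pairs that share a column in a pairwise alignment."""
    pairs = set()
    i = j = 0
    for x, y in zip(row_a, row_b):
        if x != gap and y != gap:
            pairs.add((i, j))
        if x != gap:
            i += 1
        if y != gap:
            j += 1
    return pairs


def alignment_accuracy(test, reference):
    """Fraction of reference aligned pairs that also occur in the test alignment."""
    ref_pairs = aligned_pairs(*reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(*test) & ref_pairs) / len(ref_pairs)


reference = ("ACDE-", "-CDEF")   # stands in for a structure-based alignment
shifted = ("ACDE", "CDEF")       # every residue is paired one position off
partial = ("ACD-E", "-CDEF")     # two of the three reference pairs recovered

print(alignment_accuracy(shifted, reference))
print(alignment_accuracy(partial, reference))
```

A uniformly shifted alignment scores zero under this measure even though it looks plausible, which is how an alignment of two real homologs can reach 0% accuracy.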

Page 54: Recent Advances in Protein Sequence Analysis

ClustalW superposition

Alignment-based structural superposition

Page 55: Recent Advances in Protein Sequence Analysis

ClustalW superposition DALI superposition

Alignment-based structural superposition

Page 57: Recent Advances in Protein Sequence Analysis

ClustalW alignment accuracy

Identity range of alignments    Alignment accuracy
0-10%                           0.21
10-15%                          0.36
15-20%                          0.57
20-40%                          0.80

Tests on 1785 domain pairs from SCOP (Murzin A. et al. 1995) database.

Page 58: Recent Advances in Protein Sequence Analysis

What about other methods?

[Bar chart: alignment accuracy (0 to 1.0) by identity range (0-10%, 10-15%, 15-20%) for five methods. Bar labels: 0.33, 0.52, 0.73 and 0.21, 0.36, 0.57]

ClustalW (Thompson J. et al. 1994)

MUSCLE (Edgar R. 2004)

ProbCons (Do C. et al. 2005)

MAFFT (Katoh K. et al. 2005)

MUMMALS (Pei and Grishin 2006)

Page 59: Recent Advances in Protein Sequence Analysis

Why do we care about remote homologs,

i.e. alignments of sequence pairs with identity less than 20% ?

Page 60: Recent Advances in Protein Sequence Analysis

[Histogram: number of pairs (0 to 9000) vs. sequence identity (2 to 92)]

Sequence identity distribution for proteins with significant structural similarity (Dali Z-score >7.0) in the FSSP database (Holm and Sander, 1996)

Why do we care about remote homologs? Reason 1

Page 61: Recent Advances in Protein Sequence Analysis

Distant homologs help prediction of functional residues

• RCEs are CAAX prenyl proteases identified in eukaryotes. (Dolence et al. 2000)

                   Motif 1    Motif 2
H. sapiens         VAHFHHI    LCHSFC
M. musculus        VAHFHHI    LCHSFC
D. melanogaster    VAHLHHI    LVHAFC
S. cerevisiae      LAHAHHA    ILHALC
S. pombe           MAHIHHT    LVHAFC

• Computational methods identified distant homologs of RCEs in many bacteria. (Pei and Grishin, 2001)

                   Motif 1    Motif 2
B. halodurans      LVHFRYL    FAHFCI
B. subtilis        ALHFRYL    TAHFII
A. aeolicus        SAHLAYW    FAHFSA
L. plantarum       LAHLVNI    MLHFLD
L. plantarum       AMHLVNL    SVHWLI
B. anthracis       LFHTSQ-    AIHVLN
P. aeruginosa      ALHLLVN    LLHASI
V. cholerae        MAHFAGG    GVHFLF

• Recent mutagenesis studies confirmed our predictions. (Plummer et al. 2005)

▲ mutations result in complete loss of activity

▲ mutations do not affect enzyme activity

Why do we care about remote homologs? Reason 2

Page 62: Recent Advances in Protein Sequence Analysis

Our goal (for a few years) has been

to improve alignment quality of distantly related sequences

Page 63: Recent Advances in Protein Sequence Analysis

PROMALS – PROfile Multiple Alignment with predicted Local Structure

PROMALS input: unaligned protein sequences

PROMALS output: multiple sequence alignment

PROMALS algorithm builds alignment of distantly related sequences by utilizing three main sources:

1. Predicted secondary structure

2. Homologous sequences from database searches

3. Complex but reasonable probabilistic models

Page 64: Recent Advances in Protein Sequence Analysis

DALI structural alignment colored by real secondary structures

1w25 NRRYMTGQLDSLVKRATLGGDPVSALLIDIDFFKKINDTFGHDIGDEVLREFALRLASNVRA-IDLP-CRYGGEEFVVIMPDT-
1wc4 -------------PEPR----LITILFSDIVGFTRMSNALQSQGVAELLNEYLGEMTRAVFENQGTVDKFVG-DAIMALYGAPE

1w25 ------ALADALRIAERIRMHVSG-SPFTVAHGREMLN------VTISIGVSATAGEGDT----------PEALLKRADEGVYQ
1wc4 EMSPSEQVRRAIATARQMLVALEKLNQGW-QERGLVGRNEVPPVRFRCGIHQGMAVVGLFGSQERSDFTAIGPSVNIAARLQEA

1w25 AKASGRNAVVGKAA---------------------------------
1wc4 TA---PNSIMVSAMVAQYVPDEEIIKREFLELKGIDEPVMTCVINPN

sequence identity = 12%

Secondary structure is more conserved than sequence

DALI structural alignment colored by PSIPRED (Jones 1999) predicted secondary structures

1w25 NRRYMTGQLDSLVKRATLGGDPVSALLIDIDFFKKINDTFGHDIGDEVLREFALRLASNVRA-IDLP-CRYGGEEFVVIMPDT-
1wc4 -------------PEPR----LITILFSDIVGFTRMSNALQSQGVAELLNEYLGEMTRAVFENQGTVDKFVG-DAIMALYGAPE

1w25 ------ALADALRIAERIRMHVSG-SPFTVAHGREMLN------VTISIGVSATAGEGDT----------PEALLKRADEGVYQ
1wc4 EMSPSEQVRRAIATARQMLVALEKLNQGW-QERGLVGRNEVPPVRFRCGIHQGMAVVGLFGSQERSDFTAIGPSVNIAARLQEA

1w25 AKASGRNAVVGKAA---------------------------------
1wc4 TA---PNSIMVSAMVAQYVPDEEIIKREFLELKGIDEPVMTCVINPN

Secondary structure prediction is about 80% accurate

Source #1: secondary structure

Page 66: Recent Advances in Protein Sequence Analysis

More homologs bring up important sequence features through averaging

A profile derived from multiple sequence alignment contains position-specific information about:

(1) amino acid usage

(2) amino acid conservation

Q2BMK3 MLAKTVREQIQRPADLVARYGGEEFIVVLPDTDEEGAMAVAGQICVAVAS

With additional homologs:

Q2BMK3 MLAKTVREQIQRPADLVARYGGEEFIVVLPDTDEEGAMAVAGQICVAVAS
Q3A8D4 QVARMLQSVVARPGDLVARYGGEEFALILPQTD-HGAKFLGESCRAAVAG
Q36PG9 ALASILSDEVQRSGDLVARYGGEEFAILLPTTDVAGAQQVAERMRLSVAR
Q2BQL8 TVAQTIKHSIQRAQDMVCRYGGEEFVVILPETDLDGAQMIAERIRKAIAK
Q3XUK3 ALAHTISL-HLRPGDIAARYGGEEFAVVLPDTDAVSGRMIAERLRTAVEA
Q9HXT9 QVAGAIREGCSRSSDLAARYGGEEFAMVLPGTSPGGARLLAEKVRRTVES
P73713 TIGRILQSNIRGS-DIACRYGGEEMTIVLPQTSLEDTLVKAESLRQAIAS
Q36SI5 MVGDVLATCFRGS-DTVCRYGGEEFSVLMPGASLDEARQRAEQLRAAISA
Q747B7 EAAAVFRGCIRTS-DIAARYGGEEFVVIMPETTRELALLAAEKLRRAVEE
Q2DK38 KTADIIKASLRDM-DIVARYGGEEFCAILPGTSKKESIVVAERIRVGIEK

Cyan: invariant position; Yellow: hydrophobic position; Blue: small residues

Source #2: homologous sequences
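The two kinds of position-specific information in a profile can be sketched with raw column counts: amino acid usage as per-column frequencies, and conservation as the Shannon entropy of those frequencies (0 bits for an invariant column). This is a simplification made for illustration; sequence weighting and pseudocounts, which practical profile methods normally apply, are left out, and the toy rows only imitate the conserved RYGGEEF stretch of the alignment.

```python
import math
from collections import Counter


def column_profile(rows, gap="-"):
    """Per-column residue frequencies and Shannon entropy (bits) of an alignment."""
    profile = []
    for i in range(len(rows[0])):
        counts = Counter(row[i] for row in rows if row[i] != gap)
        total = sum(counts.values())
        freqs = {aa: n / total for aa, n in counts.items()} if total else {}
        # Low entropy means a conserved position; 0.0 means invariant.
        entropy = sum(p * math.log2(1 / p) for p in freqs.values())
        profile.append((freqs, entropy))
    return profile


rows = ["ARYGGEEF", "CRYGGEEF", "ARYGGEEM", "CRYGGEEF"]
profile = column_profile(rows)
for position, (freqs, entropy) in enumerate(profile, start=1):
    print(position, freqs, round(entropy, 3))
```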

Page 67: Recent Advances in Protein Sequence Analysis

Statistical models of profile-profile alignment

Predicted SS:  hhhhhhhhhhhhhc ccceeeeecceeeeeeccc
               ...
Added homologs:
               LKVISNRLLALVHP-EDAVCRLGGDEFALILNHT
               LVEIAGRIRSIAKD-DYVLSRSGGDEFVVVVPDC
               LVEVSERLQRALRQ-TDTVARLGGDEFLIILDQV
               LLYIGERVQAAVGE-QGQTFRRGGNEFVVLLPAV
               LRHVTERLRNFLKQ-SDILCRLSGDEFVVLRVGI
               LKYVASEIIKNIRK-TDCAVRFGGDEILVAFPDT
               LKDIARIIRESIRG-TDIAVRIGGDEFLIILPNS
Seq1:          LVRISAAIRDAVRS-RDIVVRYGGEEFLVLLTHV
Hidden states: MMMMMMMMMMMMMMYMMMMMMMMMMMMMMMMMXX
Seq2:          LNEFFRVVVDTVGRHGGFVNKFQGDAALAIFG--
Added homologs:
               LDNHDTIVCHEIQRFGGREVNTAGDGFVATFT--
               LNELFARFDKLAAENHCLRIKILGDCYYCVSG--
               LNSMYSKFDRLTSVHDVYKVETIGDAYMVVGG--
               LNIYFGKMADVITHHGGTIDEFMGDGILVLFG--
               IKTHNDIMRRQLRIYGGYEVKTEGDAFMVAFP--
               LNEYMSCMVDCIEQTGGVVDKFIGDAIMAIWG--
               ...
Predicted SS:  hhhhhhhhhhhhhhhcceeeeeecceeeeeec

Hidden Markov model

M: emit an aligned position pair
X: emit a position in first profile
Y: emit a position in second profile

Source #3: logical statistical models
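The meaning of the three hidden states can be illustrated by turning a state path back into a gapped alignment. The sketch below works on two plain sequences standing in for the two profiles; it only decodes a given path and does not implement the probabilistic model that chooses the path. The function name and the toy input are illustrative.

```python
def decode_path(seq1, seq2, states, gap="-"):
    """Build the two aligned rows implied by a path of M, X and Y states.

    M consumes one position from each sequence (an aligned pair),
    X consumes a position of the first sequence only,
    Y consumes a position of the second sequence only.
    """
    row1, row2 = [], []
    i = j = 0
    for state in states:
        if state == "M":
            row1.append(seq1[i])
            row2.append(seq2[j])
            i += 1
            j += 1
        elif state == "X":
            row1.append(seq1[i])
            row2.append(gap)
            i += 1
        elif state == "Y":
            row1.append(gap)
            row2.append(seq2[j])
            j += 1
        else:
            raise ValueError(f"unknown state {state!r}")
    if i != len(seq1) or j != len(seq2):
        raise ValueError("state path does not consume both sequences completely")
    return "".join(row1), "".join(row2)


row1, row2 = decode_path("ACDE", "ACEF", "MMXMY")
print(row1)
print(row2)
```

In the slide's example the single Y state corresponds to the gap in Seq1 and the two trailing X states to the two gaps at the end of Seq2.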

Page 68: Recent Advances in Protein Sequence Analysis

PROMALS alignment example: diguanylate cyclase and adenylate cyclase

1w25 NRRYMTGQLDSLVKRATLGGDPVSALLIDIDFFKKINDTFGHDIGDEVLREFALRLASNVRAI-DLPCRYGGEEFVVIMP----
1wc4 -----------------PEPRLITILFSDIVGFTRMSNALQSQGVAELLNEYLGEMTRAVFENQGTVDKFVGDAIMALYGAPEE

1w25 ---DTALADALRIAERIRMHVSGSPFTVAHG-----REMLNVTISIGVSAT----AGE-------------GDTPEALLKRA--
1wc4 MSPSEQVRRAIATARQMLVALEKLNQGWQERGLVGRNEVPPVRFRCGIHQGMAVVGLFGSQERSDFTAIGPSVNIAARLQEATA

1w25 -------DEGVYQAKAS-----------GRNAVVGKAA--------
1wc4 PNSIMVSAMVAQYVPDEEIIKREFLELKGIDEPVMTCVINPNMLNQ

Red: predicted alpha-helix; Blue: predicted beta-strand; *: metal-binding residues

Page 69: Recent Advances in Protein Sequence Analysis

ClustalW superposition

DALI structural superposition

PROMALS superposition

Page 70: Recent Advances in Protein Sequence Analysis

Tests on SCOP domain pairs binned by sequence identity

[Bar chart: alignment accuracy (0 to 1) by identity range (0-10%, 10-15%, 15-20%) for ClustalW, MUMMALS, SPEM and PROMALS. Bar labels in extraction order: 0.27, 0.57, 0.21, 0.73, 0.52, 0.33, 0.75, 0.62, 0.41, 0.77, 0.68, 0.46]

ClustalW and MUMMALS: methods that do not use additional homologs and predicted secondary structures.

SPEM and PROMALS: methods that use additional homologs and predicted secondary structures.

* PROMALS is statistically better than other methods (P<0.0001)

Page 72: Recent Advances in Protein Sequence Analysis

How accurate are PROMALS alignments?

[Bar chart: ClustalW accuracy vs. PROMALS accuracy; accuracy axis 0 to 0.4]

40% accuracy for sequence pairs with ~7% identity

Page 74: Recent Advances in Protein Sequence Analysis

http://prodata.swmed.edu/promals

Page 75: Recent Advances in Protein Sequence Analysis

What PROMALS does not do

It does not use explicit 3D structure modeling techniques

It uses only input sequences and internal sequence database

It predicts secondary structure from sequences, but does not build 3D models

Page 76: Recent Advances in Protein Sequence Analysis

We have a decent alignment program,

where is the catch?

SPEED (or the lack of it) !

http://prodata.swmed.edu/promals

ClustalW takes seconds to minutes per alignment

PROMALS takes minutes to hours per alignment:

average is about 30 min per family, some large families take much, much longer

Page 77: Recent Advances in Protein Sequence Analysis

We have a decent alignment program, what NOT to do with it?

GIGO (garbage in, garbage out) effects: non-homologous proteins should not be an input

Low complexity proteins should not be an input, e.g.

NQQQQQNNNSSSQQQQQQQQQQSSTTTTQQQQQQQQQNN

since the concept of an alignable position that can be traced to a common ancestor does not apply to them

Membrane proteins should be used with caution, since their amino acid composition is different, and we still have too few structures of them to test our algorithms thoroughly
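A quick pre-filter for the low-complexity case can be sketched with the Shannon entropy of a sequence's residue composition. This is not part of PROMALS; the function name is illustrative, and any cut-off for rejecting an input would have to be chosen by the user. The first example string is the low-complexity sequence from this slide and the second is the Q2BMK3 fragment shown earlier in the talk.

```python
import math
from collections import Counter


def composition_entropy(seq):
    """Shannon entropy (bits) of the residue composition of a sequence."""
    counts = Counter(seq)
    n = len(seq)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


low_complexity = "NQQQQQNNNSSSQQQQQQQQQQSSTTTTQQQQQQQQQNN"
typical = "MLAKTVREQIQRPADLVARYGGEEFIVVLPDTDEEGAMAVAGQICVAVAS"

print(round(composition_entropy(low_complexity), 2))
print(round(composition_entropy(typical), 2))
```

A sequence built from four residue types cannot exceed 2 bits, while a typical globular fragment that uses most of the 20 amino acids scores well above that.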

http://prodata.swmed.edu/promals

Page 78: Recent Advances in Protein Sequence Analysis

Acknowledgement

Our group

Lisa Kinch, Erik Nelson
Jimin Pei, Ming Tang
Sara Cheek, Yuan Qi
Shuoyong Shi, Jamie Wrabl
Indraneel M., Ruslan Sadreyev
Yong Wang, Hua Cheng
Yi Zhong, Bong-Hyun Kim
Wei Cai, Dorothee Staber

Collaborators

Eugene Koonin (NCBI, NIH)
Yuri Wolf (NCBI, NIH)
Eugene Shakhnovich (Harvard)
Andrei Osterman (Burnham)
Leszek Rychlewski (Bioinfobank, Poland)

HHMI, NIH, UTSW, The Welch Foundation