protein folding initiation site motifs chris bystroff dept of biology rensselaer polytechnic...

Protein Folding Initiation Site Motifs

Chris Bystroff

Dept of Biology

Rensselaer Polytechnic Institute, Troy, NY

ATCTGTATCGTATCGTATTTCTGGHACCCCCTGATGTAAAAGAGAGTTCTATATTACTACAACCACGATCGGATTTATTTTGGTCTADCAGCTCAGGATCATCACAGGATTCAAATCCTATCATCAGGAGGGGGGTCGTGGTTGCGCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCCCACAGTCAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGTCAGTCAGTCATACGTACGTCATGCATACGTAGCTAGCAGACGCAGCATTACGTCGCGATCGATCGATCGGCATAGCAGCATCCCAGTCAGTCATATGCATAGTCGATCGACGTCAGTCATGAGATCGTACGAAATACGTAGCTGATCGACGTCAGTCAGACTGATCGATCGGATTCAGTCACGATGCATGCTAGCAAAGTCAGCGCATGCTAGCTACGTAGTCAGTACTGCATGCAGTACGTACGTAGACGTCAGTCAGTCAGTCATGATGCTAGCTAGCTACGTCACAGTCAGTCATGACTGACTGACTGACTGCAGTCAGTCATCGATACGTAGCTAGCTACGTCAGTCATGCAGTCAGTCATTGATGATCGATCGTACATGCAGATGCCGTAGGCTAGCTAGCTAGCACTACGATGCATGCTAGCTAGCTACGACCAGTACCATGATGACTGCATGATCATACTGCCCAAAAAACGACTTAATCGTATCGTATTTCTGGHACCCCCTGATGTAAAAGAGAGTTCTATATTACTACAACCACGATCGGATTTATTTTGGTCTADCAGCTCAGGATCATCACAGGATTCAAATCCTATCATCAGGAGGGGGGTCGTGGTTGCGCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCCCACAGTCAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGTCAGTCAGTCATACGTACGTCATGCATACGTAGCTAGCAGACGCAGCATTACGTCGCGATCGATCGATCGGCATAGCAGCATCCCAGTCAGTCATATGCATAGTCGATCGACGTCAGTCATGAGATCGTACGAAATACGTAGCTGATCGACGTCAGTCAGACTGATCGATCGGATTCAGTCACGATGCATGCTAGCAAAGTCAGCGCATGCTAGCTACGTAGTCAGTACTGCATGCAGTACGTACGTAGACGTCAGTCAGTCAGTCATGATGCTAGCTAGCTACGTCACAGTCAGTCATGACTGACTGACTGACTGCAGTCAGTCATCGATACGTAGCTAGCTACGTCAGTCATGCAGTCAGTCATTGATGATCGATCGTACATGCAGATGCCGTAGGCTAGCTAGCTAGCACTACGATGCATGCTAGCTAGCTACGACCAGTACCATGATGACTGCATGATCATACTGCCCAAAAAACGACTTAATCGTATCGTATTTCTGGHACCCCCTGATGTAAAAGAGAGTTCTATATTACTACAACCACGATCGGATTTATTTTGGTCTADCAGCTCAGGATCATCACAGGATTCAAATCCTATCATCAGGAGGGGGGTCGTGGTTGCGCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCCCACAGTCAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGGAGGGGGGTCGTGGTTGCGCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCCCACAGTCAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGTCAGTCAGTCATACGTACGTCATGCATACGTAGCTAGCAGACGCAGCATTACG
TCGCGATCGATCGATCGGCATAGCAGCATCCCAGTCAGTCATATGCATAGTCGATCGACGTCAGTCATGAGATCGTACGAAATACGTAGCTGATCGACGTCAGTCAGACTGATCGATCGGATTCAGTCACGATGCATGCTAGCAAAGTCAGCGCATGCTAGCTACGTAGTCAGTACTGCATGCAGTACGTACGTAGACGTCAGTCAGTCAGTCATGATGCTAGCTAGCTACGTCACAGTCAGTCATGACTGACTGACTGACTGCAGTCAGTCATCGATACGTAGCTAGCTACGTCAGTCATGCAGTCAGTCATTGATCATGATCATACTGCCCAAAAAACGACTTA

Bioinformatics = sequence analysis

Biological sequences come in two types: DNA and protein

DNA has a four-letter alphabet

Protein has a 20-letter alphabet

Sequences are an abstraction. As such, they are treated abstractly...

Sequence alignment

Phylogenetic trees

Gene finding

Data mining

"A free-standing reality"




ATGCATCAGGACTAGCTATCAGAATC

Any DNA sequence REPRESENTS a physical object, and some DNA sequences translate to protein sequences, which also REPRESENT physical objects.

behind the abstraction...

Sequence = Structure

Structure = Function

Function = Life

__________________

Sequence = Life

The protein folding problem

Unfolded Folded

This happens spontaneously (in water).

Sequence = Structure

The problem with the protein folding problem.

Number of amino acid residues in a typical protein: 100

Approximate number of degrees of freedom per residue: 3

Estimated total number of conformations ($= 3^{100}$): $\approx 5 \times 10^{47}$

Time required to fold if all conformations are sampled at the rate of 1 per $10^{-15}$ s: $\approx 10^{25}$ y

Time since the Big Bang: $\approx 13 \times 10^{9}$ y
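The arithmetic on this slide can be checked in a few lines. This is a minimal sketch that treats each residue as having 3 discrete states; the 365.25-day year and all variable names are assumptions made here, not part of the slides.

```python
# Order-of-magnitude check of the exhaustive conformational search time.
# Assumption (not from the slides): one year = 365.25 days.

RESIDUES = 100                    # residues in a typical protein
STATES_PER_RESIDUE = 3            # degrees of freedom per residue, as discrete states
SAMPLES_PER_SECOND = 1e15         # one conformation every 1e-15 s
SECONDS_PER_YEAR = 365.25 * 24 * 3600
TIME_SINCE_BIG_BANG_Y = 13e9

conformations = STATES_PER_RESIDUE ** RESIDUES            # exact integer
search_time_y = conformations / SAMPLES_PER_SECOND / SECONDS_PER_YEAR
ratio = search_time_y / TIME_SINCE_BIG_BANG_Y

print(f"conformations:           {conformations:.2e}")
print(f"exhaustive search time:  {search_time_y:.2e} years")
print(f"multiples of the age of the universe: {ratio:.1e}")
```

Whatever the exact rounding, the search time exceeds the age of the universe by many orders of magnitude, which is the point of the slide.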

pathways

folding pathways must exist

The protein is unfolded...

...something happens first...

...then something else happens.

Early events eliminate alternative pathways

What happens first?

Helix/coil transition 10-100ns

Beta-hairpin 0.1-1.0 µs

transient intermediates < 1ms

equilibrium 0.001-1.0 s

Local structure usually isn't stable

Helices and turns form quickly but just as quickly fall apart.

Most short peptides (<20aa) do not show structural stability in NMR studies.

Exceptions: A few short peptides have been shown to be conformationally stable (for example, Met-enkephalin = YGGFM).

Interesting parallels between bioinformatics and semantics

language      proteins
letters       amino acids
words         motifs
phrases       modules
sentences     whole proteins
meaning       structure
literature    genome
grammar       folding??

ATCTGTATCGTATCGTATTTCTGGHACCCCCTGATGTAAAAGAGAGTTCTATATTACTACAACCACGATCGGATTTATTTTGGTCTADCAGCTCAGGATCATCACAGGATTCAAATCCTATCATCAGGAGGGGGGTCGTGGTTGCGCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCCCACAGTCAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGTCAGTCAGTCATACGTACGTCATGCATACGTAGCTAGCAGACGCAGCATTACGTCGCGATCGATCGATCGGCATAGCAGCATCCCAGTCAGTCATATGCATAGTCGATCGACGTCAGTCATGAGATCGTACGAAATACGTAGCTGATCGACGTCAGTCAGACTGATCGATCGGATTCAGTCACGATGCATGCTAGCAAAGTCAGCGCATGCTAGCTACGTAGTCAGTACTGCATGCAGTACGTACGTAGACGTCAGTCAGTCAGTCATGATGCTAGCTAGCTACGTCACAGTCAGTCATGACTGACTGACTGACTGCAGTCAGTCATCGATACGTAGCTAGCTACGTCAGTCATGCAGTCAGTCATTGATGATCGATCGTACATGCAGATGCCGTAGGCTAGCTAGCTAGCACTACGATGCATGCTAGCTAGCTACGACCAGTACCATGATGACTGCATGATCATACTGCCCAAAAAACGACTTAATCGTATCGTATTTCTGGHACCCCCTGATGTAAAAGAGAGTTCTATATTACTACAACCACGATCGGATTTATTTTGGTCTADCAGCTCAGGATCATCACAGGATTCAAATCCTATCATCAGGAGGGGGGTCGTGGTTGCGCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCCCACAGTCAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGTCAGTCAGTCATACGTACGTCATGCATACGTAGCTAGCAGACGCAGCATTACGTCGCGATCGATCGATCGGCATAGCAGCATCCCAGTCAGTCATATGCATAGTCGATCGACGTCAGTCATGAGATCGTACGAAATACGTAGCTGATCGACGTCAGTCAGACTGATCGATCGGATTCAGTCACGATGCATGCTAGCAAAGTCAGCGCATGCTAGCTACGTAGTCAGTACTGCATGCAGTACGTACGTAGACGTCAGTCAGTCAGTCATGATGCTAGCTAGCTACGTCACAGTCAGTCATGACTGACTGACTGACTGCAGTCAGTCATCGATACGTAGCTAGCTACGTCAGTCATGCAGTCAGTCATTGATGATCGATCGTACATGCAGATGCCGTAGGCTAGCTAGCTAGCACTACGATGCATGCTAGCTAGCTACGACCAGTACCATGATGACTGCATGATCATACTGCCCAAAAAACGACTTAATCGTATCGTATTTCTGGHACCCCCTGATGTAAAAGAGAGTTCTATATTACTACAACCACGATCGGATTTATTTTGGTCTADCAGCTCAGGATCATCACAGGATTCAAATCCTATCATCAGGAGGGGGGTCGTGGTTGCGCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCCCACAGTCAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGGAGGGGGGTCGTGGTTGCGCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCCCACAGTCAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGTCAGTCAGTCATACGTACGTCATGCATACGTAGCTAGCAGACGCAGCATTACG
TCGCGATCGATCGATCGGCATAGCAGCATCCCAGTCAGTCATATGCATAGTCGATCGACGTCAGTCATGAGATCGTACGAAATACGTAGCTGATCGACGTCAGTCAGACTGATCGATCGGATTCAGTCACGATGCATGCTAGCAAAGTCAGCGCATGCTAGCTACGTAGTCAGTACTGCATGCAGTACGTACGTAGACGTCAGTCAGTCAGTCATGATGCTAGCTAGCTACGTCACAGTCAGTCATGACTGACTGACTGACTGCAGTCAGTCATCGATACGTAGCTAGCTACGTCAGTCATGCAGTCAGTCATTGATCATGATCATACTGCCCAAAAAACGACTTA

Does anyone know the words?

What if we use the enormous database of protein sequences to find recurrent short patterns?

Those short patterns would be the words.

But, are they "meaningful words"?

(Does the sequence correlate with the local structure?)

Maybe protein folding pathways can be found in protein sequence "grammar":

1. Letters

2. Words

3. Phrases

4. Sentences

Amino acids can be grouped

   A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
   4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -2  A
       9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -2  C
           6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -3  D
               5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -2  E
                   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1   3  F
                       6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -3  G
                           8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2   2  H
                               4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  I
                                   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -2  K
                                       4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  L
                                           5  -2  -2   0  -1  -1  -1   1  -1  -1  M
                                               6  -2   0   0   1   0  -3  -4  -2  N
                                                   7  -1  -2  -1  -1  -2  -4  -3  P
                                                       5   1   0  -1  -2  -2  -1  Q
                                                           5  -1  -1  -3  -3  -2  R
                                                               4   1  -2  -3  -2  S
                                                                   5   0  -2  -2  T
                                                                       4  -3  -1  V
                                                                          11   2  W
                                                                               7  Y

Sequence alignments show evolutionary diversity

VIVAANRSA
VIVSAARTA
VIASAVRTA
VIVDAGRSA
VIASGVRTA
VIVAAKRTA
VIVSAVRTP
VIVSAARTA
VIVSAVRTP
VIVDAGRTA
VIVDAGRTA
VIVSGARTP
VIVDFGRTP
VIVSATRTP
VIVSATRTP
VIVGALRTP
VIVSATRTP
VIVSATRTP
VIASAARTA
VIVDAIRTP
VIVAAYRTA
VIVSAARTP
VIVDAIRTP
VIVSAVRTA
VIVAAHRTA

••••••

Sequence alignment

Sequence profile

$P_{ij} = \dfrac{\sum_{k \in \mathrm{seqs}} w_k \, \delta(s_{kj} = aa_i)}{\sum_{k \in \mathrm{seqs}} w_k}$

Sequence profiles are condensed sequence alignments
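As an illustration of how an alignment condenses into a profile, here is a minimal sketch that computes weighted residue frequencies per column. The function name `sequence_profile`, the uniform default weights, and the choice of four sequences from the alignment slide are assumptions made for this example; real profiles typically also apply sequence weighting and a comparison against background frequencies, which are omitted here.

```python
from collections import defaultdict

def sequence_profile(aligned_seqs, weights=None):
    """Return profile[j][aa]: weighted frequency of residue aa in column j."""
    if weights is None:
        weights = [1.0] * len(aligned_seqs)   # assumption: equal weights
    total = sum(weights)
    profile = []
    for j in range(len(aligned_seqs[0])):
        column = defaultdict(float)
        for seq, w in zip(aligned_seqs, weights):
            column[seq[j]] += w               # add the weight of each matching sequence
        profile.append({aa: w / total for aa, w in column.items()})
    return profile

# Four sequences taken from the alignment shown earlier.
alignment = ["VIVAANRSA", "VIVSAARTA", "VIASAVRTA", "VIVDAGRSA"]
profile = sequence_profile(alignment)

for j, column in enumerate(profile, start=1):
    print(j, dict(sorted(column.items())))
```

A fully conserved column collapses to a single entry with frequency 1, while a variable column spreads its probability over several residues.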

Red = high prob ratio (> 3)

Green = background prob ratio (~1)

Blue = low prob ratio (< 1/3)

(Gribskov)

“distance” between two points $j$ and $k$ $= \sum_{i=1}^{20} |P_{ijl} - P_{ikl}|$, with $l = 1$

each dot represents a different 1-residue profile

Clustering profiles

Resulting clusters:

K Q R
A S T
A C
S W Y F
A P G
D E N
I L V M
H Y

did it!

"Kmeans" clustering

Protein sequence grammar

1. Letters: amino acid profiles

2. Words

3. Phrases

4. Sentences

“distance” from $j$ to $k$ $= \sum_{l=1}^{L} \sum_{i=1}^{20} |P_{ijl} - P_{ikl}|$

each dot represents a different short profile

~120,000 segments

[Figure: sequence profile of a short segment. Horizontal axis: position (26 to 32). Vertical axis: amino acids (AA), grouped as C, FLIV, WYM, AQNT, SHRK, EDPG.]

Clustering profile segments, length L

~800 clusters for each L

L = 3 to 15
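A minimal sketch of clustering profile segments, assuming each segment is flattened into one vector (positions times amino acids) and compared by the sum of absolute differences. The farthest-first initialization, the mean update, the toy 3-letter alphabet, and all names are assumptions made for this example; the slides do not give these details.

```python
def l1_distance(p, q):
    """Sum of absolute differences over all positions and amino acids."""
    return sum(abs(a - b) for a, b in zip(p, q))

def kmeans_l1(points, k, max_iterations=50):
    # Farthest-first initialization (an assumption, chosen to be deterministic).
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(l1_distance(p, c) for c in centers))
        centers.append(list(farthest))
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each segment joins its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: l1_distance(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break                      # converged
        assignment = new_assignment
        # Update step: each center becomes the mean profile of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Toy segments of length L = 2 over a 3-letter alphabet (hypothetical data).
segments = [
    [1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.8, 0.2, 0.0, 0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9, 0.1, 0.0, 0.9],
]
assignment, centers = kmeans_l1(segments, k=2)
print(assignment)
```

With real data each vector would have 20 x L entries, and k would be on the order of the ~800 clusters mentioned above.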

Learning the structure of each sequence cluster

the database

Search the database for the 400 nearest neighbors

remove all cluster members that do not conform with the paradigm

profile of cluster

cluster of nearest neighbors

After convergence, a cross-validation test is done.

I-sites library of sequence structure motifs

1000's of sequence clusters

supervised learning

Cross-validation

262 motifs

Number of different motifs after removing register variants: 31

Example of a motif

Sequences that match sequence profile....

...tend to have the same structure...

...and this is it.

Clustering finds previously known sequence-structure motifs

amphipathic α-helix

amphipathic β-strand

α-helix N-cap

p•nppn• nS••En•p •n•n

Many new motifs are found

diverging type-2 turn

Serine hairpin Type-I hairpin

Frayed helix

Proline helix C-cap

alpha-alpha corner

glycine helix N-cap

Why are there motifs in proteins?

Ancient conserved regions?

Selection for stability?

Folding initiation sites?

Structural features seem to drive clustering.

1. glycine at strained angles

2. conserved sidechain contacts

3. negative design against alternative structures (helix)

                             Number of  Sites / 100 positions    Average boundaries      Pattern of conserved
    Motif                    clusters   overall  confid. > 0.60  mda°  dme   rmsd (len)  non-polar residues
 1  Amphipathic α-helix      13         3.1      0.9             56    0.71  0.78 (15)   1-4-8, 1-5-8
 2  Non-polar α-helix        6          0.9      0.12            54    0.58  0.40 (11)   1-4-8, 1-5-8
 3  Schellman cap Type 1     6          0.09     0.07            81    1.01  1.02 (15)   1-6-9-11
 4  Schellman cap Type 2     10         0.3      0.14            76    0.94  0.94 (15)   1-6-8-9
 5  Proline α-helix C cap    10         1.8      0.6             92    1.07  0.89 (13)   1-2-5-8
 6  Frayed helix             2          1.2      0.13            75    0.96  0.69 (15)   1-5-9-13
 7  Helix N capping box      10         1.1      0.6             99    0.95  0.65 (15)   1-6-9-13
 8  Amphipathic β-strand     8          6.8      2.1             89    0.87  0.87 (6)    1-3, 1-3-5
 9  Hydrophobic β-strand     5          2.3      0.3             101   0.91  0.91 (7)    1-2-3
10  β-bulge                  2          0.5      0.15            100   0.97  0.78 (7)    1-4-6
11  Serine β-hairpin         4          1.3      0.3             94    0.76  0.81 (9)    1-8
12  Type-I hairpin           2          0.07     0.04            80    0.94  1.23 (13)   1-7-8
13  Diverging Type-II turn   4          0.3      0.14            87    1.04  1.00 (9)    1-7-9

I-sites sequence patterns are distinct

(Bystroff & Baker, J. Mol. Biol, 1998)

A hypothesis:

I-sites sequence motifs are folding initiation sites.

• The I-sites sequence patterns are mutually exclusive.

• Each I-sites motif is found in a variety of contexts.

• Local structure forms fast.

• Early-folding units 'initiate' folding.

One reason this hypothesis may be wrong:

Database statistics may reflect bias in the data.

Alpha helices may fold by packing interactions.

Dots show positions of alpha-carbons relative to the amphipathic helix motif. The hydrophobic side is up.

maybe not...

How do we test this hypothesis?

• See if I-sites peptides fold in isolation from the rest of the protein.

... by NMR.

... by simulation.

[Figure, panels (a) to (d): sequence profiles with horizontal axes "position" (26 to 32 and 1 to 7) and amino acids (AA) grouped as C, FLIV, WYM, AQNT, SHRK, EDPG. Color scale: ≥1, 0.8, 0.6, 0.4, 0.2, 0.0, -0.2, -0.4, -0.6, -0.8, ≤-1.]

NMR structure of a 7-residue I-sites motif in isolation

(Yi et al, J. Mol. Biol, 1998)

diverging turn

Partial literature search of peptide NMR structures

I-sites motif        Authors    Date
glycine helix cap    Viguera    1995
serine hairpin       Blanco     1994
Type-I hairpin       deAlba     1996
diverging turn       Sieber     1996

Molecular dynamics... is a cheap substitute for an NMR spectrometer.

What is MD?

• A simulation of the dynamic behavior of the molecule in water, using "first principles."

Advantages?

• You can observe the system directly.

Disadvantages?

• It's not a real system, just an approximation.

Helical peptide simulations

Sequences:

AAALDRMR AALEALLR AANRSHMP AARYKFIE ADFKAAVA AFDGETEI AKELVVVY
AKGVETAD ARFTKRLG ATLEEKLN CNGGHWIA DAVTRYWP DEAIDAYI DELTRHIR
DYVRSKIA EDLVERLK EELKQALR EEMVSKLK EKLLESLE EKPFGTSY EQIKAAVK

FHMYFMLR FSVMNDAS FYSSYVYL GQLMALKQ HNLIEAFE IEHTLNEK IQNGDWTF
KAAIAQLR KKYRPETD KNPDNVVG KPMGPLLV KQAHPDLK KQDKHYGY KSYLRSLR
LDLHQTYL NAVWAAIK NETHSGRK NFLEVGEY NPVKESRH PAIISAAE PLQHHNLL

PRDANTSH QDDARKLM QGIIDKLD QKMKTYFN QTLAQLSV RDFEERMN RIILDRHR RLLLKAYR
RPIARMLS RVLGRDLF SCDVKFPI TEVMKRLV TLNEKRIL YASLRSLV YESHVGCR

• AMBER (parm94) force field.

• Randomly chosen natural sequences.

• Initially extended.

• 800-900 waters added.

• Ions added (Na, Cl).

• 7-30 ns at 340 K.





The MD scheme

• Select random peptides and predict how much helix they will have, using the I-sites motif pattern.

• Run LONG simulations.

• Test to see whether they have reached equilibrium.

• If they have, find out how much of the time the peptide spent in a helical state. (by cluster analysis)

• Does the fraction helix correlate with the prediction?

Cluster analysis of trajectories

1) Define a node for every step in the trajectory, keeping the backbone angles (θ).

2) For each node, draw an edge to every other node for which max(Δθ) < 60°.

3) The node with the most edges defines the first cluster. Remove it and all its neighbors. Among the remaining nodes, the one with the most edges defines the second cluster, and so on.
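The three steps above can be sketched as follows. This is an illustrative version, not the code used in the study: the periodic wrap-around of angle differences, the tie-breaking rule (lowest frame index wins), and the toy frames are assumptions made here.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (periodic)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def are_neighbors(frame_a, frame_b, cutoff=60.0):
    """True if the largest backbone-angle difference is below the cutoff."""
    return max(angle_diff(a, b) for a, b in zip(frame_a, frame_b)) < cutoff

def cluster_trajectory(frames, cutoff=60.0):
    """Greedy clustering: repeatedly pick the node with the most edges."""
    n = len(frames)
    neighbors = {
        i: {j for j in range(n) if j != i and are_neighbors(frames[i], frames[j], cutoff)}
        for i in range(n)
    }
    remaining = set(range(n))
    clusters = []
    while remaining:
        # Count only edges to nodes that have not been removed yet.
        center = max(sorted(remaining), key=lambda i: len(neighbors[i] & remaining))
        members = {center} | (neighbors[center] & remaining)
        clusters.append((center, sorted(members)))
        remaining -= members
    return clusters

# Toy trajectory: each frame lists (phi, psi) angles for two residues.
frames = [
    [-60, -45, -60, -45],     # helical
    [-65, -40, -58, -50],     # helical
    [-70, -35, -62, -42],     # helical
    [-120, 130, -125, 135],   # extended
    [-115, 125, -130, 140],   # extended
    [-60, -45, -120, 130],    # mixed
]
clusters = cluster_trajectory(frames)
for center, members in clusters:
    print(f"center frame {center}: members {members}")
```

Each cluster is reported as its center frame plus the frames within the cutoff, so the fraction of time in a helical state is the size of the helical cluster divided by the number of frames.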

Clusters in conformational space

Our criteria for good clustering: no two clusters look alike, and no cluster looks like two.

RPIARMLS

This is what a trajectory looks like if it has reached equilibrium

ns

cluster number

Both halves of the trajectory have about the same distribution.

This is what it looks like if it has not.

ns

cluster number
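One simple way to quantify "both halves have about the same distribution" is to compare cluster occupancies in the two halves of a trajectory. The overlap measure below and the two toy trajectories are assumptions made for this sketch; the slides do not say which test was actually used.

```python
from collections import Counter

def half_distributions(cluster_ids):
    """Fraction of frames per cluster in the first and second half."""
    mid = len(cluster_ids) // 2
    halves = (cluster_ids[:mid], cluster_ids[mid:])
    return [{c: n / len(h) for c, n in Counter(h).items()} for h in halves]

def distribution_overlap(first, second):
    """1.0 means identical occupancy; 0.0 means the halves share no cluster."""
    clusters = set(first) | set(second)
    return sum(min(first.get(c, 0.0), second.get(c, 0.0)) for c in clusters)

# Cluster number per frame (hypothetical data).
equilibrated = [1, 2, 1, 1, 3, 2, 1, 1, 2, 1, 3, 1, 2, 1, 1, 2]
drifting = [1, 1, 1, 1, 2, 1, 2, 2, 3, 3, 4, 3, 4, 4, 4, 4]

overlap_eq = distribution_overlap(*half_distributions(equilibrated))
overlap_drift = distribution_overlap(*half_distributions(drifting))
print(f"equilibrated: {overlap_eq:.3f}")
print(f"drifting:     {overlap_drift:.3f}")
```

A trajectory that keeps revisiting the same clusters scores close to 1, while one that moves steadily into new clusters scores close to 0.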

NAIIQELE movie





A rough energy landscape.

There is a correlation between I-sites sequence score and the simulations

r = 0.48 (all peptides)

r = 0.61 (trajectories > 20 ns long)

Sampling of sequence space

72 peptides were simulated. Is this a representative sample of the space of amphipathic helix sequences?

I-sites motif

72 peptides, weighted by %helix

72 peptides, unweighted

What does this mean?

The MD experiment separates the local effects from the non-local effects on helix formation.

In the simulation, there are only local interactions.

So the propensity for amphipathic sequences to form helix is mostly intrinsic.

Outliers

• Simulation too short.

We see only meta-stable states.

• I-sites scoring method is missing something.

Using additive probabilities ignores statistical dependence between different positions.

• Part-helix was not counted as helix in this study.

Helix caps are competing motifs.

(+-) and (-+) look just like (++) and (--)

QVFMRIME (a helix in 1dldA)

Predicted to be helix with confidence = 0.86

Zero helix found in 17ns trajectory. What does it fold into?





an outlier


1. Letters: Amino acid profiles

2. Words: I-sites motifs

3. Phrases:

4. Sentences


1. Letters: Amino acid profiles


3. Phrases: a hidden Markov model

4. Sentences

Motif “grammar”?

Arrangement of I-sites motifs in proteins is highly non-random

helix

helix cap

beta strand

beta turn

The dependencies can be modeled as a Markov chain

the

mailman

dog

bit

kicked

back

The dog bit the mailman. The mailman kicked the dog back.

Markov model

Sequence data

Stochastic output The dog back. The mailman kicked the mailman kicked the dog bit the dog bit the dog bit the mailman kicked the dog. ...

How to make a Markov chain
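A minimal sketch of this construction: each word is a Markov state, and the transitions are counted from the two example sentences. The function names and the fixed random seed are assumptions made for this example.

```python
import random
from collections import defaultdict

text = "The dog bit the mailman. The mailman kicked the dog back."

def build_chain(text):
    """Map each word to the list of words observed to follow it."""
    chain = defaultdict(list)
    for sentence in text.lower().split("."):
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return dict(chain)

def generate(chain, start="the", max_words=12, seed=0):
    """Random walk through the chain; stops at a word with no successor."""
    rng = random.Random(seed)
    word, output = start, [start]
    while word in chain and len(output) < max_words:
        word = rng.choice(chain[word])   # repeated entries act as probabilities
        output.append(word)
    return " ".join(output)

chain = build_chain(text)
for word, followers in chain.items():
    print(f"{word:8s} -> {followers}")
print(generate(chain))
```

Every adjacent word pair in the output was seen in the training sentences, yet the output as a whole need not make sense, as in the stochastic output shown on the slide.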

A "hidden" Markov model

What's "hidden" about it?

An HMM is a Markov chain where the meaning of the Markov state is probabilistic.

the

mailman 0.5, postman 0.5

dog

bit 0.3, attacked 0.7

kicked 0.6, hit 0.4

back

The dog bit the mailman. The mailman kicked the dog back. The dog attacked the postman. The postman hit the dog.

hidden Markov model

Sequence alignment data

Stochastic output The dog back. The mailman kicked the postman kicked the dog bit the dog bit the dog attacked the mailman kicked the dog. ...

How to make a hidden Markov chain

One Markov state from HMMSTR

ahi

aij

aik

regions

sequence profile

One state emits one letter of each type (b,r,d,c)

probabilistic meaning of the state

amino acid symbols

structure

symbols

bi = {ACDEF...}

ri = {HGEBdblLex}

di = {HST}

ci = {mnhd...}

previous letter(s)

next letter(s)

Constructing a HMM by aligning motifs

Related motifs, branched model.

[Figure: branched model. Node labels: helix, Type-1 G α-C cap, Type-2 G α-C cap.]

Merging many motifs into one HMM

HMMSTR

Hidden Markov Model for local protein STRucture

282 nodes

317 transitions

Unified model for 31 distinct sequence-structure motifs

(Bystroff & Baker, J. Mol. Biol., 2000)

Variations on a motif theme are modeled as parallel paths

Multiple state-pathways for the helix N-cap motif

Common sub-graphs represent common sub-structures

These peptide segments have the same state sequence (except shaded residues)

How an HMM works

$P(Q \mid S) = \pi_{q_1} \, b_{q_1}(S_1) \prod_{i=2}^{N} a_{q_{i-1} q_i} \, b_{q_i}(S_i)$

initiation probability

transition probability

emission probability

We have S (the sequence). We want Q (the 1D structure), and P (how well S fits Q)
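To make the roles of the initiation, transition, and emission probabilities concrete, here is a small sketch that scores one state path and finds the best path with the Viterbi algorithm. The two-state model and all of its numbers are hypothetical, chosen only for illustration; they are not HMMSTR parameters, and HMMSTR states emit several kinds of symbols where this toy emits only amino acid letters. The product computed here is, strictly speaking, the joint probability of the path and the sequence.

```python
def path_probability(seq, path, pi, a, b):
    """Initiation x emission for the first symbol, then transition x emission."""
    p = pi[path[0]] * b[path[0]][seq[0]]
    for i in range(1, len(seq)):
        p *= a[path[i - 1]][path[i]] * b[path[i]][seq[i]]
    return p

def viterbi(seq, states, pi, a, b):
    """Most probable state path Q for sequence S, with its probability."""
    best = {q: (pi[q] * b[q][seq[0]], [q]) for q in states}
    for symbol in seq[1:]:
        step = {}
        for q in states:
            # Best predecessor state r for arriving in state q.
            prob, path = max(
                ((best[r][0] * a[r][q], best[r][1]) for r in states),
                key=lambda t: t[0],
            )
            step[q] = (prob * b[q][symbol], path + [q])
        best = step
    prob, path = max(best.values(), key=lambda t: t[0])
    return path, prob

# Hypothetical toy model: H = helix-like state, C = coil-like state.
states = ["H", "C"]
pi = {"H": 0.5, "C": 0.5}
a = {"H": {"H": 0.8, "C": 0.2}, "C": {"H": 0.2, "C": 0.8}}
b = {"H": {"A": 0.7, "G": 0.3}, "C": {"A": 0.2, "G": 0.8}}

S = "AAGG"
Q, P = viterbi(S, states, pi, a, b)
print("".join(Q), P)
```

The returned path plays the role of Q (the 1D structure) and the returned probability the role of P (how well S fits Q).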

3-state secondary structure prediction

74.9% correct

74.6% correct

Predicting super-secondary context

Results are for the independent test set.

Fully-automated tertiary structure prediction

(1) Find homologues in the database (Psi-Blast)

(2) Predict local structure (HMMSTR)

(3) Assemble fragments (ROSETTA, D.Baker)

sequence

structure

Protocol used for CAFASP2 experiment (2000)

Rosetta ab initio

Scoring function: Bayesian classification of pairwise secondary structure contact types.

Search function: Monte Carlo fragment insertion. A move consists of selecting a fragment at random from a set of local structure predictions. Coordinates are re-generated after swapping in the new fragment.

(Simons et al, PNAS, 1997)
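The search move described above can be sketched with a toy model in which a conformation is just a list of backbone torsion pairs. Everything specific here is an assumption for illustration: the Metropolis acceptance rule, the temperature, the two-fragment library, and the toy score that simply counts non-helical residues. The real scoring function is the Bayesian one named above, and regenerating 3D coordinates after each swap is omitted.

```python
import math
import random

HELIX = (-60.0, -45.0)       # idealized torsion pairs (phi, psi)
STRAND = (-120.0, 130.0)
EXTENDED = (-150.0, 150.0)

def fragment_insertion(length, fragments, score, steps=2000, temperature=1.0, seed=0):
    """fragments[pos] lists candidate fragments that may start at position pos."""
    rng = random.Random(seed)
    conformation = [EXTENDED] * length            # initially extended chain
    current = score(conformation)
    best, best_score = list(conformation), current
    positions = sorted(fragments)
    for _ in range(steps):
        pos = rng.choice(positions)               # pick an insertion window
        frag = rng.choice(fragments[pos])         # pick a fragment at random
        trial = conformation[:pos] + list(frag) + conformation[pos + len(frag):]
        trial_score = score(trial)
        # Metropolis criterion (lower score is better); an assumption here.
        accept = trial_score <= current or rng.random() < math.exp(
            (current - trial_score) / temperature
        )
        if accept:
            conformation, current = trial, trial_score
            if current < best_score:
                best, best_score = list(conformation), current
    return best, best_score

def toy_score(conformation):
    """Hypothetical score: number of residues that are not helical."""
    return sum(1 for torsions in conformation if torsions != HELIX)

# Hypothetical local structure predictions: two 3-residue fragments per window.
fragments = {0: [[HELIX] * 3, [STRAND] * 3], 3: [[HELIX] * 3, [STRAND] * 3]}
best, best_score = fragment_insertion(6, fragments, toy_score)
print(best_score, best)
```

Because each move replaces a whole fragment rather than a single angle, the search only ever visits local structures that were predicted for that stretch of sequence.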

CASP3 Prediction results for Target 56 : DNA helicase

Predicted structure of 66-residue fragment (23-88)

True structure of same fragment

CAFASP Prediction results for Target 122: 1GEQ Tryptophan Synthase

Predicted 97-residue fragment

True structure of same fragment

Protein sequence grammar

1. Alphabet: amino acid profile


3. Phrases: HMMSTR pathways

4. Sentences: contact maps

the next step...

In progress: Data mining of contact maps

HMMSTR predictions

Protein sequences + contact maps

Association-rule mining (M. Zaki)

Rules for tertiary contacts

Predicting tertiary contacts

Contact predictions for 2igd

overall: 20% coverage with 20% accuracy

Can the 2D map be translated to 3D?

I-sites/HMMSTR collaborators

David Baker, U. Washington
Karen Han, UCSF
Vestienn Thorsson, U. Washington
Qian Yi, U. Washington
Edward Thayer, Zymogenetics
Shekhar Garde, RPI
Mohammed Zaki, RPI
Susan Baxter, Wadsworth (->Novartis)
Chip Lawrence, Wadsworth/RPI
Bobbie Jo Webb, Wadsworth
Kim Simons, U. Washington (->Harvard)

Bystroff Lab

Yu Shao

Xin Yuan

Jerry Huang

isites.bio.rpi.edu
