Protein Folding Initiation Site Motifs

Chris Bystroff
Dept of Biology
Rensselaer Polytechnic Institute, Troy, NY

Post on 15-Jan-2016


Page 1: Protein Folding Initiation Site Motifs Chris Bystroff Dept of Biology Rensselaer Polytechnic Institute, Troy, NY

Protein Folding Initiation Site Motifs

Chris Bystroff

Dept of Biology

Rensselaer Polytechnic Institute, Troy, NY

Page 2

[Slide background: a long run of DNA sequence, beginning ATCTGTATCGTATCGTATTTCTGG...]

Bioinformatics = sequence analysis

Biological sequences come in two types: DNA and protein

DNA has a four-letter alphabet

Protein has a 20-letter alphabet

Sequences are an abstraction. As such, they are treated abstractly...

Sequence alignment

Phylogenetic trees

Gene finding

Data mining

Page 3

"A free-standing reality"


ATGCATCAGGACTAGCTATCAGAATC

Any DNA sequence REPRESENTS a physical object, and some DNA sequences translate to protein sequences, which also REPRESENT physical objects.

behind the abstraction...

Page 4

Sequence = Structure

Structure = Function

Function = Life

__________________

Sequence = Life

Page 5

The protein folding problem

Unfolded Folded

This happens spontaneously (in water).

Sequence = Structure

Page 6

The problem with the protein folding problem.

Number of amino acid residues in a typical protein: 100

Approximate number of degrees of freedom per residue: 3

Estimated total number of conformations ($= 3^{100}$): $\approx 5 \times 10^{47}$

Time required to fold if all conformations are sampled at the rate of 1 per $10^{-15}$ s: $\approx 10^{25}$ y

Time since the Big Bang: $\approx 13 \times 10^{9}$ y
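The arithmetic behind this estimate can be checked in a few lines. This is only a sketch: the residue count, states per residue, and sampling rate are the slide's figures, while the seconds-per-year constant and the function name are assumptions made for the example.

```python
import math

RESIDUES = 100              # residues in a typical protein (slide figure)
STATES_PER_RESIDUE = 3      # states per residue (slide figure)
SAMPLE_TIME_S = 1e-15       # seconds spent on each conformation (slide figure)
SECONDS_PER_YEAR = 3.156e7  # assumption: a 365.25-day year

def levinthal_estimate(residues=RESIDUES, states=STATES_PER_RESIDUE,
                       sample_time_s=SAMPLE_TIME_S):
    """Return (log10 of the conformation count, log10 of the search time in years)."""
    log10_conformations = residues * math.log10(states)
    log10_years = (log10_conformations
                   + math.log10(sample_time_s)
                   - math.log10(SECONDS_PER_YEAR))
    return log10_conformations, log10_years

log_conf, log_years = levinthal_estimate()
print(f"conformations ~ 10^{log_conf:.1f}")
print(f"exhaustive search ~ 10^{log_years:.1f} years")
```

Working in log10 avoids huge integers and makes the comparison with the age of the universe (about 10^10 years) immediate.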

Page 7

pathways

Page 8

folding pathways must exist

The protein is unfolded...

...something happens first...

...then something else happens.

Page 9

Early events eliminate alternative pathways

Page 10

What happens first?

Helix/coil transition 10-100 ns

Beta-hairpin 0.1-1.0 µs

transient intermediates < 1 ms

equilibrium 0.001-1.0 s

Page 11

Local structure usually isn't stable

Helices and turns form quickly but just as quickly fall apart.

Most short peptides (<20aa) do not show structural stability in NMR studies.

Exceptions: A few short peptides have been shown to be conformationally stable (for example Met-enkephalin = YGGFM).

Page 12

Interesting parallels between bioinformatics and semantics

language      proteins
letters       amino acids
words         motifs
phrases       modules
sentences     whole proteins
meaning       structure
literature    genome
grammar       folding??

Page 13

[Slide background: a long run of DNA sequence, beginning ATCTGTATCGTATCGTATTTCTGG...]

Does anyone know the words?

What if we use the enormous database of protein sequences to find recurrent short patterns?

Those short patterns would be the words.

But, are they "meaningful words"?

(Does the sequence correlate with the local structure?)

Page 14

Maybe protein folding pathways can be found in protein sequence "grammar"

1. Letters

2. Words

3. Phrases

4. Sentences

Page 15

Amino acids can be grouped

    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
A   4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -2
C       9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -2
D           6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -3
E               5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -2
F                   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1   3
G                       6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -3
H                           8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2   2
I                               4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1
K                                   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -2
L                                       4   2  -3  -3  -2  -2  -2  -1   1  -2  -1
M                                           5  -2  -2   0  -1  -1  -1   1  -1  -1
N                                               6  -2   0   0   1   0  -3  -4  -2
P                                                   7  -1  -2  -1  -1  -2  -4  -3
Q                                                       5   1   0  -1  -2  -2  -1
R                                                           5  -1  -1  -3  -3  -2
S                                                               4   1  -2  -3  -2
T                                                                   5   0  -2  -2
V                                                                       4  -3  -1
W                                                                          11   2
Y                                                                               7

Page 16

Sequence alignments show evolutionary diversity

Page 17

VIVAANRSA
VIVSAARTA
VIASAVRTA
VIVDAGRSA
VIASGVRTA
VIVAAKRTA
VIVSAVRTP
VIVSAARTA
VIVSAVRTP
VIVDAGRTA
VIVDAGRTA
VIVSGARTP
VIVDFGRTP
VIVSATRTP
VIVSATRTP
VIVGALRTP
VIVSATRTP
VIVSATRTP
VIASAARTA
VIVDAIRTP
VIVAAYRTA
VIVSAARTP
VIVDAIRTP
VIVSAVRTA
VIVAAHRTA
...

Sequence alignment

Sequence profile

$$P_{ij} = \frac{\sum_{k \in \mathrm{seqs}} w_k \, \delta(s_{kj} = aa_i)}{\sum_{k \in \mathrm{seqs}} w_k}$$

Sequence profiles are condensed sequence alignments

Red = high prob ratio (> 3)
Green = background prob ratio (~1)
Blue = low prob ratio (< 1/3)

(Gribskov)
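A profile of this kind can be sketched as a weighted count of each amino acid in each alignment column. The function name and the choice of equal weights are assumptions for illustration; the five sequences are taken from the alignment on this slide.

```python
from collections import defaultdict

def sequence_profile(seqs, weights=None):
    """Weighted frequency of each amino acid at each column of an alignment."""
    if weights is None:
        weights = [1.0] * len(seqs)  # assumption: every sequence counts equally
    total = sum(weights)
    profile = []
    for j in range(len(seqs[0])):
        column = defaultdict(float)
        for seq, w in zip(seqs, weights):
            column[seq[j]] += w      # adds w_k when sequence k has this residue at column j
        profile.append({aa: count / total for aa, count in column.items()})
    return profile

alignment = ["VIVAANRSA", "VIVSAARTA", "VIASAVRTA", "VIVDAGRSA", "VIASGVRTA"]
profile = sequence_profile(alignment)
for j, column in enumerate(profile, start=1):
    print(j, dict(sorted(column.items())))
```

Amino acids that never occur in a column are simply absent from that column's dictionary, which corresponds to a probability of zero.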

Page 18

"distance" between two points $j$ and $k$ (with $l = 1$) = $\sum_{i=1}^{20} |P_{ijl} - P_{ikl}|$

each dot represents a different 1-residue profile
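The distance between 1-residue profiles, and a k-means style clustering built on it, might look like the sketch below. Seeding with the first k points, averaging members to get a centroid, and the toy profiles themselves are all assumptions for the example, not the procedure used to produce the clusters on this slide.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def l1_distance(p, q):
    """Sum over the 20 amino acids of |P_i - Q_i|."""
    return sum(abs(a - b) for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Assign each profile to its nearest centroid, then recompute centroids."""
    centroids = [list(p) for p in points[:k]]  # assumption: seed with the first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda c: l1_distance(p, centroids[c]))
                      for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

def toy_profile(residues):
    """Hypothetical 1-residue profile: equal probability for the listed residues."""
    return [1.0 / len(residues) if aa in residues else 0.0 for aa in AMINO_ACIDS]

points = [toy_profile("ILV"), toy_profile("KR"), toy_profile("IL"), toy_profile("KQR")]
assignment, centroids = kmeans(points, k=2)
print(assignment)
```

Strictly speaking, pairing an L1 distance with mean centroids is a hybrid; a median centroid would match the L1 distance more closely.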

Clustering profiles

Resulting clusters:

K Q R
A S T
A C S
W Y F
A P G
D E N
I L V M
H Y

did it!

"Kmeans" clustering

Page 19

Protein sequence grammar

1. Letters: amino acid profiles

2. Words

3. Phrases

4. Sentences


Page 21

Clustering profile segments, length L

"distance" between segments $j$ and $k$ = $\sum_{l=1}^{L} \sum_{i=1}^{20} |P_{ijl} - P_{ikl}|$

each dot represents a different short profile

~120,000 segments

~800 clusters for each L

L = 3 to 15

[Figure: two example profile segments over positions 26-32, with amino acids (AA) grouped in rows as C, FLIV, WYM, AQNT, SHRK, EDPG.]
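The segment distance extends the single-position distance by summing over the L positions of the window. A minimal sketch, assuming a profile is stored as a list of per-position probability vectors (the function names are made up for the example):

```python
def segment_distance(seg_j, seg_k):
    """Sum over positions l and amino acids i of |P_ijl - P_ikl|."""
    if len(seg_j) != len(seg_k):
        raise ValueError("segments must have the same length L")
    return sum(abs(a - b)
               for col_j, col_k in zip(seg_j, seg_k)
               for a, b in zip(col_j, col_k))

def windows(profile, length):
    """All overlapping segments of the given length from one profile."""
    return [profile[s:s + length] for s in range(len(profile) - length + 1)]
```

Calling `windows` for each L from 3 to 15 over every profile in a database would produce the pool of segments to be clustered.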

Page 22

Learning the structure of each sequence cluster

the database

Search the database for the 400 nearest neighbors

remove all cluster members that do not conform with the paradigm

profile of cluster

cluster of nearest neighbors

After convergence, a cross-validation test is done.

Page 23

I-sites library of sequence structure motifs

1000's of sequence clusters

supervised learning

Cross-validation

262 motifs

Number of different motifs after removing register variants: 31

Page 24

Example of a motif

Sequences that match sequence profile....

...tend to have the same structure...

...and this is it.

Page 25

Clustering finds previously known sequence-structure motifs

amphipathic α-helix

amphipathic β-strand

α-helix N-cap

p•nppn• nS••En•p •n•n

Page 26

Many new motifs are found

diverging type-2 turn

Serine hairpin Type-I hairpin

Frayed helix

Proline helix C-cap

alpha-alpha corner

glycine helix N-cap

Page 27

Page 28

Why are there motifs in proteins?

Ancient conserved regions?

Selection for stability?

Folding initiation sites?

Page 29

Structural features seem to drive clustering.

1. glycine at strained angles

2. conserved sidechain contacts

3. negative design against alternative structures (helix)

Page 30

                             Number of   Pattern sites / 100 positions   Average                      Boundaries of conserved
 #  Motif                    clusters    overall    confid. > 0.60       mda°   dme    rmsd (len)     non-polar residues
 1  Amphipathic α-helix      13          3.1        0.9                  56     0.71   0.78 (15)      1-4-8, 1-5-8
 2  Non-polar α-helix        6           0.9        0.12                 54     0.58   0.40 (11)      1-4-8, 1-5-8
 3  Schellman cap Type 1     6           0.09       0.07                 81     1.01   1.02 (15)      1-6-9-11
 4  Schellman cap Type 2     10          0.3        0.14                 76     0.94   0.94 (15)      1-6-8-9
 5  Proline α-helix C cap    10          1.8        0.6                  92     1.07   0.89 (13)      1-2-5-8
 6  Frayed helix             2           1.2        0.13                 75     0.96   0.69 (15)      1-5-9-13
 7  Helix N capping box      10          1.1        0.6                  99     0.95   0.65 (15)      1-6-9-13
 8  Amphipathic β-strand     8           6.8        2.1                  89     0.87   0.87 (6)       1-3, 1-3-5
 9  Hydrophobic β-strand     5           2.3        0.3                  101    0.91   0.91 (7)       1-2-3
10  β-bulge                  2           0.5        0.15                 100    0.97   0.78 (7)       1-4-6
11  Serine β-hairpin         4           1.3        0.3                  94     0.76   0.81 (9)       1-8
12  Type-I hairpin           2           0.07       0.04                 80     0.94   1.23 (13)      1-7-8
13  Diverging Type-II turn   4           0.3        0.14                 87     1.04   1.00 (9)       1-7-9

I-sites sequence patterns are distinct

(Bystroff & Baker, J. Mol. Biol, 1998)

Page 31

A hypothesis:

I-sites sequence motifs are folding initiation sites.

• The I-sites sequence patterns are mutually exclusive.

• Each I-sites motif is found in a variety of contexts.

• Local structure forms fast.

• Early-folding units 'initiate' folding.

One reason this hypothesis may be wrong:

Database statistics may reflect bias in the data.

Page 32

Alpha helices may fold by packing interactions.

Dots show positions of alpha-carbons relative to the amphipathic helix motif. The hydrophobic side is up.

maybe not...

Page 33

How do we test this hypothesis?

• See if I-sites peptides fold in isolation from the rest of the protein.

... by NMR.

... by simulation.

Page 34

NMR structure of a 7-residue I-sites motif in isolation

[Figure, panels (a)-(d): sequence profiles over positions 26-32 and 1-7, with amino acids (AA) grouped in rows as C, FLIV, WYM, AQNT, SHRK, EDPG; color scale from ≥1.0 to ≤-1.]

(Yi et al, J. Mol. Biol, 1998)

diverging turn

Page 35

Partial literature search of peptide NMR structures

I-sites motif        Authors    Date
glycine helix cap    Viguera    1995
serine hairpin       Blanco     1994
Type-I hairpin       deAlba     1996
diverging turn       Sieber     1996

Page 36

Molecular dynamics... is a cheap substitute for an NMR spectrometer.

What is MD?

• A simulation of the dynamic behavior of the molecule in water, using "first principles."

Advantages?

• You can observe the system directly.

Disadvantages?

• It's not a real system, just an approximation.

Page 37

Helical peptide simulations

Sequences:

AAALDRMR AALEALLR AANRSHMP AARYKFIE ADFKAAVA AFDGETEI AKELVVVY
AKGVETAD ARFTKRLG ATLEEKLN CNGGHWIA DAVTRYWP DEAIDAYI DELTRHIR
DYVRSKIA EDLVERLK EELKQALR EEMVSKLK EKLLESLE EKPFGTSY EQIKAAVK

FHMYFMLR FSVMNDAS FYSSYVYL GQLMALKQ HNLIEAFE IEHTLNEK IQNGDWTF
KAAIAQLR KKYRPETD KNPDNVVG KPMGPLLV KQAHPDLK KQDKHYGY KSYLRSLR
LDLHQTYL NAVWAAIK NETHSGRK NFLEVGEY NPVKESRH PAIISAAE PLQHHNLL

PRDANTSH QDDARKLM QGIIDKLD QKMKTYFN QTLAQLSV RDFEERMN RIILDRHR
RLLLKAYR RPIARMLS RVLGRDLF SCDVKFPI TEVMKRLV TLNEKRIL YASLRSLV
YESHVGCR

• AMBER (parm94) force field.
• Randomly chosen natural sequences.
• Initially extended.
• 800-900 waters added.
• Ions added (Na, Cl).
• 7-30 ns at 340 K.

Page 38

The MD scheme

• Select random peptides and predict how much helix they will have, using the I-sites motif pattern.

• Run LONG simulations.

• Test to see whether they have reached equilibrium.

• If they have, find out how much of the time the peptide spent in a helical state. (by cluster analysis)

• Does the fraction helix correlate with the prediction?

Page 39

Cluster analysis of trajectories

1) Define a node for every step in the trajectory, keep the backbone angles (θ).

2) For each node, draw an edge to every other node for which max(Δθ) < 60°.

3) The node with the most edges defines the first cluster. Remove it and all its neighbors. Then the node with the most remaining edges defines the second cluster, and so on.
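The three steps above can be sketched as a greedy graph clustering. Two details are assumptions not stated on the slide: angle differences are wrapped around 360°, and edges are recounted among the remaining nodes after each cluster is removed. Ties are broken by taking the earliest frame.

```python
def max_angle_difference(a, b):
    """Largest per-angle difference in degrees, with 360-degree wraparound (assumption)."""
    return max(min(abs(x - y) % 360, 360 - abs(x - y) % 360) for x, y in zip(a, b))

def cluster_trajectory(frames, cutoff=60.0):
    """frames: one tuple of backbone angles per trajectory step.

    Returns a list of (center_index, member_indices), largest cluster first.
    """
    n = len(frames)
    # Step 2: an edge joins two frames whose largest angle difference is below the cutoff.
    neighbors = {i: {j for j in range(n)
                     if j != i and max_angle_difference(frames[i], frames[j]) < cutoff}
                 for i in range(n)}
    remaining = set(range(n))
    clusters = []
    # Step 3: repeatedly take the node with the most edges and remove it with its neighbors.
    while remaining:
        center = max(sorted(remaining), key=lambda i: len(neighbors[i] & remaining))
        members = {center} | (neighbors[center] & remaining)
        clusters.append((center, sorted(members)))
        remaining -= members
    return clusters

frames = [(-60, -45), (-65, -40), (-58, -47), (120, 130), (125, 135)]
clusters = cluster_trajectory(frames)
print(clusters)
```

Building the neighbor sets is quadratic in the number of frames, so a long trajectory would probably need subsampling first.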

Page 40

Clusters in conformational space

Our criteria for good clustering: no two clusters look alike, and no cluster looks like two.

RPIARMLS

Page 41

This is what a trajectory looks like if it has reached equilibrium

ns

cluster number

Both halves of the trajectory have about the same distribution.
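One way to make this comparison concrete is to measure how different the cluster populations of the two halves are. The sketch below uses total variation distance, which is an assumption: the slide does not say which statistic, if any, was applied.

```python
from collections import Counter

def half_distributions(cluster_ids):
    """Cluster-population fractions for the first and second half of a trajectory."""
    mid = len(cluster_ids) // 2
    halves = (cluster_ids[:mid], cluster_ids[mid:])
    return [{c: n / len(h) for c, n in Counter(h).items()} for h in halves]

def distribution_difference(cluster_ids):
    """Total variation distance between the halves: 0 = identical, 1 = no overlap."""
    first, second = half_distributions(cluster_ids)
    keys = set(first) | set(second)
    return 0.5 * sum(abs(first.get(k, 0.0) - second.get(k, 0.0)) for k in keys)

equilibrated = [1, 2, 1, 3, 2, 1, 2, 1, 3, 1, 1, 2]   # made-up cluster numbers per step
drifting = [1, 1, 1, 1, 2, 2, 2, 2]                   # made-up cluster numbers per step
print(distribution_difference(equilibrated), distribution_difference(drifting))
```

A value near zero is consistent with equilibrium; where to draw the line is a judgment call that this sketch leaves open.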

Page 42

This is what it looks like if it has not.

ns

cluster number

Page 43

NAIIQELE movie


A rough energy landscape.

Page 44

There is a correlation between I-sites sequence score and the simulations

r = 0.48 (all peptides)
r = 0.61 (trajectories > 20 ns long)

Page 45

Sampling of sequence space

72 peptides were simulated. Is this a representative sample of the space of amphipathic helix sequences?

I-sites motif 72 peptides, weighted by %helix

72 peptides, unweighted

Page 46

What does this mean?

The MD experiment separates the local effects from the non-local effects on helix formation.

In the simulation, there are only local interactions.

So the propensity for amphipathic sequences to form helix is mostly intrinsic.

Page 47

Outliers

• Simulation too short.

We see only meta-stable states.

• I-sites scoring method is missing something.

Using additive probabilities ignores statistical dependence between different positions.

• Part-helix was not counted as helix in this study.

Helix caps are competing motifs.

(+-) and (-+) look just like (++) and (--)

Page 48

QVFMRIME (a helix in 1dldA)

Predicted to be helix with confidence = 0.86

Zero helix found in 17ns trajectory. What does it fold into?


an outlier

Page 49

Protein sequence grammar

1. Letters: Amino acid profiles

2. Words: I-sites motifs

3. Phrases:

4. Sentences

Page 50

Protein sequence grammar

1. Letters: Amino acid profiles

2. Words: I-sites motifs

3. Phrases: a hidden Markov model

4. Sentences

Page 51

Motif “grammar”?

Arrangement of I-sites motifs in proteins is highly non-random

helix

helix cap

beta strand

beta turn

The dependencies can be modeled as a Markov chain

Page 52

the

mailman

dog

bit

kicked back

The dog bit the mailman. The mailman kicked the dog back.

Markov model

Sequence data

Stochastic output The dog back. The mailman kicked the mailman kicked the dog bit the dog bit the dog bit the mailman kicked the dog. ...

How to make a Markov chain
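A word-level Markov chain like the one on this slide can be built by recording, for each word, which words follow it in the sequence data. In this sketch the function names are illustrative, and treating the period as its own token is an assumption.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed to follow it."""
    words = text.lower().replace(".", " .").split()  # assumption: "." is a token
    transitions = defaultdict(list)
    for current, following in zip(words, words[1:]):
        transitions[current].append(following)
    return transitions

def generate(transitions, start, n_words, rng):
    """Random walk over the chain; repeated followers are proportionally more likely."""
    word, output = start, [start]
    for _ in range(n_words - 1):
        choices = transitions.get(word)
        if not choices:
            break
        word = rng.choice(choices)
        output.append(word)
    return " ".join(output)

chain = build_chain("The dog bit the mailman. The mailman kicked the dog back.")
sample = generate(chain, "the", 12, random.Random(0))
print(sample)
```

Each step depends only on the current word, which is why the output is locally plausible but can wander into sentences that were never in the data.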

Page 53

A "hidden" Markov model

What's "hidden" about it?

An HMM is a Markov chain where the meaning of the Markov state is probabilistic.

Page 54

the

mailman 0.5, postman 0.5

dog

bit 0.3, attacked 0.7

kicked 0.6, hit 0.4

back

The dog bit the mailman. The mailman kicked the dog back. The dog attacked the postman. The postman hit the dog.

hidden Markov model

Sequence alignment data

Stochastic output The dog back. The mailman kicked the postman kicked the dog bit the dog bit the dog attacked the mailman kicked the dog. ...

How to make a hidden Markov chain
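The hidden version can be sketched by separating states from the words they emit. The emission probabilities below are the ones on the slide; the state names, the transition table, and the uniform choice between outgoing transitions are assumptions for illustration.

```python
import random

EMISSIONS = {
    "THE": {"the": 1.0},
    "CARRIER": {"mailman": 0.5, "postman": 0.5},
    "DOG": {"dog": 1.0},
    "ATTACK": {"bit": 0.3, "attacked": 0.7},
    "STRIKE": {"kicked": 0.6, "hit": 0.4},
    "BACK": {"back": 1.0},
}

# Assumption: each listed transition is equally likely.
TRANSITIONS = {
    "THE": ["DOG", "CARRIER"],
    "DOG": ["ATTACK", "BACK"],
    "ATTACK": ["THE"],
    "CARRIER": ["STRIKE"],
    "STRIKE": ["THE"],
    "BACK": ["THE"],
}

def emit(state, rng):
    """Draw one word from the state's emission distribution."""
    words, probs = zip(*EMISSIONS[state].items())
    return rng.choices(words, weights=probs, k=1)[0]

def generate(n_words, rng, start="THE"):
    """Return the hidden state path and the words emitted along it."""
    state, path, words = start, [], []
    for _ in range(n_words):
        path.append(state)
        words.append(emit(state, rng))
        state = rng.choice(TRANSITIONS[state])
    return path, words

path, words = generate(10, random.Random(1))
print(" ".join(words))
```

Only `words` would be observed in practice; the state path is what stays hidden, since "mailman" and "postman" come from the same state.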

Page 55

One Markov state from HMMSTR

Transition probabilities: a_hi (from the previous letter(s)); a_ij, a_ik (to the next letter(s))

One state emits one letter of each type (b, r, d, c): the probabilistic meaning of the state.

Amino acid symbols (sequence profile):

b_i = {ACDEF...}

Structure symbols:

r_i = {HGEBdblLex} (regions)

d_i = {HST}

c_i = {mnhd...}

Page 56

Constructing a HMM by aligning motifs

Page 57

Related motifs, branched model.

[Diagram labels: helix; Type-1 G α-C cap; Type-2 G α-C cap]

Merging many motifs into one HMM

Page 58

HMMSTR

Hidden Markov Model for local protein STRucture

282 nodes

317 transitions

Unified model for 31 distinct sequence-structure motifs

(Bystroff & Baker, J. Mol. Biol., 2000)

Page 59

Variations on a motif theme are modeled as parallel paths

Multiple state-pathways for the helix N-cap motif

Page 60

Common sub-graphs represent common sub-structures

These peptide segments have the same state sequence (except shaded residues)

Page 61

How an HMM works

$$P(Q \mid S) = \pi_{q_1}(S_1) \prod_{i=2}^{N} a_{q_{i-1} q_i} \, b_{q_i}(S_i)$$

initiation probability

transition probability

emission probability

We have S (the sequence). We want Q (the 1D structure), and P (how well S fits Q)
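The product of initiation, transition, and emission probabilities for one candidate state path can be evaluated as below. This is a sketch under stated assumptions: the initiation term for the first symbol is taken as the initial-state probability times that state's emission probability, and the two-state model with its numbers is a made-up toy, not HMMSTR.

```python
import math

def path_log_probability(states, symbols, initial, transition, emission):
    """Log of the initiation term times the product of transition and emission terms."""
    if not states or len(states) != len(symbols):
        raise ValueError("states and symbols must be non-empty and equally long")
    # Assumption: initiation term = initial[q1] * emission[q1][S1].
    logp = math.log(initial[states[0]]) + math.log(emission[states[0]][symbols[0]])
    for prev, cur, sym in zip(states, states[1:], symbols[1:]):
        logp += math.log(transition[prev][cur]) + math.log(emission[cur][sym])
    return logp

# Hypothetical toy model with two states, "H" and "C".
initial = {"H": 0.5, "C": 0.5}
transition = {"H": {"H": 0.8, "C": 0.2}, "C": {"H": 0.3, "C": 0.7}}
emission = {"H": {"A": 0.9, "G": 0.1}, "C": {"A": 0.4, "G": 0.6}}

logp = path_log_probability("HHC", "AAG", initial, transition, emission)
print(math.exp(logp))
```

Summing logs instead of multiplying probabilities avoids underflow on long sequences. Finding the best Q for a given S would mean comparing this quantity across paths, typically with the Viterbi algorithm rather than by enumeration.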

Page 62

3-state secondary structure prediction

74.9% correct

74.6% correct

Page 63

Predicting super-secondary context

Results are for the independent test set.

Page 64

Fully-automated tertiary structure prediction

(1) Find homologues in the database (Psi-Blast)

(2) Predict local structure (HMMSTR)

(3) Assemble fragments (ROSETTA, D. Baker)

sequence

structure

Protocol used for CAFASP2 experiment (2000)

Page 65

Rosetta ab initio

Scoring function: Bayesian classification of pairwise secondary structure contact types.

Search function: Monte Carlo fragment insertion. A move consists of selecting a fragment at random from a set of local structure predictions. Coordinates are re-generated after swapping in the new fragment.
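A toy version of the fragment-insertion move is sketched below. It is not Rosetta's implementation: the scoring callable is a hypothetical stand-in for the Bayesian scoring function, the Metropolis acceptance rule and the extended starting conformation are assumptions, and the score is computed directly on torsion angles rather than on regenerated coordinates.

```python
import math
import random

def fragment_insertion_search(n_residues, fragments, score, steps, rng, temperature=1.0):
    """fragments: dict mapping a start position to candidate lists of (phi, psi) pairs.
    score: hypothetical callable on the torsion list; lower is better.
    """
    torsions = [(-150.0, 150.0)] * n_residues  # assumption: start fully extended
    current = score(torsions)
    for _ in range(steps):
        # A move: pick a position and one of its predicted fragments at random.
        start = rng.choice(sorted(fragments))
        fragment = rng.choice(fragments[start])
        trial = torsions[:start] + list(fragment) + torsions[start + len(fragment):]
        trial_score = score(trial)  # a real protocol would rebuild coordinates first
        # Metropolis criterion (assumption; the slide only says "Monte Carlo").
        if trial_score <= current or rng.random() < math.exp((current - trial_score) / temperature):
            torsions, current = trial, trial_score
    return torsions, current
```

Because each move swaps in a whole local structure prediction, the search explores combinations of plausible local conformations instead of arbitrary torsion angles.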

(Simons et al, PNAS, 1997)

Page 66

CASP3 Prediction results for Target 56 : DNA helicase

Predicted structure of 66-residue fragment (23-88)

True structure of same fragment

Page 67

CAFASP Prediction results for Target 122: 1GEQ Tryptophan Synthase

Predicted 97-residue fragment

True structure of same fragment

Page 68

Protein sequence grammar

1. Alphabet: amino acid profile

2. Words: I-sites motifs

3. Phrases: HMMSTR pathways

4. Sentences: contact maps

the next step...

Page 69

In progress: Data mining of contact maps

HMMSTR predictions

Protein sequences + contact maps

Association-rule mining (M. Zaki)

Rules for tertiary contacts

Page 70

Predicting tertiary contacts

Contact predictions for 2igd

Overall: 20% coverage with 20% accuracy

Can the 2D map be translated to 3D?

Page 71

I-sites/HMMSTR collaborators

David Baker, U. Washington
Karen Han, UCSF
Vestienn Thorsson, U. Washington
Qian Yi, U. Washington
Edward Thayer, Zymogenetics
Shekhar Garde, RPI
Mohammed Zaki, RPI
Susan Baxter, Wadsworth (->Novartis)
Chip Lawrence, Wadsworth/RPI
Bobbie Jo Webb, Wadsworth
Kim Simons, U. Washington (->Harvard)

Bystroff Lab

Yu Shao

Xin Yuan

Jerry Huang

isites.bio.rpi.edu