Post on 15-Jan-2016
Protein Folding Initiation Site Motifs
Chris Bystroff
Dept of Biology
Rensselaer Polytechnic Institute, Troy, NY
ATCTGTATCGTATCGTATTTCTGGHACCCCCTGATGTAAAAGAGAGTTCTATATTACTACAACCACGATCGGATTTATTTTGGTCTADCAGCTCAGGATCATCACAGGATTCAAATCCTATCATCAGGAGGGGGGTCGTGGTTGCGCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCCCACAGTCAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGTCAGTCAGTCATACGTACGTCATGCATACGTAGCTAGCAGACGCAGCATTACGTCGCGATCGATCGATCGGCATAGCAGCATCCCAGTCAGTCATATGCATAGTCGATCGACGTCAGTCATGAGATCGTACGAAATACGTAGCTGATCGACGTCAGTCAGACTGATCGATCGGATTCAGTCACGATGCATGCTAGCAAAGTCAGCGCATGCTAGCTACGTAGTCAGTACTGCATGCAGTACGTACGTAGACGTCAGTCAGTCAGTCATGATGCTAGCTAGCTACGTCACAGTCAGTCATGACTGACTGACTGACTGCAGTCAGTCATCGATACGTAGCTAGCTACGTCAGTCATGCAGTCAGTCATTGATGATCGATCGTACATGCAGATGCCGTAGGCTAGCTAGCTAGCACTACGATGCATGCTAGCTAGCTACGACCAGTACCATGATGACTGCATGATCATACTGCCCAAAAAACGACTTAATCGTATCGTATTTCTGGHACCCCCTGATGTAAAAGAGAGTTCTATATTACTACAACCACGATCGGATTTATTTTGGTCTADCAGCTCAGGATCATCACAGGATTCAAATCCTATCATCAGGAGGGGGGTCGTGGTTGCGCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCCCACAGTCAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGTCAGTCAGTCATACGTACGTCATGCATACGTAGCTAGCAGACGCAGCATTACGTCGCGATCGATCGATCGGCATAGCAGCATCCCAGTCAGTCATATGCATAGTCGATCGACGTCAGTCATGAGATCGTACGAAATACGTAGCTGATCGACGTCAGTCAGACTGATCGATCGGATTCAGTCACGATGCATGCTAGCAAAGTCAGCGCATGCTAGCTACGTAGTCAGTACTGCATGCAGTACGTACGTAGACGTCAGTCAGTCAGTCATGATGCTAGCTAGCTACGTCACAGTCAGTCATGACTGACTGACTGACTGCAGTCAGTCATCGATACGTAGCTAGCTACGTCAGTCATGCAGTCAGTCATTGATGATCGATCGTACATGCAGATGCCGTAGGCTAGCTAGCTAGCACTACGATGCATGCTAGCTAGCTACGACCAGTACCATGATGACTGCATGATCATACTGCCCAAAAAACGACTTAATCGTATCGTATTTCTGGHACCCCCTGATGTAAAAGAGAGTTCTATATTACTACAACCACGATCGGATTTATTTTGGTCTADCAGCTCAGGATCATCACAGGATTCAAATCCTATCATCAGGAGGGGGGTCGTGGTTGCGCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCCCACAGTCAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGGAGGGGGGTCGTGGTTGCGCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCCCACAGTCAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGTCAGTCAGTCATACGTACGTCATGCATACGTAGCTAGCAGACGCAGCATTACG
TCGCGATCGATCGATCGGCATAGCAGCATCCCAGTCAGTCATATGCATAGTCGATCGACGTCAGTCATGAGATCGTACGAAATACGTAGCTGATCGACGTCAGTCAGACTGATCGATCGGATTCAGTCACGATGCATGCTAGCAAAGTCAGCGCATGCTAGCTACGTAGTCAGTACTGCATGCAGTACGTACGTAGACGTCAGTCAGTCAGTCATGATGCTAGCTAGCTACGTCACAGTCAGTCATGACTGACTGACTGACTGCAGTCAGTCATCGATACGTAGCTAGCTACGTCAGTCATGCAGTCAGTCATTGATCATGATCATACTGCCCAAAAAACGACTTA
Bioinformatics = sequence analysis
Biological sequences come in two types: DNA and protein
DNA has a four-letter alphabet
Protein has a 20-letter alphabet
Sequences are an abstraction. As such, they are treated abstractly...
Sequence alignment
Phylogenetic trees
Gene finding
Data mining
"A free-standing reality"
ATGCATCAGGACTAGCTATCAGAATC
Any DNA sequence REPRESENTS a physical object, and some DNA sequences translate to protein sequences, which also REPRESENT physical objects.
behind the abstraction...
Sequence = Structure
Structure = Function
Function = Life
__________________
Sequence = Life
The protein folding problem
Unfolded Folded
This happens spontaneously (in water).
Sequence = Structure
The problem with the protein folding problem.
Number of amino acid residues in a typical protein: 100
Approximate number of degrees of freedom per residue: 3
Estimated total number of conformations (= 3^100): ~5 x 10^47
Time required to fold if all conformations are sampled at the rate of 1 per 10^-15 s: ~10^25 y
Time since the Big Bang: ~13 x 10^9 y
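The arithmetic behind this estimate can be checked directly. The sketch below uses only the round inputs from the slide (100 residues, 3 states per residue, one conformation sampled per 10^-15 s); the variable names and the 365.25-day year are assumptions made for the illustration.

```python
# Levinthal-style estimate using the round numbers from the slide.
residues = 100
states_per_residue = 3
sampling_rate_per_s = 1e15                 # one conformation per 10^-15 s
seconds_per_year = 365.25 * 24 * 3600      # assumed year length
age_of_universe_years = 13e9

conformations = states_per_residue ** residues          # exact integer 3^100
search_time_years = conformations / sampling_rate_per_s / seconds_per_year

print(f"conformations        ~ {conformations:.1e}")
print(f"exhaustive search    ~ {search_time_years:.1e} years")
print(f"multiples of the age of the universe ~ "
      f"{search_time_years / age_of_universe_years:.1e}")
```

Whatever rounding is used, an exhaustive search comes out many orders of magnitude longer than the age of the universe, which is the point of the slide.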
pathways
folding pathways must exist
The protein is unfolded...
...something happens first...
...then something else happens.
Early events eliminate alternative pathways
What happens first?
Helix/coil transition: 10-100 ns
Beta-hairpin: 0.1-1.0 s
Transient intermediates: < 1 ms
Equilibrium: 0.001-1.0 s
Local structure usually isn't stable
Helices and turns form quickly but just as quickly fall apart.
Most short peptides (<20aa) do not show structural stability in NMR studies.
Exceptions: A few short peptides have been shown to be conformationally stable (for example, Met-enkephalin = YGGFM).
Interesting parallels between bioinformatics and semantics
language      proteins
letters       amino acids
words         motifs
phrases       modules
sentences     whole proteins
meaning       structure
literature    genome
grammar       folding??
[Slide background: the same wall of raw DNA sequence shown on the title slide]
Does anyone know the words?
What if we use the enormous database of protein sequences to find recurrent short patterns?
Those short patterns would be the words.
But, are they "meaningful words"?
(Does the sequence correlate with the local structure?)
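As a rough illustration of "finding recurrent short patterns", the sketch below counts exact k-letter substrings in a toy set of sequences. This is a simplification: the approach described in this talk clusters sequence profiles rather than exact strings. The function name `count_kmers` and the four-sequence toy database are assumptions made for the example.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every overlapping substring of length k across all sequences."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts

# Toy "database": four short segments, for illustration only.
toy_database = ["VIVAANRSA", "VIVSAARTA", "VIASAVRTA", "VIVDAGRSA"]
counts = count_kmers(toy_database, 3)

# The most frequent 3-letter "words" in the toy database.
for word, n in counts.most_common(3):
    print(word, n)
```

Whether such frequent words are "meaningful" is a separate question: it requires checking them against the local structure, which exact string counting alone cannot do.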
Maybe protein folding pathways can be found in protein sequence "grammar"
1. Letters
2. Words
3. Phrases
4. Sentences
Amino acids can be grouped
     A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
A    4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -2
C        9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -2
D            6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -3
E                5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -2
F                    6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1   3
G                        6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -3
H                            8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2   2
I                                4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1
K                                    5  -2  -1   0  -1   1   2   0  -1  -2  -3  -2
L                                        4   2  -3  -3  -2  -2  -2  -1   1  -2  -1
M                                            5  -2  -2   0  -1  -1  -1   1  -1  -1
N                                                6  -2   0   0   1   0  -3  -4  -2
P                                                    7  -1  -2  -1  -1  -2  -4  -3
Q                                                        5   1   0  -1  -2  -2  -1
R                                                            5  -1  -1  -3  -3  -2
S                                                                4   1  -2  -3  -2
T                                                                    5   0  -2  -2
V                                                                        4  -3  -1
W                                                                           11   2
Y                                                                                7
Sequence alignments show evolutionary diversity
VIVAANRSA
VIVSAARTA
VIASAVRTA
VIVDAGRSA
VIASGVRTA
VIVAAKRTA
VIVSAVRTP
VIVSAARTA
VIVSAVRTP
VIVDAGRTA
VIVDAGRTA
VIVSGARTP
VIVDFGRTP
VIVSATRTP
VIVSATRTP
VIVGALRTP
VIVSATRTP
VIVSATRTP
VIASAARTA
VIVDAIRTP
VIVAAYRTA
VIVSAARTP
VIVDAIRTP
VIVSAVRTA
VIVAAHRTA
••••••
Sequence alignment
Sequence profile
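\[ P_{ij} = \frac{\sum_{k \in \mathrm{seqs}} w_k \, \delta(s_{kj} = aa_i)}{\sum_{k \in \mathrm{seqs}} w_k} \]
where s_kj is the residue at position j of sequence k, w_k is the weight of sequence k, and δ is 1 when the residue is amino acid aa_i and 0 otherwise.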
Sequence profiles are condensed sequence alignments
Red = high prob ratio (> 3)
Green = background prob ratio (~1)
Blue = low prob ratio (< 1/3)
(Gribskov)
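A minimal sketch of how an alignment is condensed into a profile: for each column, take the weighted frequency of each of the 20 amino acids. The function name `sequence_profile`, the equal default weights, and the five-row toy alignment are assumptions for illustration; the probability ratios against background frequencies that the color legend refers to are not computed here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sequence_profile(aligned, weights=None):
    """Return profile[j][aa]: weighted frequency of amino acid aa in column j."""
    if weights is None:
        weights = [1.0] * len(aligned)      # assumption: all sequences weigh the same
    total = sum(weights)
    profile = []
    for j in range(len(aligned[0])):
        column = {aa: 0.0 for aa in AMINO_ACIDS}
        for seq, w in zip(aligned, weights):
            column[seq[j]] += w             # add this sequence's weight to its residue
        profile.append({aa: column[aa] / total for aa in AMINO_ACIDS})
    return profile

# Toy alignment: five gap-free rows of equal length.
alignment = ["VIVAANRSA", "VIVSAARTA", "VIASAVRTA", "VIVDAGRSA", "VIASGVRTA"]
profile = sequence_profile(alignment)

# Column 3 (index 2) mixes V and A; print only the non-zero entries.
print({aa: p for aa, p in profile[2].items() if p > 0})
```

Sequence weights are typically used to down-weight near-duplicate sequences so that a heavily sampled family does not dominate the profile; how the weights are chosen is left open here.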
"distance" between two points =
\[ \sum_{l=1}^{1} \sum_{i=1}^{20} \left| P_{ijl} - P_{ikl} \right| \]
each dot represents a different 1-residue profile
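The distance used here can be sketched as a sum of absolute differences over the 20 amino acid probabilities (a city-block distance). For profile segments, the same sum is extended over the aligned positions. The helper names and the example profiles below are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def make_profile(**probs):
    """Build a 20-entry profile; unspecified amino acids get probability 0."""
    return {aa: probs.get(aa, 0.0) for aa in AMINO_ACIDS}

def profile_distance(p, q):
    """City-block distance between two 1-residue profiles."""
    return sum(abs(p[aa] - q[aa]) for aa in AMINO_ACIDS)

def segment_distance(seg_p, seg_q):
    """Same distance summed over the positions of two equal-length segments."""
    return sum(profile_distance(p, q) for p, q in zip(seg_p, seg_q))

all_v = make_profile(V=1.0)
mostly_v = make_profile(V=0.6, A=0.4)
all_k = make_profile(K=1.0)

print(profile_distance(all_v, mostly_v))   # similar positions: small distance
print(profile_distance(all_v, all_k))      # unrelated positions: maximal distance
print(segment_distance([all_v, all_k], [mostly_v, all_k]))
```

With this measure two profiles that share no amino acids are at distance 2, the largest possible value for a single position.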
Clustering profiles
Resulting clusters:
K Q RA S TA CS W Y FA P GD E NI L V MH Y
did it!
"Kmeans" clustering
Protein sequence grammar
1. Letters: amino acid profiles
2. Words
3. Phrases
4. Sentences
"distance" from j to k =
\[ \sum_{l=1}^{L} \sum_{i=1}^{20} \left| P_{ijl} - P_{ikl} \right| \]
each dot represents a different short profile
~120,000 segments
[Figure: two color-coded profile segments; x-axis: position (26-32); y-axis: amino acids (AA) grouped as C, FLIV, WYM, AQNT, SHRK, EDPG]
Clustering profile segments, length L
~800 clusters for each L
L = 3 to 15
Learning the structure of each sequence cluster
the database
Search the database for the 400 nearest neighbors
remove all cluster members that do not conform with the paradigm
profile of cluster
cluster of nearest neighbors
After convergence, a cross-validation test is done.
I-sites library of sequence structure motifs
1000's of sequence clusters
supervised learning
Cross-validation
262 motifs
Number of different motifs after removing register variants: 31
Example of a motif
Sequences that match sequence profile....
...tend to have the same structure...
...and this is it.
Clustering finds previously known sequence-structure motifs
amphipathic α-helix
amphipathic β-strand
α-helix N-cap
p•nppn• nS••En•p •n•n
Many new motifs are found
diverging type-2 turn
Serine hairpin
Type-I hairpin
Frayed helix
Proline helix C-cap
alpha-alpha corner
glycine helix N-cap
Why are there motifs in proteins?
Ancient conserved regions?
Selection for stability?
Folding initiation sites?
Structural features seem to drive clustering.
1. glycine at strained angles
2. conserved sidechain contacts
3. negative design against alternative structures (helix)
No. Motif                     Number of  Pattern sites / 100 positions Average                    Boundaries of conserved
                              clusters   overall   confid. > 0.60      mda°   dme    rmsd (len)   non-polar residues
1   Amphipathic α-helix       13         3.1       0.9                 56     0.71   0.78 (15)    1-4-8, 1-5-8
2   Non-polar α-helix         6          0.9       0.12                54     0.58   0.40 (11)    1-4-8, 1-5-8
3   Schellman cap Type 1      6          0.09      0.07                81     1.01   1.02 (15)    1-6-9-11
4   Schellman cap Type 2      10         0.3       0.14                76     0.94   0.94 (15)    1-6-8-9
5   Proline α-helix C cap     10         1.8       0.6                 92     1.07   0.89 (13)    1-2-5-8
6   Frayed helix              2          1.2       0.13                75     0.96   0.69 (15)    1-5-9-13
7   Helix N capping box       10         1.1       0.6                 99     0.95   0.65 (15)    1-6-9-13
8   Amphipathic β-strand      8          6.8       2.1                 89     0.87   0.87 (6)     1-3, 1-3-5
9   Hydrophobic β-strand      5          2.3       0.3                 101    0.91   0.91 (7)     1-2-3
10  β-bulge                   2          0.5       0.15                100    0.97   0.78 (7)     1-4-6
11  Serine β-hairpin          4          1.3       0.3                 94     0.76   0.81 (9)     1-8
12  Type-I hairpin            2          0.07      0.04                80     0.94   1.23 (13)    1-7-8
13  Diverging Type-II turn    4          0.3       0.14                87     1.04   1.00 (9)     1-7-9
I-sites sequence patterns are distinct
(Bystroff & Baker, J. Mol. Biol, 1998)
A hypothesis:
I-sites sequence motifs are folding initiation sites.
• The I-sites sequence patterns are mutually exclusive.
• Each I-sites motif is found in a variety of contexts.
• Local structure forms fast.
• Early-folding units 'initiate' folding.
One reason this hypothesis may be wrong:
Database statistics may reflect bias in the data.
Alpha helices may fold by packing interactions.
Dots show positions of alpha-carbons relative to the amphipathic helix motif. The hydrophobic side is up.
maybe not...
How do we test this hypothesis?
• See if I-sites peptides fold in isolation from the rest of the protein.
... by NMR.
... by simulation.
[Figure, panels (a)-(d): color-coded profiles; x-axis: position (26-32 and 1-7); y-axis: amino acids (AA) grouped as C, FLIV, WYM, AQNT, SHRK, EDPG; color scale from ≥1.0 to ≤-1]
NMR structure of a 7-residue I-sites motif in isolation
(Yi et al, J. Mol. Biol, 1998)
diverging turn
Partial literature search of peptide NMR structures
I-sites motif        Authors    Date
glycine helix cap    Viguera    1995
serine hairpin       Blanco     1994
Type-I hairpin       de Alba    1996
diverging turn       Sieber     1996
Molecular dynamics... is a cheap substitute for an NMR spectrometer.
What is MD?
• A simulation of the dynamic behavior of the molecule in water, using "first principles."
Advantages?
• You can observe the system directly.
Disadvantages?
• It's not a real system, just an approximation.
Helical peptide simulations
Sequences:
AAALDRMR AALEALLR AANRSHMP AARYKFIE ADFKAAVA AFDGETEI AKELVVVY
AKGVETAD ARFTKRLG ATLEEKLN CNGGHWIA DAVTRYWP DEAIDAYI DELTRHIR
DYVRSKIA EDLVERLK EELKQALR EEMVSKLK EKLLESLE EKPFGTSY EQIKAAVK
FHMYFMLR FSVMNDAS FYSSYVYL GQLMALKQ HNLIEAFE IEHTLNEK IQNGDWTF
KAAIAQLR KKYRPETD KNPDNVVG KPMGPLLV KQAHPDLK KQDKHYGY KSYLRSLR
LDLHQTYL NAVWAAIK NETHSGRK NFLEVGEY NPVKESRH PAIISAAE PLQHHNLL
PRDANTSH QDDARKLM QGIIDKLD QKMKTYFN QTLAQLSV RDFEERMN RIILDRHR
RLLLKAYR RPIARMLS RVLGRDLF SCDVKFPI TEVMKRLV TLNEKRIL YASLRSLV
YESHVGCR
• AMBER (parm94) force field.
• Randomly chosen natural sequences.
• Initially extended.
• 800-900 waters added.
• Ions added (Na, Cl).
• 7-30 ns at 340 K.
The MD scheme
• Select random peptides and predict how much helix they will have, using the I-sites motif pattern.
• Run LONG simulations.
• Test to see whether they have reached equilibrium.
• If they have, find out how much of the time the peptide spent in a helical state. (by cluster analysis)
• Does the fraction helix correlate with the prediction?
Cluster analysis of trajectories
1) Define a node for every step in the trajectory, keep the backbone angles (θ).
2) For each node, draw an edge to every other node for which max(Δθ) < 60°.
3) The node with the most edges defines the first cluster. Remove it and all its neighbors. Then the node with the most edges is the second cluster, and so on.
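The three steps above can be sketched as a greedy neighbor-count clustering. Several details are assumptions not stated on the slide: angle differences are taken on a periodic 360° scale, edge counts are recomputed among the nodes still unassigned, and ties go to the earliest frame. The function names and the five toy frames are hypothetical.

```python
def angular_difference(a, b):
    """Smallest difference between two angles in degrees (periodic, assumed)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def are_neighbors(frame_a, frame_b, cutoff=60.0):
    """Step 2: an edge exists when the largest angle difference is below the cutoff."""
    return max(angular_difference(a, b) for a, b in zip(frame_a, frame_b)) < cutoff

def cluster_trajectory(frames, cutoff=60.0):
    """Steps 1-3: return a list of (center_index, sorted_member_indices)."""
    remaining = set(range(len(frames)))
    neighbors = {
        i: {j for j in remaining if j != i and are_neighbors(frames[i], frames[j], cutoff)}
        for i in remaining
    }
    clusters = []
    while remaining:
        # Node with the most edges among unassigned nodes; ties go to the lowest index.
        center = max(sorted(remaining), key=lambda i: len(neighbors[i] & remaining))
        members = {center} | (neighbors[center] & remaining)
        clusters.append((center, sorted(members)))
        remaining -= members
    return clusters

# Toy trajectory: each frame is a tuple of backbone angles in degrees.
frames = [
    (-60, -45, -60, -45),     # helix-like
    (-65, -40, -58, -50),     # helix-like
    (-120, 130, -120, 130),   # extended
    (-55, -50, -62, -41),     # helix-like
    (-115, 125, -125, 135),   # extended
]
for center, members in cluster_trajectory(frames):
    print("center frame", center, "members", members)
```

Building the full neighbor table is quadratic in the number of frames, so a long trajectory would probably need to be subsampled first.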
Clusters in conformational space
Our criteria for good clustering: no two clusters look alike, and no cluster looks like two.
RPIARMLS
This is what a trajectory looks like if it has reached equilibrium
[Plot: cluster number vs. time (ns)]
Both halves of the trajectory have about the same distribution.
This is what it looks like if it has not.
[Plot: cluster number vs. time (ns)]
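One way to make "both halves have about the same distribution" quantitative is to compare the cluster populations of the first and second half of the trajectory. The statistic used below (total variation distance) is an assumption chosen for the sketch, not necessarily the test used in this work, and the two label sequences are made up.

```python
from collections import Counter

def cluster_fractions(labels):
    """Fraction of frames spent in each cluster."""
    counts = Counter(labels)
    return {cluster: n / len(labels) for cluster, n in counts.items()}

def half_trajectory_difference(labels):
    """Total variation distance between the two halves: 0 = identical, 1 = disjoint."""
    mid = len(labels) // 2
    first = cluster_fractions(labels[:mid])
    second = cluster_fractions(labels[mid:])
    clusters = set(first) | set(second)
    return 0.5 * sum(abs(first.get(c, 0.0) - second.get(c, 0.0)) for c in clusters)

# Cluster number per frame for two hypothetical trajectories.
equilibrated = [1, 2, 1, 3, 2, 1, 2, 1, 3, 1, 2, 1]
drifting = [1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3]

print(half_trajectory_difference(equilibrated))
print(half_trajectory_difference(drifting))
```

A value near 0 is consistent with equilibrium, while a value near 1 means the second half visits clusters the first half never saw. Where to put the threshold is a judgment call.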
NAIIQELE movie
A rough energy landscape.
There is a correlation between I-sites sequence score and the simulations
r = 0.48 (all peptides)
r = 0.61 (trajectories > 20 ns long)
Sampling of sequence space
72 peptides were simulated. Is this a representative sample of the space of amphipathic helix sequences?
I-sites motif 72 peptides, weighted by %helix
72 peptides, unweighted
What does this mean?
The MD experiment separates the local effects from the non-local effects on helix formation.
In the simulation, there are only local interactions.
So the propensity for amphipathic sequences to form helix is mostly intrinsic.
Outliers
• Simulation too short.
We see only meta-stable states.
• I-sites scoring method is missing something.
Using additive probabilities ignores statistical dependence between different positions.
• Part-helix was not counted as helix in this study.
Helix caps are competing motifs.
(+-) and (-+) look just like (++) and (--)
QVFMRIME (a helix in 1dldA)
Predicted to be helix with confidence = 0.86
Zero helix found in 17ns trajectory. What does it fold into?
an outlier
Protein sequence grammar
1. Letters: Amino acid profiles
2. Words: I-sites motifs
3. Phrases:
4. Sentences
Protein sequence grammar
1. Letters: Amino acid profiles
2. Words: I-sites motifs
3. Phrases: a hidden Markov model
4. Sentences
Motif “grammar”?
Arrangement of I-sites motifs in proteins is highly non-random
helix
helix cap
beta strand
beta turn
The dependencies can be modeled as a Markov chain
the
mailman
dog
bit
kicked back
The dog bit the mailman. The mailman kicked the dog back.
Markov model
Sequence data
Stochastic output: The dog back. The mailman kicked the mailman kicked the dog bit the dog bit the dog bit the mailman kicked the dog. ...
How to make a Markov chain
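A minimal sketch of the idea, using the two training sentences from the slide: count which word follows which, then walk the chain at random. The `<start>` and `<end>` markers, the function names, and the fixed random seed are assumptions added so the example is self-contained.

```python
import random
from collections import defaultdict

def train_markov_chain(sentences):
    """Count word-to-word transitions; <start>/<end> markers are an added convention."""
    transitions = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = ["<start>"] + sentence.lower().rstrip(".").split() + ["<end>"]
        for current, following in zip(words, words[1:]):
            transitions[current][following] += 1
    return transitions

def generate(transitions, rng, max_words=20):
    """Random walk through the chain, choosing each next word by its observed count."""
    word, output = "<start>", []
    while len(output) < max_words:
        choices = transitions[word]
        word = rng.choices(list(choices), weights=list(choices.values()))[0]
        if word == "<end>":
            break
        output.append(word)
    return " ".join(output)

training = ["The dog bit the mailman.", "The mailman kicked the dog back."]
chain = train_markov_chain(training)

rng = random.Random(0)   # fixed seed so repeated runs give the same walk
for _ in range(3):
    print(generate(chain, rng))
```

Because each step depends only on the current word, the output is locally plausible but can wander, which is how nonsense such as "the dog back" arises.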
A "hidden" Markov model
What's "hidden" about it?
An HMM is a Markov chain where the meaning of the Markov state is probabilistic.
the
mailman 0.5, postman 0.5
dog
bit 0.3, attacked 0.7
kicked 0.6, hit 0.4
back
The dog bit the mailman. The mailman kicked the dog back. The dog attacked the postman. The postman hit the dog.
hidden Markov model
Sequence alignment data
Stochastic output: The dog back. The mailman kicked the postman kicked the dog bit the dog bit the dog attacked the mailman kicked the dog. ...
How to make a hidden Markov chain
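To illustrate what is "hidden", the sketch below separates the state from the word it emits. The emission probabilities are the ones shown on the slide; the state names, the `END` marker, and all transition probabilities are assumptions, since the slide does not give them.

```python
import random

# Emission probabilities as shown on the slide; state names are hypothetical.
EMISSIONS = {
    "article": {"the": 1.0},
    "carrier": {"mailman": 0.5, "postman": 0.5},
    "dog": {"dog": 1.0},
    "attack": {"bit": 0.3, "attacked": 0.7},
    "strike": {"kicked": 0.6, "hit": 0.4},
    "back": {"back": 1.0},
}
# Transition probabilities are NOT given on the slide; these values are assumed.
TRANSITIONS = {
    "article": {"dog": 0.5, "carrier": 0.5},
    "carrier": {"strike": 0.5, "END": 0.5},
    "dog": {"attack": 0.4, "back": 0.3, "END": 0.3},
    "attack": {"article": 1.0},
    "strike": {"article": 1.0},
    "back": {"END": 1.0},
}

def draw(distribution, rng):
    """Pick one outcome from a {outcome: probability} table."""
    outcomes = list(distribution)
    return rng.choices(outcomes, weights=[distribution[o] for o in outcomes])[0]

def sample_sentence(rng, max_words=12):
    """Walk the hidden states; at each state emit one word, then move on."""
    state, words = "article", []
    while state != "END" and len(words) < max_words:
        words.append(draw(EMISSIONS[state], rng))
        state = draw(TRANSITIONS[state], rng)
    return " ".join(words)

rng = random.Random(1)   # fixed seed so repeated runs give the same sentences
for _ in range(3):
    print(sample_sentence(rng))
```

An observer sees only the words. The same word sequence could in principle come from different state paths, which is why recovering the states is an inference problem.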
One Markov state from HMMSTR
a_hi
a_ij
a_ik
regions
sequence profile
One state emits one letter of each type (b, r, d, c)
probabilistic meaning of the state
amino acid symbols: b_i = {ACDEF...}
structure symbols: r_i = {HGEBdblLex}, d_i = {HST}, c_i = {mnhd...}
previous letter(s)
next letter(s)
Constructing a HMM by aligning motifs
Related motifs, branched model.
[Figure labels: helix; Type-1 C-cap; Type-2 C-cap]
Merging many motifs into one HMM
HMMSTR
Hidden Markov Model for local protein STRucture
282 nodes
317 transitions
Unified model for 31 distinct sequence-structure motifs
(Bystroff & Baker, J. Mol. Biol., 2000)
Variations on a motif theme are modeled as parallel paths
Multiple state-pathways for the helix N-cap motif
Common sub-graphs represent common sub-structures
These peptide segments have the same state sequence (except shaded residues)
How an HMM works
\[ P(Q \mid S) = \pi_{q_1}(S_1) \prod_{i=2}^{N} a_{q_{i-1} q_i} \, b_{q_i}(S_i) \]
initiation probability
transition probability
emission probability
We have S (the sequence). We want Q (the 1D structure), and P (how well S fits Q)
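A small sketch of both quantities: the probability of one given state path, built from initiation, transition and emission probabilities, and the best-scoring path found by dynamic programming (the Viterbi algorithm, named here as a standard technique, not necessarily the decoding used in HMMSTR). The two-state model, its numbers, and the reading of the first factor as initiation probability times emission probability are all assumptions for illustration.

```python
def path_probability(states, symbols, initial, transition, emission):
    """Score one path: initiation * emission, then transition * emission per step."""
    prob = initial[states[0]] * emission[states[0]][symbols[0]]
    for i in range(1, len(symbols)):
        prob *= transition[states[i - 1]][states[i]] * emission[states[i]][symbols[i]]
    return prob

def viterbi(symbols, initial, transition, emission):
    """Return (best_path, best_probability) over all state paths."""
    states = list(initial)
    best = {q: (initial[q] * emission[q][symbols[0]], [q]) for q in states}
    for symbol in symbols[1:]:
        updated = {}
        for q in states:
            # Best predecessor for state q at this position.
            prev = max(states, key=lambda p: best[p][0] * transition[p][q])
            prob = best[prev][0] * transition[prev][q] * emission[q][symbol]
            updated[q] = (prob, best[prev][1] + [q])
        best = updated
    final = max(states, key=lambda q: best[q][0])
    return best[final][1], best[final][0]

# Hypothetical toy model: H = helix-like state, C = coil-like state.
INITIAL = {"H": 0.5, "C": 0.5}
TRANSITION = {"H": {"H": 0.8, "C": 0.2}, "C": {"H": 0.3, "C": 0.7}}
EMISSION = {
    "H": {"A": 0.4, "L": 0.4, "G": 0.1, "P": 0.1},
    "C": {"A": 0.15, "L": 0.15, "G": 0.35, "P": 0.35},
}

sequence = "ALLAGPG"
best_path, best_prob = viterbi(sequence, INITIAL, TRANSITION, EMISSION)
print("".join(best_path), best_prob)
```

For long sequences the product underflows, so a practical version would probably sum log-probabilities instead of multiplying raw probabilities.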
3-state secondary structure prediction
74.9% correct
74.6% correct
Predicting super-secondary context
Results are for the independent test set.
Fully-automated tertiary structure prediction
(1) Find homologues in the database (Psi-Blast)
(2) Predict local structure (HMMSTR)
(3) Assemble fragments (ROSETTA, D.Baker)
sequence
structure
Protocol used for CAFASP2 experiment (2000)
Rosetta ab initio
Scoring function: Bayesian classification of pairwise secondary structure contact types.
Search function: Monte Carlo fragment insertion. A move consists of selecting a fragment at random from a set of local structure predictions. Coordinates are re-generated after swapping in the new fragment.
(Simons et al, PNAS, 1997)
CASP3 Prediction results for Target 56 : DNA helicase
Predicted structure of 66-residue fragment (23-88)
True structure of same fragment
CAFASP Prediction results for Target 122: 1GEQ Tryptophan Synthase
Predicted 97-residue fragment
True structure of same fragment
Protein sequence grammar
1. Alphabet: amino acid profile
2. Words: I-sites motifs
3. Phrases: HMMSTR pathways
4. Sentences: contact maps
the next step...
In progress:
Data mining of contact maps
HMMSTR predictions
Protein sequences + contact maps
Association-rule mining (M. Zaki)
Rules for tertiary contacts
Predicting tertiary contacts
Contact predictions for 2igd
Overall: 20% coverage with 20% accuracy
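For reference, the two numbers can be computed as follows, assuming the usual definitions: coverage is the fraction of true contacts that were predicted, and accuracy is the fraction of predicted contacts that are true. These definitions, the function name, and the toy contact lists are assumptions; they are not the 2igd data.

```python
def coverage_and_accuracy(predicted, true):
    """Return (coverage, accuracy) for sets of residue-pair contacts."""
    predicted, true = set(predicted), set(true)
    correct = predicted & true
    coverage = len(correct) / len(true) if true else 0.0
    accuracy = len(correct) / len(predicted) if predicted else 0.0
    return coverage, accuracy

# Hypothetical contacts as (residue_i, residue_j) pairs.
true_contacts = [(1, 10), (2, 9), (3, 8), (4, 7), (12, 20)]
predicted_contacts = [(1, 10), (5, 15), (6, 14), (2, 11), (13, 19)]

coverage, accuracy = coverage_and_accuracy(predicted_contacts, true_contacts)
print(f"coverage = {coverage:.0%}, accuracy = {accuracy:.0%}")
```

The two measures trade off against each other: predicting more contacts tends to raise coverage and lower accuracy.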
Can the 2D map be translated to 3D?
I-sites/HMMSTR collaborators
David Baker          U. Washington
Karen Han            UCSF
Vestienn Thorsson    U. Washington
Qian Yi              U. Washington
Edward Thayer        Zymogenetics
Shekhar Garde        RPI
Mohammed Zaki        RPI
Susan Baxter         Wadsworth (->Novartis)
Chip Lawrence        Wadsworth/RPI
Bobbie Jo Webb       Wadsworth
Kim Simons           U. Washington (->Harvard)
Bystroff Lab
Yu Shao
Xin Yuan
Jerry Huang
isites.bio.rpi.edu