THE BUILDING BLOCKS OF LIFE. BUILT FOR YOU
Putting Engineering back into Protein Engineering
Jun Liao, UC Santa Cruz
Manfred K. Warmuth, UC Santa Cruz
Jeremy Minshull, DNA 2.0
Protein Engineering Current Paradigms
1. Mechanism-based (rational): detailed structural analysis
2. Empiricism-based (non-rational): library-based
Mechanism-Based Protein Engineering
Based on thermodynamic principles
• Calculations are approximate
  – calculation cost
  – structures are really not rigid (MDS)
• Calculations are primarily able to predict binding
  – catalysis is a special case of binding to a transition state
• Changes in amino acids are designed based on these principles
  – very small numbers (<5) of new proteins are synthesized and tested
Empiricism-Based Protein Engineering
• Uses similar principles to evolution
  – make many variants
  – screen to find those with the best properties
• No mechanistic understanding needed
• Produces large numbers of variants (>1,000), which are very difficult / expensive to screen for practically relevant properties
[Figure: proteins related to wild type; simulated crossover; new variants]
The Key Challenge in Protein Engineering
Reality: what we need is not what we assay for…
• Molecular mechanistic models (do not model activity)
• High-throughput screens (surrogate assays)
What we want in Protein Engineering
Wish List
• No need to develop surrogate assay
• Variants are tested directly under application conditions
• Rapid process
Requirements
• Identification of appropriate amino acid substitutions
• Design and synthesis of information-rich variants
• Interpretation of quantitative functional data using machine learning techniques
Protein Engineering using Machine Learning
Starting point: Select a protein with some correct initial properties.
Initial design: a) Choose substitutions. b) Design an initial variant set (<50) containing those substitutions.
Reality check: Synthesize and test the variant set for function(s) of interest.
Machine learning: Model the effect of sequence changes on function(s) of interest.
New design: Propose a new variant set (<50) based on the model.
Iterate
End: Select the best variant(s).
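The design / test / learn cycle above can be sketched as a small loop. This is only an illustration under stated assumptions: the "reality check" is replaced by a toy additive landscape with noise (`TRUE_EFFECT`, `measure`), the model is a crude difference of means (`fit`), and the new design simply mutates the predicted best variant (`propose`). All names and parameters are hypothetical, not the authors' implementation.

```python
import random

N_SWITCHES = 19  # binary substitution sites, as in the proteinase K example
rng = random.Random(0)

# Hypothetical hidden "reality": one additive effect per substitution.
TRUE_EFFECT = [rng.uniform(-1.0, 1.0) for _ in range(N_SWITCHES)]

def measure(variant):
    """Stand-in for synthesizing and assaying a variant (toy landscape plus noise)."""
    return sum(e for e, on in zip(TRUE_EFFECT, variant) if on) + rng.gauss(0, 0.1)

def random_variant(n_changes=6):
    """Initial design: a variant carrying n_changes randomly chosen substitutions."""
    on = set(rng.sample(range(N_SWITCHES), n_changes))
    return tuple(int(i in on) for i in range(N_SWITCHES))

def fit(data):
    """Crude model: mean activity with a substitution minus mean activity without it."""
    weights = []
    for i in range(N_SWITCHES):
        with_sub = [y for v, y in data if v[i]]
        without = [y for v, y in data if not v[i]]
        if with_sub and without:
            weights.append(sum(with_sub) / len(with_sub) - sum(without) / len(without))
        else:
            weights.append(0.0)
    return weights

def propose(weights, tested, size):
    """New design: flip up to 3 sites of the predicted best variant, skipping tested ones."""
    best = tuple(int(w > 0) for w in weights)
    new, attempts = [], 0
    while len(new) < size and attempts < 10000:
        attempts += 1
        cand = list(best)
        for i in rng.sample(range(N_SWITCHES), rng.randint(0, 3)):
            cand[i] = 1 - cand[i]
        cand = tuple(cand)
        if cand not in tested and cand not in new:
            new.append(cand)
    return new

def campaign(rounds=3, set_size=24):
    initial = []
    while len(initial) < set_size:
        v = random_variant()
        if v not in initial:
            initial.append(v)
    data = [(v, measure(v)) for v in initial]              # reality check
    for _ in range(rounds - 1):                            # iterate
        weights = fit(data)                                # machine learning
        tested = {v for v, _ in data}
        data += [(v, measure(v)) for v in propose(weights, tested, set_size)]
    return max(data, key=lambda pair: pair[1]), len(data)  # end: best variant

(best_variant, best_activity), n_tested = campaign()
print(f"tested {n_tested} variants, best measured activity {best_activity:.2f}")
```

The point of the sketch is the shape of the loop: every round tests a small set, refits the model on all data so far, and uses the model to choose the next set.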
Engineering of Proteinase K
• Long-term goal of engineering proteinase K to degrade polylactic acid
• Member of the serine protease family
  – Large amounts of phylogenetic and sequence information available
• Several different measurable activities available for optimization
Expert System for Substitution Selection
Expert system:
– Calculation of 9 independent scores that measure changes that have succeeded in other places in Nature
– Weight and combine scores to pick best changes
[Figure: proteins related to proteinase K]
19 switches = search space of 2^19 = 524,288 ≈ 500,000
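A minimal sketch of "weight and combine scores", assuming a simple weighted average. The slides do not say what the 9 scores are or how they are weighted, so the score values, the equal weights, and the function names below are all hypothetical. The substitution labels (S107D, M145F, K208H) follow the wt/var residues in the variant design table.

```python
N_SCORES = 9

def combined_score(scores, weights):
    """Weighted average of the individual scores for one substitution."""
    if len(scores) != len(weights):
        raise ValueError("need one weight per score")
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def pick_substitutions(candidates, weights, n_pick):
    """Rank candidate substitutions by combined score and keep the top n_pick."""
    ranked = sorted(
        candidates,
        key=lambda name: combined_score(candidates[name], weights),
        reverse=True,
    )
    return ranked[:n_pick]

weights = [1.0] * N_SCORES  # assumption: equal weights
candidates = {              # hypothetical scores in [0, 1]
    "S107D": [0.9, 0.8, 0.7, 0.9, 0.6, 0.8, 0.7, 0.9, 0.8],
    "M145F": [0.4, 0.5, 0.6, 0.3, 0.5, 0.4, 0.6, 0.5, 0.4],
    "K208H": [0.7, 0.6, 0.8, 0.7, 0.9, 0.6, 0.7, 0.8, 0.7],
}
chosen = pick_substitutions(candidates, weights, n_pick=2)
print(chosen)  # ['S107D', 'K208H']

# Each chosen substitution is a binary switch (wild-type residue or substitution),
# so 19 switches give 2**19 possible combinations.
search_space = 2 ** 19
print(search_space)  # 524288
```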
Finding Optima in Complex Landscapes:Design of Experiment
• Changing 1 amino acid at a time
• Making multiple changes simultaneously
[Figure: two plots of sampled points (x) on axes Aa 1 and Aa 2, one per strategy]
…Now try to envision doing this not with 2, but 200 amino acids / dimensions
Design of Initial Proteinase K Variants

Position  95  97  107 123 132 138 145 151 167 180 194 199 208 236 237 265 267 273 293 299 310 332 337 355
var       C   S   D   A   V   A   F   A   I   I   S   S   H   V   N   S   I   T   A   C   K   R   N   S
wt        N   P   S   S   I   E   M   Y   V   L   Y   A   K   A   R   P   V   S   G   L   I   K   S   P

Substitutions carried by each variant, in position order (the column alignment of the original table did not survive extraction):
1   S N T A R S
2   A A A K R S
3   C F I S N T
4   S A I S V I
5   D V H S C N
6   S D I V N K
7   A A S H S S
8   C S I A C R
9   C V A F I H
10  V N T A R S
11  S A S C K N
12  C D A I S N
13  A A I S H C
14  S F N T A K
15  V V S I R S
16  S A S V C S
17  C D I I A K
18  F N S I R N
19  A V A S H T
20  A H V I A C
21  D V A F N S
22  S I S S S K
23  C A I N T R
24  C
25  S C
26  S S C
27  V
28  D A F
29  F I I K S
30  F A S K R
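One way to build an initial set in which every variant carries several substitutions at once, and every substitution is tested in roughly the same number of variants, is sketched below. The balancing rule (always pick the least-used sites, with random tie-breaking) is an assumption for illustration; the slides do not describe the authors' design algorithm. The sizes mirror the table: 24 positions, 30 variants, and six substitutions per variant, as in most rows.

```python
import random

def balanced_design(n_sites, n_variants, changes_per_variant, seed=0):
    """Return variants as sorted tuples of site indices that carry a substitution."""
    rng = random.Random(seed)
    counts = [0] * n_sites  # how often each site has been used so far
    variants, seen = [], set()
    while len(variants) < n_variants:
        # Prefer the least-used sites; break ties randomly.
        order = sorted(range(n_sites), key=lambda i: (counts[i], rng.random()))
        chosen = tuple(sorted(order[:changes_per_variant]))
        if chosen in seen:
            # Fall back to a uniformly random draw to avoid repeating a variant.
            chosen = tuple(sorted(rng.sample(range(n_sites), changes_per_variant)))
            if chosen in seen:
                continue
        seen.add(chosen)
        variants.append(chosen)
        for i in chosen:
            counts[i] += 1
    return variants, counts

design, usage = balanced_design(n_sites=24, n_variants=30, changes_per_variant=6)
for number, sites in enumerate(design[:3], start=1):
    print(number, sites)
print("times each site is substituted:", usage)
```

Spreading each substitution over many different backgrounds is what makes a small set informative: the effect of one substitution can later be separated from the effects of the others.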
Back to Proteinase K
First proteinase K dataset
[Figure: first proteinase K dataset. y-axis: Activity (factor increase relative to wt), scale 0 to 3]
Sequence-Activity Modeling: How Does it Work?
1. Represent the sequence as a matrix

Seq1  AGRWGIGAYHKLIMA
Seq2  AGRTGVGVYHKLIMA
Seq3  AGRWGIGVYHRLIMA
Seq4  AGRTGVGAYHRLIMA

becomes

      T   W   V   I   V   A   R   K
x     x1  x2  x3  x4  x5  x6  x7  x8
Seq1  0   1   0   1   0   1   0   1
Seq2  1   0   1   0   1   0   0   1
Seq3  0   1   0   1   1   0   1   0
Seq4  1   0   1   0   0   1   1   0

2. Measure the activity or activities of interest under the final application conditions

3. y = c1x1 + c2x2 + c3x3 + c4x4 + … + cixi
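The encoding in step 1 and the linear model in step 3 can be reproduced in a few lines. The four sequences and the column order (T W V I V A R K) are taken from the slide; the coefficient values are hypothetical and only show how a prediction is formed.

```python
SEQS = {
    "Seq1": "AGRWGIGAYHKLIMA",
    "Seq2": "AGRTGVGVYHKLIMA",
    "Seq3": "AGRWGIGVYHRLIMA",
    "Seq4": "AGRTGVGAYHRLIMA",
}

# (0-based position, residue) for each column x1..x8, in the slide's order.
FEATURES = [(3, "T"), (3, "W"), (5, "V"), (5, "I"),
            (7, "V"), (7, "A"), (10, "R"), (10, "K")]

def encode(seq, features=FEATURES):
    """x_i is 1 if the sequence has that residue at that position, else 0."""
    return [int(seq[pos] == res) for pos, res in features]

def predict(x, coeffs):
    """Linear model y = c1*x1 + c2*x2 + ... + ci*xi."""
    return sum(c * xi for c, xi in zip(coeffs, x))

matrix = {name: encode(seq) for name, seq in SEQS.items()}
for name, row in matrix.items():
    print(name, row)
# Seq1 [0, 1, 0, 1, 0, 1, 0, 1]
# Seq2 [1, 0, 1, 0, 1, 0, 0, 1]
# Seq3 [0, 1, 0, 1, 1, 0, 1, 0]
# Seq4 [1, 0, 1, 0, 0, 1, 1, 0]

coeffs = [0.1, 0.4, 0.0, 0.3, 0.2, 0.5, 0.6, 0.1]  # hypothetical c1..c8
y_seq1 = predict(matrix["Seq1"], coeffs)
print(f"predicted activity of Seq1: {y_seq1:.2f}")
```

In practice the coefficients are not chosen by hand; they are fitted to the measured activities from step 2.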
Assessing the Proteinase K Sequence-Activity Relationship
[Figure: scatter plot with axes Predicted activity and Measured activity; wt marked]
y = c1x1 + c2x2 + c3x3 + c4x4 + … + cixi
Learning Methods
• Variety of regression methods
  – Ridge Regression & Lasso
  – SVM Regression & LPSVM Regression
  – Matching Loss Regression & One-norm Matching Loss Regression
  – Partial Least Squares Regression
  – LPBoost Regression
• Use bagging to improve the prediction stability
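As an illustration of one method from the list combined with bagging, here is a dependency-free sketch of ridge regression fitted on bootstrap resamples. The training data is synthetic (`true_w` and the noise level are made up), and the regularization strength `lam` and number of bags are arbitrary choices; none of these values come from the slides.

```python
import random

def solve(a, b):
    """Solve a*w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w

def ridge_fit(xs, ys, lam=1.0):
    """Ridge regression without intercept: solve (X'X + lam*I) w = X'y."""
    n = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) + (lam if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(n)]
    return solve(xtx, xty)

def bagged_ridge(xs, ys, n_bags=50, lam=1.0, seed=0):
    """Fit ridge on bootstrap resamples; return mean and std of each weight."""
    rng = random.Random(seed)
    n, dim = len(xs), len(xs[0])
    fits = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        fits.append(ridge_fit([xs[i] for i in idx], [ys[i] for i in idx], lam))
    mean = [sum(f[j] for f in fits) / n_bags for j in range(dim)]
    std = [(sum((f[j] - mean[j]) ** 2 for f in fits) / n_bags) ** 0.5
           for j in range(dim)]
    return mean, std

# Synthetic example: 40 variants, 6 binary substitution features.
rng = random.Random(1)
true_w = [1.0, -0.5, 0.8, 0.0, 0.3, -1.0]  # hypothetical "true" effects
xs = [[rng.randint(0, 1) for _ in true_w] for _ in range(40)]
ys = [sum(w * xi for w, xi in zip(true_w, x)) + rng.gauss(0, 0.05) for x in xs]

mean_w, std_w = bagged_ridge(xs, ys, n_bags=50, lam=0.1)
for j, (m, s) in enumerate(zip(mean_w, std_w), start=1):
    print(f"c{j}: {m:+.2f} ± {s:.2f}")
```

Averaging over resamples stabilizes the weights, and the spread across bags gives a rough measure of how much each weight estimate can be trusted.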
Variants Design I
• Main issue: Exploitation vs. Exploration
• Optimum design (Exploitation)
  – Take the combination of substitutions predicted to have maximal activity
  – Also consider
     • Substitution frequency in the dataset
     • Variation of weight estimation
  – Used in 2nd & 3rd iterations
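With a linear model, the combination predicted to be most active simply includes every substitution with a positive weight. The sketch below adds the two extra considerations in one possible way: a substitution is kept only if its weight stays positive after subtracting `k` standard deviations, and only if it was seen in at least `min_count` tested variants. This specific rule, its thresholds, and the example numbers are assumptions; the slides state only that frequency and weight variation are considered.

```python
def optimum_design(mean_w, std_w, counts, k=1.0, min_count=3):
    """Return a 0/1 vector: which substitutions to include in the exploitation variant."""
    design = []
    for mean, std, n_seen in zip(mean_w, std_w, counts):
        confident = mean - k * std > 0       # weight is positive beyond its uncertainty
        well_sampled = n_seen >= min_count   # substitution was tested often enough
        design.append(int(confident and well_sampled))
    return design

# Hypothetical weights, weight variation, and substitution frequencies.
mean_w = [0.8, 0.3, -0.4, 0.6, 0.1]
std_w = [0.1, 0.5, 0.1, 0.2, 0.05]
counts = [10, 8, 9, 1, 7]

naive = [int(m > 0) for m in mean_w]
cautious = optimum_design(mean_w, std_w, counts)
print(naive)     # [1, 1, 0, 1, 1]
print(cautious)  # [1, 0, 0, 0, 1]
```

In this example the second substitution is dropped because its weight is too uncertain, and the fourth because it appeared in only one tested variant.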
Variants Design II
• Diversity design (Exploration)
  – Calculate the combination of substitutions predicted to have maximal activity that is also
     • No more than 5 changes from a sequence that has already been tested
     • No closer than 3 changes from a sequence that has already been tested or selected for synthesis
  – Used in 2nd iteration
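The two distance constraints can be read as Hamming-distance filters on binary variant vectors. The greedy sketch below ranks all combinations by predicted activity and keeps those that satisfy both constraints. The greedy order, the exhaustive enumeration, and the example weights are assumptions for illustration; the example uses 8 sites so the enumeration stays small.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two variants differ."""
    return sum(x != y for x, y in zip(a, b))

def diversity_design(weights, tested, n_new, max_dist=5, min_dist=3):
    """Pick high-scoring variants within max_dist of a tested variant and at least
    min_dist away from every tested or already selected variant."""
    def score(variant):
        return sum(w for w, on in zip(weights, variant) if on)

    candidates = sorted(product((0, 1), repeat=len(weights)), key=score, reverse=True)
    selected = []
    for cand in candidates:
        if len(selected) == n_new:
            break
        near_tested = any(hamming(cand, t) <= max_dist for t in tested)
        far_enough = all(hamming(cand, s) >= min_dist for s in list(tested) + selected)
        if near_tested and far_enough:
            selected.append(cand)
    return selected

weights = [0.9, -0.2, 0.5, 0.3, -0.6, 0.7, 0.1, -0.4]  # hypothetical model weights
tested = [
    (0, 0, 0, 0, 0, 0, 0, 0),
    (1, 1, 0, 0, 0, 0, 0, 0),
    (0, 0, 1, 1, 0, 0, 0, 0),
]
new_set = diversity_design(weights, tested, n_new=4)
for variant in new_set:
    print(variant)
```

The upper bound keeps new variants in a region where the model has data to support its predictions; the lower bound prevents spending synthesis effort on near-duplicates.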
Three Iterations of Activity Engineering
[Figure: activity relative to wild type (y-axis, 0 to 50) for variants in order synthesized (x-axis, 0 to 120). 1st set: 34 variants; 2nd set: 24 variants; 3rd set: 38 variants; wild-type marked]
Only 58 variants were tested to allow design of the fourth set, which contained
• 3 variants 20-30x improved over wild type
• 50% of variants more active than the best of previous sets
• 70% of variants more active than wild type
• 3-11 changes found in variants better than WT
Improving Activity
[Figure: Activity Improvement. Activity (pmol/s/ml) and half life at 68°C (s) for variants v501, v502, v503, v505, v513, v515, v518, v526, v544, v545, v551, v556, v557, v558, v560 and NS9]
Position  107 123 132 145 151 167 180 194 199 208 237 265 267 273 293 310 332 337 355
WT        S   S   I   M   Y   V   L   Y   A   K   R   P   V   S   G   I   K   S   P

Substitutions in each variant, in position order (the column alignment of the original table did not survive extraction):
501  A A
502  A H A
503  A H A R
505  A H T A R N
513  A I H T A R N
515  V A A
518  V A I I T A
526  A V A I T A
544  V A H T A N
545  V A H T A R N
551  A T A
556  V A I T A
557  A H A R N
558  V A H T A R
560  A V A I H T A N
Variants are Improved in Multiple Properties
Conclusions
• Machine learning
  – Making a very small number of variants (58) allows a productive search of a total space with 500,000 possible combinations
• Synthetic Biology
  – Recent advances in gene synthesis methods were essential for this type of exploration
The Future
• Proteins are the building blocks of life with a wide array of applications (therapeutics, diagnostics, industrial catalysts)
• Finding a reliable mechanism for optimizing proteins for human applications would be an amazing feat
• We steal ideas about how proteins evolve from nature, but optimize proteins outside their in vivo constraints (the proteins don’t have to be compatible with life)