Bayesian Optimization for Synthetic Gene Design
Javier Gonzalez
DDHL workshop, Nottingham, UK, 2015
At the University of Sheffield:
- Department of Computer Science.
- Department of Chemical and Biological Engineering.
Neil D. Lawrence, David James, Joseph Longworth, Paul Dobson, Josselin Noirel and Mark Dickman
The big picture
- 8 of the 10 top-selling drugs are biologics (monoclonal antibodies), used in rheumatology, dermatology, and various types of cancer.
- Huge market of $73 billion in Europe alone.
- Growing interest in the availability of biosimilars.
New drug production paradigm
- Use mammalian cells to make protein products.
- Control the ability of the cell-factory to use synthetic DNA.
Cornerstone of modern biotechnology: design the DNA code that will best enable the cell-factory to operate most efficiently.
Synthetic genes!
How does a cell work?
A mammalian cell in numbers:
- Approx. 20,000 genes able to produce 20,000 proteins.
- A few of them are of therapeutic interest.
- The average gene length is 7,902 base pairs (A, T, G, C).
- Millions of molecular interactions.
Central dogma of systems biology
’Natural’ genes are not optimized to maximize protein production.
Why can we rewrite the genetic code?
Considerations
- Different gene sequences may encode the same protein...
- ...but the sequence affects the synthesis efficiency.
- Codon usage is the key (codon = triplet of bases).
The genetic code is redundant:
UUG ACA = UUG ACU
Both genes encode the same protein.
Key question
Given a protein of interest, which is the recombinant gene sequence that will enable the cell to produce it in the most efficient way?
- Average mammalian gene: 7,000 nucleotides.
- Consider a gene with a coding region of 900 nucleotides: 300 codons.
- Assume only pairs of synonymous codons.
- ≈ 2^300 ≈ 2 × 10^90 possible recombinant gene alternatives (on the order of the number of atoms in the universe).
Machine Learning challenges
- Very complex cell behaviour. Limited prior knowledge.
- Multi-task optimization problem: increase cell efficiency, maintain cell survival, control protein and mRNA stability.
- Lab experiments are very expensive.
- Gene tests can be run in parallel.
- The design space is defined in terms of long string sequences.
- Alternative: gene features, a high-dimensional problem.
Tools: Gaussian processes
Gaussian process: a probability density over functions, such that each finite-dimensional linear restriction is multivariate Gaussian.
[Figure: sample functions f(t) drawn from a Gaussian process over t ∈ [0, 15].]
- Fully parametrized by a covariance function K.
- Closed-form posterior under Gaussian likelihoods.
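The closed-form posterior mentioned above fits in a few lines of NumPy. This is an illustrative sketch (not code from the talk); the RBF kernel parameters and noise level are assumptions:

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared-exponential (SE/RBF) covariance between two sets of 1-D inputs.
    d = X1[:, None] - X2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(X, y, Xs, noise=1e-2):
    # Closed-form GP posterior mean/variance under a Gaussian likelihood.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    Kss = rbf(Xs, Xs)
    alpha = np.linalg.solve(K, y)
    mean = Ks.T @ alpha
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, np.diag(cov)

# Toy data on the t-range shown in the slide's figure.
X = np.array([0.0, 5.0, 10.0, 15.0])
y = np.sin(X)
Xs = np.linspace(0.0, 15.0, 5)
mean, var = gp_posterior(X, y, Xs)
```

Near the training inputs the posterior mean tracks the observations and the variance shrinks; far from data it reverts to the prior.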
Tools: Bayesian Optimization
BO: a heuristic to reduce the number of evaluations in optimization problems [Mockus, 1978; Snoek et al., 2012].
Example: x* = arg min_{x ∈ [0,1]} f(x)?
Expected Improvement: E_x[max(0, f(x_min) − f(x))]
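For a minimization problem like the example above, the Expected Improvement under a Gaussian posterior has a well-known closed form. A stdlib-only sketch (the function name is mine, not the speaker's):

```python
import math

def expected_improvement(mean, std, best):
    # EI for minimization: E[max(0, f(x_min) - f(x))] under a Gaussian
    # posterior N(mean, std^2); 'best' is the current minimum f(x_min).
    if std <= 0.0:
        return max(0.0, best - mean)
    u = (best - mean) / std
    Phi = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))        # standard normal CDF
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    return std * (u * Phi + phi)
```

Points with a lower predicted mean or a higher predictive uncertainty receive a larger EI, which is what lets BO trade off exploitation against exploration.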
How to design a synthetic gene?
A good model is crucial
Gene sequence features → protein production efficiency.
Bayesian Optimization principles for gene design
do:
1. Build a GP model as an emulator of the cell behavior.
2. Obtain a set of gene design rules (feature optimization).
3. Design one or many new genes consistent with the design rules.
4. Test the genes in the lab (get new data).
until the gene is optimized (or the budget is exhausted...).
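The do/until loop above can be sketched end to end on a toy discrete design space. Everything here (the candidate grid, the RBF emulator, the toy 'lab experiment' objective) is an illustrative stand-in for the real gene-design pipeline, not the authors' code:

```python
import numpy as np
from math import erf, sqrt, pi

rng = np.random.default_rng(0)

def rbf(A, B, ell=0.3):
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_fit_predict(X, y, Xs, noise=1e-4):
    # Step 1: closed-form GP emulator evaluated on the candidate grid.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mean = Ks.T @ np.linalg.solve(K, y)
    var = np.clip(1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0), 1e-12, None)
    return mean, np.sqrt(var)

def ei_max(mean, std, best):
    # Expected Improvement for maximization.
    u = (mean - best) / std
    Phi = np.array([0.5 * (1.0 + erf(ui / sqrt(2.0))) for ui in u])
    phi = np.exp(-0.5 * u ** 2) / sqrt(2.0 * pi)
    return std * (u * Phi + phi)

candidates = np.linspace(0.0, 1.0, 101)   # stand-in for the gene design space

def lab_experiment(x):                     # stand-in for an expensive lab test
    return -(x - 0.3) ** 2                 # toy 'cell efficiency', best at 0.3

# do: model -> design rules -> new design -> lab test ... until budget is over
X = list(rng.choice(candidates, size=3, replace=False))
y = [lab_experiment(x) for x in X]
for _ in range(10):                        # budget: 10 extra 'experiments'
    mean, std = gp_fit_predict(np.array(X), np.array(y), candidates)
    acq = ei_max(mean, std, max(y))        # steps 2-3: pick the best candidate
    x_next = candidates[int(np.argmax(acq))]
    X.append(x_next)                       # step 4: run it and add the data
    y.append(lab_experiment(x_next))
best_design = X[int(np.argmax(y))]
```

With only 13 evaluations the loop homes in on the region of the toy optimum, which is the point of the slide: far fewer experiments than exhaustive search.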
Model as an emulator of the cell behavior
Model inputs: features (x_i) extracted from gene sequences (s_i): codon frequency, CAI, gene length, folding energy, etc.
Model outputs: translation and transcription rates f := (f_α, f_β).
Model type: multi-output Gaussian process f ∼ GP(m, K), where K is a coregionalization covariance for the two-output model (+ SE kernel with ARD).
The correlation between the outputs helps!
Obtaining optimal gene design rules
Maximize the averaged Expected Improvement for both outputs [Swersky et al., 2013]:

α(x) = σ(x) (−u Φ(−u) + φ(u)),

where u = (y_max − m(x)) / σ(x), and

m(x) = (1/2) Σ_{l=α,β} f*_l(x),   σ²(x) = (1/2²) Σ_{l,l'=α,β} (K*(x, x))_{l,l'}.

A batch method is used when several experiments can be run in parallel.
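The averaged statistics above combine the two outputs' predictive means and the full 2×2 predictive covariance. A small sketch of α(x), exactly as defined on the slide (function names are mine):

```python
import numpy as np
from math import erf, sqrt, exp, pi

def averaged_statistics(mean_two, cov_two):
    # mean_two: length-2 predictive means (f*_alpha(x), f*_beta(x)).
    # cov_two: 2x2 predictive covariance (K*(x, x))_{l,l'}.
    m = 0.5 * np.sum(mean_two)        # m(x) = (1/2) sum_l f*_l(x)
    var = np.sum(cov_two) / 4.0       # sigma^2(x) = (1/2^2) sum_{l,l'} (...)
    return m, var

def averaged_ei(mean_two, cov_two, y_max):
    # alpha(x) = sigma(x) * (-u * Phi(-u) + phi(u)), u = (y_max - m) / sigma.
    m, var = averaged_statistics(mean_two, cov_two)
    s = sqrt(var)
    u = (y_max - m) / s
    Phi_neg_u = 0.5 * (1.0 + erf(-u / sqrt(2.0)))   # Phi(-u)
    phi_u = exp(-0.5 * u * u) / sqrt(2.0 * pi)       # phi(u)
    return s * (-u * Phi_neg_u + phi_u)
```

Note that summing the whole 2×2 covariance lets the cross-output correlation (the off-diagonal terms) inflate or shrink the averaged uncertainty, which is where the coregionalized model pays off.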
Designing new genes coherent with the optimal design rules
Simulate-and-match approach:
1. Simulate genes 'coherent' with the target (same amino acids).
2. Extract features.
3. Rank the synthetic genes by their similarity to the 'optimal' design rules.
Ranking criterion: eval(s | x*) = Σ_{j=1}^{p} w_j |x_j − x*_j|
- x*: optimal gene design rules.
- s, x_j: a generated 'synonymous sequence' and its features.
- w_j: weights of the p features (inverse lengthscales of the model covariance).
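The ranking criterion is a weighted L1 distance to the optimal design rules. A pure-Python sketch; the three features, their weights, and the candidate names are hypothetical:

```python
def design_score(features, optimal, weights):
    # eval(s | x*) = sum_j w_j * |x_j - x*_j|; lower = closer to the rules.
    return sum(w * abs(x - xs) for w, x, xs in zip(weights, features, optimal))

# Hypothetical 3-feature example. The weights stand in for the inverse
# lengthscales of the fitted covariance (more relevant feature -> larger weight).
optimal = [0.8, 120.0, -3.5]          # 'optimal' design rules x*
weights = [2.0, 0.01, 0.5]
candidates = {
    "seq_a": [0.70, 150.0, -3.0],
    "seq_b": [0.75, 125.0, -3.4],
}
ranked = sorted(candidates, key=lambda s: design_score(candidates[s], optimal, weights))
```

The top-ranked synonymous sequences are the ones sent to the lab.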
Experiments
- Optimization of gene designs in mammalian cells.
- Dataset from Schwanhäusser et al. (2011), with rates for 3,810 genes. Sequences were extracted from http://www.ensembl.org.
- 250 features involving the 5'UTR, 3'UTR and coding region.
- Gaussian process with ARD and coregionalized outputs.
- Selection of 10 random difficult-to-express genes (average log ratio < 1.5).
- 10,000 random 'synonymous sequences' generated from each gene.
Feature importance
Predictions
The model is able to predict translation rates:
Results for 10 low-expressed genes
Results from simulation: currently testing the results in the lab!
Results for the CMV-pp65 Recombinant Protein
Alternative model: translation rates + mRNA half-life.
[Figure: predicted CMV mRNA translation rate (log2) across the 10,000 gene designs, showing the average prediction ± 1 standard deviation and the wild type.]
Predicted results for the 10,000 simulated sequences.
Results for the CMV-pp65 Recombinant Protein
[Figure: translation rate (log2) vs. half-life (log2) for the in silico designed CMV genes; the Pareto frontier marks the optimal genes.]
Multi-objective optimization problem.
Prediction/design web tool
Web app for gene design based on:
- GPy (https://github.com/SheffieldML/GPy).
- GPyOpt (https://github.com/SheffieldML/GPyOpt).
Final remarks
- Bayesian optimization is a promising technique to design synthetic genes: it drastically reduces the number of experiments needed.
- A very important aspect of the problem: having a good surrogate model for the cell behavior.
- Currently working on a model with more outputs, such as protein stability and cell survival.
- Alternative approach: focus on the direct optimization of the sequences, a combinatorial problem.