bayesian optimization for synthetic gene...
TRANSCRIPT
![Page 1: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/1.jpg)
Bayesian Optimization for Synthetic GeneDesign
Javier Gonzalez
DDHL workshop, Nottingham, UK, 2015
![Page 2: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/2.jpg)
At the University of Sheffield:
I Department of Computer Science.
I Department of Chemical and Biological Engineering.
Neil D. Lawrence, David James, Joseph Longworth, Paul Dobson,Josselin Noirel and Mark Dickman
![Page 3: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/3.jpg)
The big picture
I 8 of the 10 top-selling drugs in are biologics (monoclonalantibodies) used in rheumatology, dermatology, and varioustypes of Cancer.
I Huge market of $73 billion just in Europe.
I Growing interest in the availability of biosimilars.
![Page 4: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/4.jpg)
New drug production paradigm
I Use mammalian cells to make protein products.I Control the ability of the cell-factory to use synthetic DNA.
Cornerstone of modern biotechnology: Design DNA code that willbest enable the cell-factory to operate most efficiently.
Synthetic genes!
![Page 5: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/5.jpg)
How does a cell work?
A mammalian cell in numbers:
I approx. 20,000 genes able to produce 20,000 proteins.
I A few of them are of therapeutical interest.
I The average gene length is 7902 bases pairs (A,T G, C).
I Millions of molecules interactions.
Central dogma of systems biology
’Natural’ genes are not optimized to maximize protein production.
![Page 6: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/6.jpg)
Why can we rewrite the genetic code?
Considerations
I Different gene sequences may encode the same protein...
I ...but the sequence affects the synthesis efficiency.
I The codon usage is the key (codon = triplet of bases).
The genetic code is redundant:
UUGACA = UUGACU
Both genes encode the same protein.
![Page 7: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/7.jpg)
Key question
Given a protein of interest, which is the recombinant genesequence that will enable the cell to produce it in the most
efficient way?
I Average mamalian gene: 7000 nucleotides.
I Consider a gene with coding region of 900 nucleotides: 300codons.
I Assume only pairs of synonymous codons.
I ≈ 2300 ≈ 2× 1090 possible recombinant gene alternatives (inthe order of the number of atoms in the universe).
![Page 8: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/8.jpg)
Key question
Given a protein of interest, which is the recombinant genesequence that will enable the cell to produce it in the most
efficient way?
I Average mamalian gene: 7000 nucleotides.
I Consider a gene with coding region of 900 nucleotides: 300codons.
I Assume only pairs of synonymous codons.
I ≈ 2300 ≈ 2× 1090 possible recombinant gene alternatives (inthe order of the number of atoms in the universe).
![Page 9: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/9.jpg)
Key question
Given a protein of interest, which is the recombinant genesequence that will enable the cell to produce it in the most
efficient way?
I Average mamalian gene: 7000 nucleotides.
I Consider a gene with coding region of 900 nucleotides: 300codons.
I Assume only pairs of synonymous codons.
I ≈ 2300 ≈ 2× 1090 possible recombinant gene alternatives (inthe order of the number of atoms in the universe).
![Page 10: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/10.jpg)
Key question
Given a protein of interest, which is the recombinant genesequence that will enable the cell to produce it in the most
efficient way?
I Average mamalian gene: 7000 nucleotides.
I Consider a gene with coding region of 900 nucleotides: 300codons.
I Assume only pairs of synonymous codons.
I ≈ 2300 ≈ 2× 1090 possible recombinant gene alternatives (inthe order of the number of atoms in the universe).
![Page 11: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/11.jpg)
Key question
Given a protein of interest, which is the recombinant genesequence that will enable the cell to produce it in the most
efficient way?
I Average mamalian gene: 7000 nucleotides.
I Consider a gene with coding region of 900 nucleotides: 300codons.
I Assume only pairs of synonymous codons.
I ≈ 2300 ≈ 2× 1090 possible recombinant gene alternatives (inthe order of the number of atoms in the universe).
![Page 12: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/12.jpg)
Machine Learning challenges
I Very complex cell behaviour. Limited prior knowledge.
I Multi-task optimization problem: increase cell efficiency,maintain cell survival, control protein and mRNA stability.
I Lab experiments are very expensive.
I Gene tests can be run in parallel.
I The design space is defined in terms of long string sequences.I Alternative: gene features, high dimensional problem.
![Page 13: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/13.jpg)
Tools: Gaussian processes
Gaussian Process: Probability density over functions, such thateach linear finite dimensional restriction is multivariate Gaussian.
0 5 10 15
t−4
−3
−2
−1
0
1
2
3
4
f
I Fully parametrized by a covariance function K .I Close-form posterior under Gaussian likelihoods.
![Page 14: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/14.jpg)
Tools: Gaussian processes
Gaussian Process: Probability density over functions, such thateach linear finite dimensional restriction is multivariate Gaussian.
0 5 10 15
t−4
−3
−2
−1
0
1
2
3
4
f
I Fully parametrized by a covariance function K .I Close-form posterior under Gaussian likelihoods.
![Page 15: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/15.jpg)
Tools: Gaussian process
Gaussian Processes: Probability density over functions, such thateach linear finite dimensional restriction is multivariate Gaussian.
0 5 10 15
t−1.5
−1.0
−0.5
0.0
0.5
1.0
1.5
2.0
2.5
f
I Fully parametrized by a covariance function K .I Close-form posterior under Gaussian likelihoods.
![Page 16: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/16.jpg)
Tools: Gaussian processes
Gaussian Process: Probability density over functions, such thateach linear finite dimensional restriction is multivariate Gaussian.
0 5 10 15
t−1.5
−1.0
−0.5
0.0
0.5
1.0
1.5
2.0
2.5
I Fully parametrized by a covariance function K .I Close-form posterior under Gaussian likelihoods.
![Page 17: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/17.jpg)
Tools: Bayesian Optimization
BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].
Example: x∗ = arg min[0,1] f (x)?
![Page 18: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/18.jpg)
Tools: Bayesian Optimization
BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].
Example: x∗ = arg min[0,1] f (x)?
Expected Improvement: Ex [max(0, f (xmin)− f (x))]
![Page 19: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/19.jpg)
Tools: Bayesian Optimization
BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].
Example: x∗ = arg min[0,1] f (x)?
Expected Improvement: Ex [max(0, f (xmin)− f (x))]
![Page 20: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/20.jpg)
Tools: Bayesian Optimization
BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].
Example: x∗ = arg min[0,1] f (x)?
Expected Improvement: Ex [max(0, f (xmin)− f (x))]
![Page 21: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/21.jpg)
Tools: Bayesian Optimization
BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].
Example: x∗ = arg min[0,1] f (x)?
Expected Improvement: Ex [max(0, f (xmin)− f (x))]
![Page 22: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/22.jpg)
Tools: Bayesian Optimization
BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].
Example: x∗ = arg min[0,1] f (x)?
Expected Improvement: Ex [max(0, f (xmin)− f (x))]
![Page 23: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/23.jpg)
Tools: Bayesian Optimization
BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].
Example: x∗ = arg min[0,1] f (x)?
Expected Improvement: Ex [max(0, f (xmin)− f (x))]
![Page 24: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/24.jpg)
Tools: Bayesian Optimization
BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].
Example: x∗ = arg min[0,1] f (x)?
Expected Improvement: Ex [max(0, f (xmin)− f (x))]
![Page 25: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/25.jpg)
Tools: Bayesian Optimization
BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].
Example: x∗ = arg min[0,1] f (x)?
Expected Improvement: Ex [max(0, f (xmin)− f (x))]
![Page 26: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/26.jpg)
Tools: Bayesian Optimization
BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].
Example: x∗ = arg min[0,1] f (x)?
Expected Improvement: Ex [max(0, f (xmin)− f (x))]
![Page 27: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/27.jpg)
How to design a synthetic gene?
A good model is crucial
Gene sequence features → protein production efficiency.
Bayesian Optimization principles for gene design
do:
1. Build a GP model as an emulator of the cell behavior.
2. Obtain a set of gene design rules (features optimization).
3. Design one/many new gene/s coherent with the design rules.
4. Test genes in the lab (get new data).
until the gene is optimized (or the budget is over...).
![Page 28: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/28.jpg)
Model as an emulator of the cell behavior
Model inputsFeatures (xi ) extracted gene se-quences (si ): codon frequency, cai,gene length, folding energy, etc.
Model outputsTranslation and trasncriptions ratesf := (fα, fβ).
Model typeMulti-output Gaussian process f ≈GP(m,K) where K is a corregion-alization covariance for the two-output model (+ SE with ARD).
The correlation in the outputs help!
![Page 29: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/29.jpg)
Obtaining optimal gene design rules
Maximize the averaged Expected improvement for both outputs[Swersky et al. 2013]
α(x) = σ(x)(−uΦ(−u) + φ(u))
where u = (ymax − m(x))/σ(x) and
m(x) =1
2
∑l=α,β
f∗(x), σ2(x) =1
22
∑l ,l ′=α,β
(K∗(x, x))l ,l ′ .
A batch method is used when several experiments can be run inparallel
![Page 30: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/30.jpg)
Designing new genes coherent with the optimal design rules
Simulating-matching approach:
1. Simulate genes ‘coherent’ with the target (same aminoacids).
2. Extract features.
3. Rank synthetic genes according to their similarity with the‘optimal’ design rules.
Ranking criterion: eval(s|x?) = ∑pj=1 wj |xj − x?j |
I x?: optimal gene design rules.
I s, xj generated ‘synonyms sequence’ and its features.
I wj : weights of the p features (inverse lengthscales of themodel covariance).
![Page 31: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/31.jpg)
Experiments
I Optimization gene designs in mammalian cells.
I Dataset in Schwanhausser et al. (2011) for 3810 genes rates.Sequences were extracted fromhttp:wet-labpic/www.ensembl.org.
I 250 features involving 5’UTR, 3’UTR and coding region.
I Gaussian process with ARD and coregionalized outputs.
I Selection of 10 random difficult-to-express genes (average logratio < 1.5).
I 10,000 random ‘synonyms sequences’ generated from eachgene.
![Page 32: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/32.jpg)
Features importance
![Page 33: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/33.jpg)
Predictions
The model is able to predict translation rates:
![Page 34: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/34.jpg)
Results for 10 low-expressed genes
Results from simulation: currently testing the results in the lab!
![Page 35: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/35.jpg)
Results for the CMV-pp65 Recombinant Protein
Alternative model: translation rates + mRNA half life.
s half-life CMV mRNA half-lifeCMV mRNA half-life
0 2000 4000 6000 8000 10000
Gene design
4
5
6
7
8
9
10
Tra
nsl
ati
on r
ate
(lo
g2
)
CMV - mRNA translation
Average prediction
1*stevd
Wild type
Predicted results for the 10,000 simulated sequences.
![Page 36: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/36.jpg)
Results for the CMV-pp65 Recombinant Protein
2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8
Half-life(log2)
3
4
5
6
7
8
9
10
Tra
nsl
ati
on r
ate
(lo
g2)
Pareto frontier
CMV - Optimal genes
Optimal genes
In silico designed genes
Multi-objective optimization problem.
![Page 37: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/37.jpg)
Prediction/design web tool
Web app for gene design based on
I GPy (https://github.com/SheffieldML/GPy).
I GPyOpt (https://github.com/SheffieldML/GPyOpt).
![Page 38: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department](https://reader036.vdocuments.us/reader036/viewer/2022071422/611be17ec84ec901c30cfde3/html5/thumbnails/38.jpg)
Final remarks
I Bayesian optimization is a promising technique to designsynthetic genes: reduces drastically the number of neededexperiments.
I Very important aspect of the problem → to have a goodsurrogate model for the cell behavior.
I Currently, working out a model with more outputs, such asthe protein stability and cell survival.
I Alternative approach: focus on the direct optimization of thesequences. Combinatorial problem.