
Bayesian Optimization for Synthetic Gene Design

Javier Gonzalez

DDHL workshop, Nottingham, UK, 2015

At the University of Sheffield:

- Department of Computer Science.

- Department of Chemical and Biological Engineering.

Neil D. Lawrence, David James, Joseph Longworth, Paul Dobson, Josselin Noirel and Mark Dickman

The big picture

- 8 of the 10 top-selling drugs are biologics (monoclonal antibodies) used in rheumatology, dermatology, and various types of cancer.

- Huge market of $73 billion in Europe alone.

- Growing interest in the availability of biosimilars.

New drug production paradigm

- Use mammalian cells to make protein products.

- Control the ability of the cell-factory to use synthetic DNA.

Cornerstone of modern biotechnology: design DNA code that will best enable the cell-factory to operate most efficiently.

Synthetic genes!

How does a cell work?

A mammalian cell in numbers:

- Approx. 20,000 genes able to produce 20,000 proteins.

- A few of them are of therapeutic interest.

- The average gene length is 7,902 base pairs (A, T, G, C).

- Millions of molecular interactions.

Central dogma of systems biology

'Natural' genes are not optimized to maximize protein production.

Why can we rewrite the genetic code?

Considerations

- Different gene sequences may encode the same protein...

- ...but the sequence affects the synthesis efficiency.

- The codon usage is the key (codon = triplet of bases).

The genetic code is redundant:

UUGACA = UUGACU

Both genes encode the same protein.

Key question

Given a protein of interest, which is the recombinant gene sequence that will enable the cell to produce it in the most efficient way?

- Average mammalian gene: ~7,000 nucleotides.

- Consider a gene with a coding region of 900 nucleotides: 300 codons.

- Assume only pairs of synonymous codons.

- ≈ 2^300 ≈ 2 × 10^90 possible recombinant gene alternatives (on the order of the number of atoms in the universe); a quick check is sketched below.
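A quick sanity check of the 2^300 figure above (a minimal Python sketch; the 300-codon coding region and the two-synonyms-per-codon assumption are taken from the slide):

```python
import math

# Design space size for a 300-codon coding region where each codon
# has two synonymous alternatives, as assumed on the slide.
n_codons = 300
alternatives = 2 ** n_codons

print(f"~10^{math.log10(alternatives):.1f} variants")  # ~10^90.3, i.e. about 2 x 10^90
```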


Machine Learning challenges

- Very complex cell behaviour; limited prior knowledge.

- Multi-task optimization problem: increase cell efficiency, maintain cell survival, control protein and mRNA stability.

- Lab experiments are very expensive.

- Gene tests can be run in parallel.

- The design space is defined in terms of long string sequences.

- Alternative: gene features, a high-dimensional problem.

Tools: Gaussian processes

Gaussian process: a probability density over functions, such that each linear finite-dimensional restriction is multivariate Gaussian.

[Figure: sample functions f(t) drawn from a Gaussian process.]

- Fully parametrized by a covariance function K.

- Closed-form posterior under Gaussian likelihoods (a minimal NumPy sketch of this posterior follows below).
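As a concrete illustration of the closed-form posterior, here is a minimal NumPy sketch under a squared-exponential covariance and a Gaussian likelihood (toy hyperparameter values; not the model used in the talk):

```python
import numpy as np

def gp_posterior(X, y, Xs, lengthscale=1.0, variance=1.0, noise=0.1):
    """Closed-form GP posterior mean/covariance at test points Xs,
    with a squared-exponential covariance and Gaussian noise."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

    K = k(X, X) + noise * np.eye(len(X))      # covariance of the observations
    Ks, Kss = k(X, Xs), k(Xs, Xs)             # cross- and test-covariances
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha                       # posterior mean
    v = np.linalg.solve(L, Ks)
    cov = Kss - v.T @ v                       # posterior covariance
    return mean, cov

# toy usage on the kind of 1-D example shown in the figure
X = np.linspace(0, 15, 8)[:, None]
y = np.sin(X)
Xs = np.linspace(0, 15, 100)[:, None]
mean, cov = gp_posterior(X, y, Xs)
```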

Tools: Bayesian Optimization

BO: a heuristic to reduce the number of evaluations in optimization problems [Mockus, 1978; Snoek et al., 2012].

Example: find x* = arg min_{x ∈ [0,1]} f(x).

Expected Improvement: EI(x) = E[max(0, f(x_min) − f(x))] (a minimal NumPy sketch follows below).
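A minimal NumPy/SciPy sketch of this acquisition in closed form, assuming a Gaussian posterior f(x) ~ N(mu, sigma^2) at the candidate point (toy code, not the GPyOpt implementation):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_min):
    """Closed-form EI for minimization: E[max(0, f(x_min) - f(x))]
    when f(x) ~ N(mu, sigma^2) under the GP posterior."""
    sigma = np.maximum(sigma, 1e-12)            # avoid division by zero
    u = (f_min - mu) / sigma
    return sigma * (u * norm.cdf(u) + norm.pdf(u))
```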


How to design a synthetic gene?

A good model is crucial

Gene sequence features → protein production efficiency.

Bayesian Optimization principles for gene design

do:

1. Build a GP model as an emulator of the cell behavior.

2. Obtain a set of gene design rules (feature optimization).

3. Design one or many new genes coherent with the design rules.

4. Test the genes in the lab (get new data).

until the gene is optimized (or the budget is exhausted). A runnable toy version of this loop is sketched below.
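A runnable toy version of the loop above (a sketch only: the "lab experiment" is a synthetic stub, the feature space is 2-D, and a simple upper-confidence-bound acquisition stands in for the EI used in the talk):

```python
import numpy as np
import GPy

rng = np.random.RandomState(0)

def lab_experiment(X):                        # stub for the expensive wet-lab measurement
    return np.sin(3 * X[:, :1]) - X[:, 1:2] ** 2 + 0.05 * rng.randn(len(X), 1)

X = rng.rand(5, 2)                            # features of the initial gene designs
Y = lab_experiment(X)

for it in range(4):                                                # until the budget is over
    m = GPy.models.GPRegression(X, Y, GPy.kern.RBF(2, ARD=True))   # 1. GP emulator
    m.optimize()
    cand = rng.rand(2000, 2)                                       # 2. candidate design rules
    mu, var = m.predict(cand)
    ucb = mu + 2 * np.sqrt(var)                                    #    simple acquisition
    x_star = cand[np.argmax(ucb)][None, :]                         # 3. "design" a new gene
    Y = np.vstack([Y, lab_experiment(x_star)])                     # 4. test it in the lab
    X = np.vstack([X, x_star])

print(X[np.argmax(Y)], Y.max())               # best design found and its measured value
```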

Model as an emulator of the cell behavior

Model inputs: features (x_i) extracted from the gene sequences (s_i): codon frequency, CAI, gene length, folding energy, etc.

Model outputs: translation and transcription rates f := (f_α, f_β).

Model type: multi-output Gaussian process f ∼ GP(m, K), where K is a coregionalization covariance for the two-output model (+ SE kernel with ARD).

The correlation between the outputs helps! (A minimal GPy sketch of such a model follows below.)
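A minimal sketch of a two-output coregionalized GP in GPy, with synthetic stand-in data (in the talk the inputs are the gene-sequence features and the outputs the measured rates; kernel choices here only mirror the description above, and the exact API may differ between GPy versions):

```python
import numpy as np
import GPy

rng = np.random.RandomState(0)

# stand-in data: 50 "genes" with 3 features and two correlated outputs
X = rng.rand(50, 3)
Y_alpha = np.sin(3 * X[:, :1]) + 0.05 * rng.randn(50, 1)        # translation rate
Y_beta = 0.8 * np.sin(3 * X[:, :1]) + 0.05 * rng.randn(50, 1)   # transcription rate

# ICM kernel: SE (RBF) with ARD over the features, times a 2x2 coregionalization matrix
kern = GPy.util.multioutput.ICM(input_dim=3, num_outputs=2,
                                kernel=GPy.kern.RBF(3, ARD=True))
m = GPy.models.GPCoregionalizedRegression([X, X], [Y_alpha, Y_beta], kernel=kern)
m.optimize()

# predict output 0 (translation) at new inputs: append the output index as an extra column
Xnew = np.hstack([rng.rand(5, 3), np.zeros((5, 1))])
mean, var = m.predict(Xnew, Y_metadata={'output_index': Xnew[:, -1:].astype(int)})
```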

Obtaining optimal gene design rules

Maximize the averaged Expected Improvement for both outputs [Swersky et al., 2013] (a minimal NumPy sketch follows after the formula):

α(x) = σ(x) ( −u Φ(−u) + φ(u) ),   where u = (y_max − m(x)) / σ(x), and

m(x) = (1/2) Σ_{l=α,β} f*_l(x),    σ²(x) = (1/2²) Σ_{l,l'=α,β} (K*(x, x))_{l,l'}.

A batch method is used when several experiments can be run in parallel.
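A minimal NumPy/SciPy sketch of the averaged EI above, taking the two-output predictive mean and covariance at a candidate x as inputs (a sketch of the formula only, not the implementation used for the experiments):

```python
import numpy as np
from scipy.stats import norm

def averaged_ei(mu, cov, y_max):
    """Averaged Expected Improvement over the two outputs (alpha, beta).

    mu    : (2,) predictive means (f*_alpha(x), f*_beta(x)) at a candidate x
    cov   : (2, 2) predictive covariance K*(x, x) across the two outputs
    y_max : best averaged output observed so far
    """
    mu, cov = np.asarray(mu, float), np.asarray(cov, float)
    m = 0.5 * mu.sum()                 # m(x) = (1/2) sum_l f*_l(x)
    s2 = cov.sum() / 4.0               # sigma^2(x) = (1/2^2) sum_{l,l'} (K*(x, x))_{l,l'}
    s = np.sqrt(max(s2, 1e-12))
    u = (y_max - m) / s
    return s * (-u * norm.cdf(-u) + norm.pdf(u))
```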

Designing new genes coherent with the optimal design rules

Simulating-matching approach:

1. Simulate genes 'coherent' with the target (same amino acids).

2. Extract features.

3. Rank the synthetic genes according to their similarity with the 'optimal' design rules.

Ranking criterion: eval(s | x*) = Σ_{j=1}^{p} w_j |x_j − x*_j|, where

- x*: the optimal gene design rules.

- s, x_j: a generated 'synonymous sequence' and its features.

- w_j: weights for the p features (the inverse lengthscales of the model covariance).

A minimal NumPy sketch of this ranking is given below.
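A minimal NumPy sketch of the ranking criterion (the candidate feature matrix, the design rules x*, and the ARD lengthscales are placeholders for the quantities described above):

```python
import numpy as np

def rank_candidates(X_cand, x_star, lengthscales):
    """Rank candidate synonymous sequences by closeness to the optimal design rules.

    X_cand       : (n, p) feature matrix of generated synonymous sequences
    x_star       : (p,) optimal gene design rules (acquisition optimum in feature space)
    lengthscales : (p,) ARD lengthscales of the GP covariance; weights w_j = 1 / lengthscale_j
    """
    w = 1.0 / np.asarray(lengthscales, float)
    scores = np.abs(X_cand - x_star) @ w    # eval(s | x*) = sum_j w_j |x_j - x*_j|
    return np.argsort(scores)               # best (smallest score) first
```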

Experiments

- Optimization of gene designs in mammalian cells.

- Dataset from Schwanhausser et al. (2011) with rates for 3,810 genes. Sequences were extracted from http://www.ensembl.org.

- 250 features involving the 5'UTR, 3'UTR and coding region.

- Gaussian process with ARD and coregionalized outputs.

- Selection of 10 random difficult-to-express genes (average log ratio < 1.5).

- 10,000 random 'synonymous sequences' generated from each gene.

Feature importance

Predictions

The model is able to predict translation rates.

Results for 10 low-expressed genes

Results from simulation: the designs are currently being tested in the lab!

Results for the CMV-pp65 Recombinant Protein

Alternative model: translation rate + mRNA half-life.

[Figure: CMV mRNA translation rate (log2) against gene design index (0 to 10,000), showing the average prediction ± 1 standard deviation and the wild type.]

Predicted results for the 10,000 simulated sequences.

Results for the CMV-pp65 Recombinant Protein

[Figure: CMV optimal genes; translation rate (log2) against mRNA half-life (log2) for the in silico designed genes, with the Pareto frontier of optimal genes highlighted.]

Multi-objective optimization problem: the designs of interest lie on the Pareto frontier (a minimal sketch of computing it is given below).
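A minimal NumPy sketch of extracting the Pareto frontier from the predicted translation rates and half-lives (the array names are placeholders; both objectives are assumed to be maximized):

```python
import numpy as np

def pareto_frontier(translation_rate, half_life):
    """Indices of non-dominated designs when maximizing both objectives."""
    pts = np.column_stack([translation_rate, half_life])
    keep = []
    for i, p in enumerate(pts):
        # p is dominated if some other point is >= in both objectives and > in at least one
        dominated = np.any(np.all(pts >= p, axis=1) & np.any(pts > p, axis=1))
        if not dominated:
            keep.append(i)
    return np.array(keep)

# toy usage with random stand-in predictions
rates = np.random.rand(100)       # predicted translation rate (log2), placeholder
half_lives = np.random.rand(100)  # predicted mRNA half-life (log2), placeholder
frontier = pareto_frontier(rates, half_lives)
```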

Prediction/design web tool

Web app for gene design based on

- GPy (https://github.com/SheffieldML/GPy).

- GPyOpt (https://github.com/SheffieldML/GPyOpt); a minimal GPyOpt usage sketch follows below.
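A minimal GPyOpt usage sketch on a toy 1-D objective (in the real setting the objective is the lab experiment and the domain is the gene-feature space; EI acquisition as in the talk):

```python
import numpy as np
from GPyOpt.methods import BayesianOptimization

# toy 1-D objective standing in for the expensive lab experiment
def f(x):
    return np.sin(3 * x) + 0.1 * x ** 2

domain = [{'name': 'x', 'type': 'continuous', 'domain': (0, 1)}]

opt = BayesianOptimization(f=f, domain=domain,
                           acquisition_type='EI',       # Expected Improvement
                           initial_design_numdata=5)
opt.run_optimization(max_iter=15)

print(opt.x_opt, opt.fx_opt)                            # best input found and its value
```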

Final remarks

- Bayesian optimization is a promising technique for designing synthetic genes: it drastically reduces the number of experiments needed.

- A very important aspect of the problem is having a good surrogate model for the cell behavior.

- Currently working on a model with more outputs, such as protein stability and cell survival.

- Alternative approach: focus on the direct optimization of the sequences themselves, a combinatorial problem.
