Predicting Expression Levels Using Codon Usage Bias

Doug Raiford

Lesson 19

There are very expensive experiments that measure expression levels

Predicting expression from sequence only would be nice

04/21/23 Expression Prediction with CUB

Worked on a project that predicted metabolic efficiency

Metabolic efficiency: the tendency for organisms to utilize, where possible, less expensive amino acids

Tested by looking at expression vs. protein biosynthetic cost


[Figure: biosynthetic cost vs. protein production rate (expressivity)]

In highly expressed genes, usage of certain codons is extremely biased


Leucine codons: CTA, CTC, CTG, CTT, TTA, TTG

One of the most highly expressed genes in Escherichia coli K12 has 9 CTG codons and zero of all other codons that code for leucine

Translational efficiency


[Figure: a ribosome translating an mRNA into a protein strand; labels: Ribosome, mRNA, tRNAs, Protein Strand; alanine (ala) tRNAs shown for codons gcu, gcc, gca, and gcg, with gcu appearing most often]

How would you use this biased usage to predict expression?

Frequency of preferred codons (FOP)

Just look at the most highly expressed genes

Either experimentally determined or genes known to be highly expressed

Calculate usage for all genes

Usage is predictive of expressivity
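A minimal sketch of the FOP idea, not the lecture's own code. It assumes the preferred-codon set is already known; the `PREFERRED` set below is hypothetical and for illustration only. Excluding Met, Trp, and stop codons is also an assumption of this sketch.

```python
from collections import Counter

# Hypothetical preferred codons, for illustration only. A real set would be
# derived from the organism's known highly expressed genes.
PREFERRED = {"CTG", "GCT", "AAA"}

# Assumption of this sketch: ignore codons with no synonyms (Met, Trp) and stops.
EXCLUDED = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def codons(seq):
    """Split a coding sequence into codons, dropping any trailing partial codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def fop(seq, preferred=PREFERRED):
    """Fraction of the counted codons in seq that are preferred codons."""
    counts = Counter(c for c in codons(seq) if c not in EXCLUDED)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for c, n in counts.items() if c in preferred) / total
```

Ranking every gene in a genome by `fop` then gives a sequence-only stand-in for expressivity, under the assumption that the preferred set is right.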

Expression of Genes

[Figure: expression levels of genes 1, 2, 3, …, N]

That is, given sequence data only, can we determine probable expression levels?


  1 atgggttggt caatcatctg atttaatggg caaattttta aagatgcaca ttatatcagc
 61 aaaaaatcga acctgttggg tcttgcgcag ggtgccggac ttggcctagt tttgggcctc
121 aagatgacga tcaaatgacg aaagcttgcc tggtcgaggg ttttttcaac cgtcgattgc
181 gggagcgggg ttgtgcggcc gtatggcgga aatcgctatt cggttgagct gggacgatgg
241 caggacgggg agcggtgcgc ttggacacgc aaacttggca ggaacagggg ctcgaaaccc
301 ggtctccggg acgcacgcgc ggtgaaatca gccaggatga actggcgcac cagtggagcc
361 gtgttcgcgg ccgacttcag gaagaaatcg gcgaggtcga gtaccgcaac tggttgcggc
421 aagccgtgct gcatgggctc gacggcgatg aagtgactgt catgctgccg acccgcttcc
481 tgcgtgactg ggtgaacaag gaatatggca acctgctgac cgcgttctgg caggccgaga
541 acccggcggt acggcgcgtg gatatccgga cccggccggc cggcaccagc gagcgcgcgc
601 ccgacctcgc cgaggtggag ccgaagaccg cgatcgcgcg gcccgccgcc gcggcgcgcc
661 gcgaggccga ggaacgcccg gacatgagcg cgccgctcga cccgcgcttc acctttgata
721 cattcgtggt cggcaagccg aacgaattcg cctatgcctg cgcgcgccgc gtcgccgacg

Look at the data in a matrix

Assume that the major force driving variance in codon usage is translational efficiency

Then highly expressed genes have high usage of preferred codons and low usage of non-preferred codons, while weakly expressed genes have more balanced usage (or even avoid preferred codons)

What does this sound like?


Can find axis of greatest variance

Genes projected on this axis

Highly expressed at one end and weakly at other
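A small sketch of finding the axis of greatest variance and projecting genes onto it. It uses plain power iteration on the covariance matrix as a stand-in for a full PCA routine; the function name and the toy data in the usage note are assumptions, not from the slides.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (axis, projections) for the axis of greatest variance.

    rows: one list of codon-usage values per gene (genes x codons).
    Uses power iteration, which converges to the dominant eigenvector of the
    covariance matrix when the start vector is not orthogonal to it.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Non-uniform start vector, so it is unlikely to be orthogonal to the answer.
    axis = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * axis[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        axis = [x / norm for x in nxt]
    projections = [sum(c[j] * axis[j] for j in range(d)) for c in centered]
    return axis, projections
```

For example, with rows of (preferred, non-preferred) usage such as `[[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.2, 0.8]]`, the genes line up along the axis in order of their bias. The sign of the axis is arbitrary, so which end is "highly expressed" has to be decided separately.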


…finding which codons are preferred

If a codon’s usage is correlated with gene location on the PC…

That is, if genes at one end exhibit low usage and genes at the other exhibit high usage…

…it is probably a preferred codon
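The correlation test described here can be sketched as follows. The 0.7 threshold and the convention that the highly expressed end of the PC is the positive end are assumptions of this sketch, not values from the lecture.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def likely_preferred(usage_by_codon, projections, threshold=0.7):
    """Codons whose per-gene usage rises with the gene's position on the PC.

    usage_by_codon: dict mapping codon -> list of usage values, one per gene.
    projections: position of each gene on the first principal component,
                 assumed oriented so that highly expressed genes are positive.
    """
    return sorted(codon for codon, usage in usage_by_codon.items()
                  if pearson(usage, projections) >= threshold)
```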

Projection of genes on first principal component

Correlated?

Region in middle

Really, it is more accurate to look at distance from the cluster


SCCI (Carbone, et al.)

Looks for the most self-consistent set of genes


Search for these genes

Looking for subset of genes (reference set) that define a bias to which they themselves adhere more strongly than the rest of the genes


Algorithm

Start with all genes as reference set

Loop till reference set size is 1%:
    Determine which codons are preferred
    Determine average usage for all genes
    Sort by adherence
    Take genes in top half to be the new reference set
    Repeat
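The loop above can be sketched as below. This is an illustration of the halving search, not Carbone's published implementation: the CAI-style weights, the geometric-mean adherence score, and the 0.5 pseudocount are assumptions, and `families` (amino acid to synonymous codons) is supplied by the caller.

```python
import math
from collections import Counter


def reference_weights(reference, families):
    """Weight each codon by its count in the reference set, relative to the
    most-used synonymous codon (assumed CAI-style weighting)."""
    totals = Counter()
    for gene in reference:
        totals.update(gene)
    weights = {}
    for synonyms in families.values():
        top = max(totals[c] for c in synonyms)
        for c in synonyms:
            # 0.5 pseudocount keeps unseen codons from zeroing the score.
            weights[c] = (totals[c] or 0.5) / (top or 0.5)
    return weights


def adherence(gene, weights):
    """Geometric mean of the weights of the gene's codons."""
    n = sum(cnt for c, cnt in gene.items() if c in weights)
    if n == 0:
        return 0.0
    log_sum = sum(cnt * math.log(weights[c])
                  for c, cnt in gene.items() if c in weights)
    return math.exp(log_sum / n)


def find_reference_set(genes, families, final_fraction=0.01):
    """genes: dict mapping gene name -> Counter of codon counts."""
    names = list(genes)
    target = max(1, round(len(names) * final_fraction))
    reference = names  # start with all genes as the reference set
    while len(reference) > target:
        weights = reference_weights([genes[n] for n in reference], families)
        ranked = sorted(names, key=lambda n: adherence(genes[n], weights),
                        reverse=True)
        # Halve the reference set, but never go below the target size.
        reference = ranked[:max(target, len(reference) // 2)]
    return reference
```

Because each round keeps whatever adheres best to the current bias, the search is greedy: it follows the dominant bias in the genome, whatever its cause.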

Do you think all organisms have translational efficiency bias?

How would you expect metabolic efficiency trends to look in organisms that do not have it?


[Figure: biosynthetic cost vs. protein production rate (expressivity); trend marked "?"]

Some actually exhibited significant and positive trends

What could cause a positive trend?

Organisms preferentially utilize the most expensive aa’s in the most highly expressed genes?

We decided the problem must be in our prediction of expressivity

Somehow we got it wrong—in fact, it seems we got it exactly opposite


[Figure: biosynthetic cost vs. protein production rate (expressivity); trend marked "?"]

Misbehavers were all high and low GC-content organisms

But how would this cause a positive trend?

Breakthrough came with Nostoc

Greedy algorithm was finding high AT-content genes that were on the opposite side of the PCA 2D codon usage space


[Figure: PCA on Nostoc (second principal component vs. first principal component), algorithm-identified reference set vs. highly expressed genes]

Algorithm is a search for self-consistent genes

What does the search space look like, and why did the algorithm get fooled?

Our lab was heavy into GAs (genetic algorithms)

Think of all optimization problems in terms of being a search

Fitness landscape

Carbone’s algorithm found the reference set associated with the dominant bias. What about the next most dominant?

How to arrange solutions along two axes (with fitness in a third)?

How to reduce the number of solutions?


Number of possible solutions


Reference sets tend to be proximal

If we choose each gene's nearest neighbors as its reference set, we only have to calculate fitness once per gene

We already have a method for viewing gene placement in a 2D space: PCA

Elevated regions: highly self-consistent
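A rough sketch of why restricting reference sets to neighborhoods helps. The genome size of 4,000 genes in the usage note is hypothetical, and treating a reference set as "a gene plus its nearest neighbors in the 2D PCA plane" is this sketch's reading of the slide.

```python
import math


def count_arbitrary_reference_sets(n_genes, fraction=0.01):
    """Number of ways to pick any reference set of the given relative size."""
    k = max(1, round(n_genes * fraction))
    return math.comb(n_genes, k)


def count_neighborhood_reference_sets(n_genes):
    """One candidate reference set per gene: the gene and its nearest neighbors."""
    return n_genes


def neighborhood(points, center, k):
    """Indices of the k genes closest to gene `center` in PCA space,
    including the gene itself. points: list of (pc1, pc2) tuples."""
    cx, cy = points[center]
    order = sorted(range(len(points)),
                   key=lambda i: (points[i][0] - cx) ** 2 + (points[i][1] - cy) ** 2)
    return order[:k]
```

With 4,000 genes and a 1% reference set, `count_arbitrary_reference_sets(4000)` is astronomically large, while the neighborhood version needs only 4,000 fitness evaluations. Plotting each gene's fitness at its PCA coordinates then gives the fitness landscape.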

[Figure: PCA on Nostoc (second principal component vs. first principal component), algorithm-identified reference set vs. highly expressed genes; annotation: AT-content ridge dominates search space]

How to fix the algorithm?

I modified the SCCI algorithm to avoid unbalanced GC-content regions

Push down


Greedy algorithm gets perfect self-consistency scores

Modified algorithm does not

Decided to try using a GA to improve


We can rebuild him. We have the technology. We have the capability to build the world's first bionic man. Steve Austin will be that man. Better than he was before. Better, stronger, faster.


Parent One: g1 g2 g3 g4 g5 … gN

Parent Two: g1 g2 g3 g4 g5 … gN

Child: g1 g2 g3 g4 g5 … gN

Mutate
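A minimal sketch of the crossover and mutation steps pictured above. The slide does not say how g1 … gN are encoded or which crossover was used; this sketch assumes each gi is a boolean ("gene i is in the reference set"), uniform crossover, and bit-flip mutation.

```python
import random


def crossover(parent_one, parent_two, rng):
    """Uniform crossover: each position is inherited from one parent at random."""
    return [a if rng.random() < 0.5 else b
            for a, b in zip(parent_one, parent_two)]


def mutate(child, rng, rate=0.01):
    """Flip each membership flag with probability `rate`."""
    return [(not g) if rng.random() < rate else g for g in child]


# Usage: breed one child from two candidate reference sets over 8 genes.
rng = random.Random(0)
parent_one = [True, True, False, False, True, False, False, False]
parent_two = [False, True, True, False, False, False, True, False]
child = mutate(crossover(parent_one, parent_two, rng), rng)
```

Positions where both parents agree are preserved by the crossover, so only mutation can change them.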

Searched for a set of genes that were both:

Self-consistent

And that identified a bias to which known highly expressed genes strongly adhered


Two Objectives

[Figure: ranking of HEGs vs. self-consistent]

For each solution, count the number of solutions that dominate it (better in both dimensions)

Solutions on the Pareto front: no other solution is better in both dimensions

The fewer dominating solutions there are, the higher the fitness

Gene sets on the front are given the highest fitness
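The domination count can be sketched as follows, assuming each solution is scored as a pair (self-consistency, ranking of HEGs) and that higher is better in both. The pair layout and function names are assumptions of this sketch.

```python
def dominates(a, b):
    """True if solution a is better than solution b in both dimensions."""
    return a[0] > b[0] and a[1] > b[1]


def domination_counts(solutions):
    """For each solution, how many other solutions dominate it.
    A lower count means a higher fitness."""
    return [sum(dominates(other, s) for other in solutions) for s in solutions]


def pareto_front(solutions):
    """Solutions that no other solution dominates."""
    counts = domination_counts(solutions)
    return [s for s, n in zip(solutions, counts) if n == 0]
```

A GA can then use the negated count (or any decreasing function of it) as the fitness of each candidate reference set.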


[Figure: ranking of HEGs vs. self-consistent]

The solutions that identified a bias to which known highly expressed genes strongly adhered were by far the best

But the genes in the reference set we identified were not among the most highly expressed… yet the bias it discovered (the codon preferences it identified) yielded much better predictions of actual expressivity


Best Solutions

[Figure: ranking of HEGs vs. self-consistent; label: Best Solutions]

We just found a better set of codon preferences

Why not directly search for codon preferences?

Reframe the problem

Instead of “given a set of known highly expressed genes, determine which codons they seem to prefer and use these preferences to rank the whole genome”

We asked “given a set of known highly expressed genes, which set of codon preferences (weights associated with each codon) yields a gene ranking with known highly expressed genes at the top?”


Parent One: w1 w2 w3 w4 w5 … w59

Parent Two: w1 w2 w3 w4 w5 … w59

Child: w1 w2 w3 w4 w5 … w59

Mutate
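One way the fitness of a weight vector might be evaluated, as a sketch only. The slides do not give the scoring or fitness formula; this assumes a gene's score is the usage-weighted average of its codon weights, and that fitness is the negative mean rank of the known highly expressed genes.

```python
def score(gene_counts, weights):
    """Usage-weighted average of codon weights for one gene.
    gene_counts: dict mapping codon -> count. Codons without a weight count as 0."""
    total = sum(gene_counts.values())
    if total == 0:
        return 0.0
    return sum(weights.get(c, 0.0) * n for c, n in gene_counts.items()) / total


def fitness(weights, genes, known_heg):
    """Negative mean rank of the known highly expressed genes (rank 1 = top),
    so weight sets that push those genes to the top score higher.

    genes: dict mapping gene name -> codon counts.
    known_heg: names of genes known to be highly expressed.
    """
    ranked = sorted(genes, key=lambda name: score(genes[name], weights),
                    reverse=True)
    position = {name: i + 1 for i, name in enumerate(ranked)}
    mean_rank = sum(position[name] for name in known_heg) / len(known_heg)
    return -mean_rank
```

The GA then evolves the 59 weights directly, and the genome ranking produced by the best weight vector serves as the expression prediction.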
