Doug Raiford
Lesson 19
There are already very expensive experiments that do this
Doing it from sequence alone would be nice
04/21/23 Expression Prediction with CUB
Worked on a project that predicted metabolic efficiency: the tendency for organisms to use, where possible, less expensive amino acids
Tested by looking at expression vs. protein biosynthetic cost
[Figure: biosynthetic cost (vertical axis) vs. protein production rate (expressivity) (horizontal axis)]
In highly expressed genes, usage of certain codons is extremely biased
Leucine codons: CTA, CTC, CTG, CTT, TTA, TTG
One of the most highly expressed genes in Escherichia coli K12 has 9 CTG codons and zero of all other codons that code for leucine
Translational efficiency
[Figure: a ribosome translating an mRNA into a protein strand. tRNAs carrying alanine (ala) are shown for the alanine codons gc? (gcu, gcc, gca, gcg). Labels: Ribosome, mRNA, tRNAs, Protein Strand]
How would you use this biased usage to predict expression?
Frequency of preferred codons (FOP)
Just look at the most highly expressed genes: either experimentally determined, or genes known to be highly expressed
Calculate usage for all genes: usage is predictive of expressivity
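A minimal sketch of an FOP calculation, assuming the set of preferred codons has already been identified from the highly expressed genes. The function names (`codons`, `fop`) and the toy sequences are illustrative, and the example looks at leucine only, with CTG assumed preferred as in the E. coli example earlier.

```python
from collections import Counter

def codons(seq):
    """Split a coding sequence into codons, dropping any trailing partial codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def fop(seq, preferred, considered):
    """Frequency of preferred codons: preferred count / count of all considered codons.

    `considered` is the set of codons that have synonymous alternatives;
    `preferred` is the subset assumed to be favored in highly expressed genes.
    """
    counts = Counter(codons(seq))
    total = sum(counts[c] for c in considered)
    if total == 0:
        return 0.0
    return sum(counts[c] for c in preferred) / total

# Toy example restricted to leucine (assumption: CTG is the preferred codon).
LEU = {"CTA", "CTC", "CTG", "CTT", "TTA", "TTG"}
high = "CTG" * 9                 # nine CTG codons, no other leucine codons
weak = "CTGCTATTACTTTTGCTC"      # one of each leucine codon
print(fop(high, {"CTG"}, LEU))
print(fop(weak, {"CTG"}, LEU))
```

A real calculation would cover every amino acid with synonymous codons, not just leucine.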
[Figure: expression of genes 1, 2, 3, ..., N]
That is, given sequence data only, can we determine probable expression levels?
  1 atgggttggt caatcatctg atttaatggg caaattttta aagatgcaca ttatatcagc
 61 aaaaaatcga acctgttggg tcttgcgcag ggtgccggac ttggcctagt tttgggcctc
121 aagatgacga tcaaatgacg aaagcttgcc tggtcgaggg ttttttcaac cgtcgattgc
181 gggagcgggg ttgtgcggcc gtatggcgga aatcgctatt cggttgagct gggacgatgg
241 caggacgggg agcggtgcgc ttggacacgc aaacttggca ggaacagggg ctcgaaaccc
301 ggtctccggg acgcacgcgc ggtgaaatca gccaggatga actggcgcac cagtggagcc
361 gtgttcgcgg ccgacttcag gaagaaatcg gcgaggtcga gtaccgcaac tggttgcggc
421 aagccgtgct gcatgggctc gacggcgatg aagtgactgt catgctgccg acccgcttcc
481 tgcgtgactg ggtgaacaag gaatatggca acctgctgac cgcgttctgg caggccgaga
541 acccggcggt acggcgcgtg gatatccgga cccggccggc cggcaccagc gagcgcgcgc
601 ccgacctcgc cgaggtggag ccgaagaccg cgatcgcgcg gcccgccgcc gcggcgcgcc
661 gcgaggccga ggaacgcccg gacatgagcg cgccgctcga cccgcgcttc acctttgata
721 cattcgtggt cggcaagccg aacgaattcg cctatgcctg cgcgcgccgc gtcgccgacg
Look at the data as a matrix
Assume that the major force driving variance in codon usage is translational efficiency
Then highly expressed genes have high usage of preferred codons and low usage of non-preferred codons, while weakly expressed genes have more balanced usage (or even avoid preferred codons)
What does this sound like?
Can find axis of greatest variance
Genes projected on this axis
Highly expressed at one end and weakly at other
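A sketch of the projection step, assuming the data is a genes-by-codons usage matrix and that PCA is done with an SVD of the mean-centered matrix. The function name `pca_project` and the synthetic data (a single hidden "expression" variable driving usage) are illustrative assumptions, not the lecture's data.

```python
import numpy as np

def pca_project(usage, n_components=2):
    """Project genes (rows) onto the leading principal components.

    Returns the gene scores, the component directions, and the fraction of
    variance explained by each returned component.
    """
    centered = usage - usage.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, components, explained[:n_components]

# Synthetic data: one hidden variable shifts usage toward some codons and
# away from others, plus a little noise.
rng = np.random.default_rng(0)
expression = np.linspace(0.0, 1.0, 50)
loadings = np.array([1.0, -1.0, 0.5, -0.5])
usage = np.outer(expression, loadings) + 0.01 * rng.normal(size=(50, 4))

scores, components, explained = pca_project(usage)
print(explained[0])                                   # share of variance on PC1
print(np.corrcoef(scores[:, 0], expression)[0, 1])    # sign of a PC is arbitrary
```

Because the sign of a principal component is arbitrary, which end of the axis holds the highly expressed genes has to be decided separately, for example from genes known to be highly expressed.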
Finding which codons are preferred: check whether a codon's usage is correlated with gene location on the PC
That is, if genes at one end exhibit low usage and genes at the other end exhibit high usage, it is probably a preferred codon
Projection of genes on the first principal component
Correlated?
Region in middle
It is really more accurate to look at distance from the cluster
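A sketch of the correlation test, assuming each gene's position on the first principal component is already known and the axis is oriented so that highly expressed genes sit at the high end. The function name, the 0.5 threshold, and the toy numbers are all illustrative assumptions.

```python
import numpy as np

def codon_pc_correlations(usage, pc_scores):
    """Pearson correlation between each codon's usage column and gene position on a PC.

    Assumes every column of `usage` varies across genes (no constant columns).
    """
    u = usage - usage.mean(axis=0)
    s = pc_scores - pc_scores.mean()
    denom = np.sqrt((u ** 2).sum(axis=0) * (s ** 2).sum())
    return (u * s[:, None]).sum(axis=0) / denom

# Toy data: six genes placed along the PC, three alanine codons.
position = np.arange(6.0)
usage = np.column_stack([
    0.1 * position,                      # usage rises along the axis
    0.5 - 0.1 * position,                # usage falls along the axis
    [0.2, 0.3, 0.2, 0.3, 0.2, 0.3],      # no clear trend
])
names = ["GCT", "GCC", "GCA"]

correlations = codon_pc_correlations(usage, position)
preferred = [n for n, r in zip(names, correlations) if r > 0.5]  # threshold is arbitrary
print(preferred)
```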
SCCI (Carbone et al.): looks for the most self-consistent set of genes
Search for these genes
Looking for subset of genes (reference set) that define a bias to which they themselves adhere more strongly than the rest of the genes
Algorithm
Start with all genes as the reference set
Loop until the reference set shrinks to 1% of the genes:
    Determine which codons are preferred
    Determine average usage for all genes
    Sort by adherence
    Take genes in the top half to be the new reference set
    Repeat
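The loop above can be sketched as follows. This is a simplified illustration, not Carbone's implementation: adherence is assumed to be a CAI-style geometric mean of codon weights, a 0.5 pseudocount is an added assumption to avoid taking the log of zero, and every codon column is assumed to belong to exactly one synonymous family. All names are hypothetical.

```python
import numpy as np

def relative_adaptiveness(ref_counts, families):
    """Weight per codon: its count in the reference set relative to the
    most-used codon in the same synonymous family."""
    totals = ref_counts.sum(axis=0) + 0.5          # pseudocount (assumption)
    weights = np.empty_like(totals, dtype=float)
    for family in families:
        weights[family] = totals[family] / totals[family].max()
    return weights

def adherence(counts, weights):
    """Geometric mean of codon weights for each gene."""
    return np.exp((counts * np.log(weights)).sum(axis=1) / counts.sum(axis=1))

def find_reference_set(counts, families, final_fraction=0.01):
    n_genes = counts.shape[0]
    target = max(1, int(np.ceil(final_fraction * n_genes)))
    reference = np.arange(n_genes)                 # start with all genes
    while len(reference) > target:
        weights = relative_adaptiveness(counts[reference], families)
        order = np.argsort(adherence(counts, weights))[::-1]  # best first
        reference = order[:max(target, len(reference) // 2)]  # halve the set
    return reference, relative_adaptiveness(counts[reference], families)

# Toy genome: 200 genes, two amino acids with two codons each; bias toward
# codons 0 and 2 grows steadily from gene 0 to gene 199.
bias = np.linspace(0.5, 0.95, 200)
first = np.rint(100 * bias).astype(int)
counts = np.column_stack([first, 100 - first, first, 100 - first])
families = [[0, 1], [2, 3]]

reference, weights = find_reference_set(counts, families)
print(sorted(reference.tolist()), weights.round(3))
```

In this toy genome there is only one bias, so the search settles on the most biased genes. The later slides look at what happens when a second, stronger bias is present.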
Do you think all organisms have translational efficiency bias?
How would you expect metabolic efficiency trends to look in organisms that do not have it?
[Figure: biosynthetic cost (vertical axis) vs. protein production rate (expressivity) (horizontal axis); trend marked "?"]
Some organisms actually exhibited significant, positive trends
What could cause a positive trend?
Organisms preferentially utilize the most expensive aa’s in the most highly expressed genes?
We decided the problem must be in our prediction of expressivity
Somehow we got it wrong—in fact, it seems we got it exactly opposite
[Figure: biosynthetic cost (vertical axis) vs. protein production rate (expressivity) (horizontal axis); trend marked "?"]
The misbehaving organisms all had either high or low GC content
But how would this cause a positive trend?
The breakthrough came with Nostoc
The greedy algorithm was finding high AT-content genes that were on the opposite side of the 2D PCA codon usage space
[Figure: PCA on Nostoc, first principal component (horizontal axis) vs. second principal component (vertical axis); algorithm-identified reference set vs. highly expressed genes]
Algorithm is a search for self-consistent genes
What does the search space look like, and why did the algorithm get fooled?
Our lab was heavily into GAs (genetic algorithms), so we think of all optimization problems in terms of being a search
Fitness landscape
Carbone’s algorithm found the reference set associated with the dominant bias. What about the next most dominant?
How do we arrange solutions along two axes (with fitness on a third)?
How do we reduce the number of solutions?
Number of possible solutions
Reference sets tend to be proximal
If we choose nearest neighbors, we only have to calculate fitness once for each gene
We already have a method for viewing gene placement in a 2D space: PCA
Elevated regions: highly self-consistent
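A sketch of how such a landscape could be built, assuming each gene's candidate reference set is simply its k nearest neighbors (itself included) in the 2D PCA space. The fitness function is passed in as a parameter because the slides do not spell out the self-consistency measure. The toy `value` array stands in for it and is purely hypothetical.

```python
import numpy as np

def neighbor_sets(points, k):
    """For each gene, the indices of its k nearest genes (itself included)."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    return np.argsort(dist, axis=1)[:, :k]

def landscape(points, k, fitness):
    """One fitness value per gene: the fitness of that gene's neighborhood."""
    return np.array([fitness(members) for members in neighbor_sets(points, k)])

# Toy layout: two clusters of genes along the first principal component.
points = np.array([[0.0, 0.0], [1.0, 0.0], [2.5, 0.0], [10.0, 0.0], [11.2, 0.0]])
value = np.array([0.1, 0.2, 0.3, 0.9, 0.8])   # stand-in per-gene score (hypothetical)

heights = landscape(points, 2, lambda members: float(value[members].mean()))
print(heights)
```

This needs one fitness evaluation per gene instead of one per possible subset of genes, which is what makes the landscape cheap enough to plot.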
[Figure: PCA on Nostoc, first principal component (horizontal axis) vs. second principal component (vertical axis); algorithm-identified reference set vs. highly expressed genes]
AT-content ridge dominates the search space
How do we fix the algorithm?
I modified the SCCI algorithm to avoid unbalanced GC-content regions (push them down)
Greedy algorithm gets perfect self-consistency scores
Modified algorithm does not
Decided to try using a GA to improve it
We can rebuild him. We have the technology. We have the capability to build the world's first bionic man. Steve Austin will be that man. Better than he was before. Better, stronger, faster.
Parent One: g1 g2 g3 g4 g5 … gN
Parent Two: g1 g2 g3 g4 g5 … gN
Child: g1 g2 g3 g4 g5 … gN
Mutate
Searched for a set of genes that was both:
    Self-consistent
    Identifying a bias to which known highly expressed genes strongly adhered
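A sketch of the two GA operators, assuming each chromosome is a list of N booleans saying whether gene g1 … gN is in the candidate reference set. That encoding, uniform crossover, and bit-flip mutation are assumptions for illustration. The slides do not say which operators were used.

```python
import random

def crossover(parent_one, parent_two, rng):
    """Uniform crossover: each position is copied from one parent or the other."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_one, parent_two)]

def mutate(child, rate, rng):
    """Flip each membership flag independently with probability `rate`."""
    return [(not flag) if rng.random() < rate else flag for flag in child]

rng = random.Random(0)                      # seeded for repeatability
parent_one = [True, True, False, False, True]
parent_two = [True, False, False, True, True]

child = crossover(parent_one, parent_two, rng)
mutant = mutate(child, 0.1, rng)
print(child, mutant)
```

Wherever the two parents agree, the child inherits that value unchanged. Only the positions where they differ are up for grabs.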
Two Objectives
[Figure: ranking of HEGs (vertical axis) vs. self-consistency (horizontal axis)]
For each solution, count the number of solutions that dominate it (are better in both dimensions)
The fewer there are, the higher the fitness
Solutions on the Pareto front: no other solution is better in both dimensions
Solutions on the front are given the highest fitness
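A sketch of the dominance count, using the slide's definition of domination (strictly better in both dimensions) and assuming both objectives are scored so that larger is better. The `1 / (1 + count)` fitness mapping is one illustrative choice among many. The slides only say that fewer dominators means higher fitness.

```python
def dominator_counts(solutions):
    """For each (objective_1, objective_2) pair, count the solutions that beat it in both."""
    return [
        sum(1 for other in solutions if other[0] > s[0] and other[1] > s[1])
        for s in solutions
    ]

# Toy population: (self-consistency, ranking of HEGs), larger is better for both.
solutions = [(1, 5), (2, 4), (3, 3), (1, 1), (2, 2)]

counts = dominator_counts(solutions)
front = [i for i, c in enumerate(counts) if c == 0]     # Pareto front
fitness = [1.0 / (1 + c) for c in counts]               # fewer dominators, higher fitness
print(counts, front, fitness)
```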
[Figure: ranking of HEGs (vertical axis) vs. self-consistency (horizontal axis)]
The solutions that identified a bias to which known highly expressed genes strongly adhered were by far the best
But the genes in the reference set we identified were not among the most highly expressed, yet the bias it discovered (the codon preferences it identified) yielded much better predictions of actual expressivity
Best Solutions
[Figure: ranking of HEGs (vertical axis) vs. self-consistency (horizontal axis), best solutions highlighted]
We just found a better set of codon preferences
Why not directly search for codon preferences?
Reframe the problem
Instead of: “given a set of known highly expressed genes, determine which codons they seem to prefer and use these preferences to rank the whole genome”
We asked: “given a set of known highly expressed genes, which set of codon preferences (weights associated with each codon) yields a gene ranking with the known highly expressed genes at the top?”
Parent One: w1 w2 w3 w4 w5 … w59
Parent Two: w1 w2 w3 w4 w5 … w59
Child: w1 w2 w3 w4 w5 … w59
Mutate
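A sketch of the reframed search, assuming a chromosome is a vector of codon weights, a gene's score is the weighted sum of its codon usage, and fitness is the negated mean rank of the known highly expressed genes (so ranking them nearer the top scores higher). These choices, the operator details, and the three-codon toy data are illustrative assumptions. The slides use 59 weights.

```python
import numpy as np

def heg_rank_fitness(weights, usage, heg_idx):
    """Negated mean rank of known highly expressed genes (rank 0 = top of the genome)."""
    scores = usage @ weights
    order = np.argsort(-scores)                 # best-scoring gene first
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(len(scores))
    return -ranks[heg_idx].mean()

def crossover(w1, w2, rng):
    """Uniform crossover: each weight comes from one parent or the other."""
    return np.where(rng.random(w1.shape) < 0.5, w1, w2)

def mutate(w, rng, rate=0.05, scale=0.1):
    """Add Gaussian noise to a random subset of the weights."""
    hit = rng.random(w.shape) < rate
    return w + hit * rng.normal(0.0, scale, size=w.shape)

# Toy genome: 4 genes x 3 codons; genes 0 and 1 are the known HEGs.
usage = np.array([[0.8, 0.1, 0.1],
                  [0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.1, 0.7]])
heg_idx = [0, 1]
good = np.array([1.0, 0.0, 0.0])
bad = np.array([0.0, 1.0, 1.0])

rng = np.random.default_rng(0)
child = mutate(crossover(good, bad, rng), rng)
print(heg_rank_fitness(good, usage, heg_idx), heg_rank_fitness(bad, usage, heg_idx))
```

A full GA would repeat selection, crossover, and mutation over a population of weight vectors, keeping those with the highest fitness.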