Correlating mRNA and protein abundance via genomic and proteomic characteristics
Dov Greenbaum
Gerstein Lab Thesis Seminar
April 21, 2004
Outline
Why analyze mRNA and protein correlations
Background
Disparate Data Sources
Correlating mRNA and Protein
Results
Other analyses
Formalism – comparing genome, transcriptome and proteome in terms of broad categories
New Data Sets
Analysis via Broad Categories
Analysis of factors affecting correlations
Another reason to expect correlations
Expression and Protein Interactions
Why Correlate mRNA & Protein?
[Figure: bar chart of the number of experiments, mRNA vs. protein (axis 0 to 5000)]
Both mRNA and Protein Levels are necessary for complete analysis
Combinations of RNA and protein detection approaches have recently aided in the identification of biomarkers in cancer (Hegde et al, Current Opinion in Biotech 2003)
Shown mathematically in Hatzimanikatis et al Biotechnology 1999
Relationship between mRNA and Protein levels
$$\frac{dP_i}{dt} = k_{s,i}\,\mathrm{mRNA}_i - k_{d,i}\,P_i$$
where $k_{s,i}$ and $k_{d,i}$ are the protein synthesis and degradation rate constants, respectively.
At steady state:
$$P_i = \frac{k_{s,i}\,\mathrm{mRNA}_i}{k_{d,i}}$$
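A minimal numerical sketch of the rate equation above: it integrates dP/dt with forward Euler steps and compares the result with the closed-form steady state. The rate constants and mRNA level are made-up illustrative values, not numbers from Hatzimanikatis et al.

```python
def simulate_protein(mrna, k_s, k_d, p0=0.0, dt=0.01, t_end=50.0):
    """Integrate dP/dt = k_s * mRNA - k_d * P with forward Euler steps."""
    p = p0
    for _ in range(int(t_end / dt)):
        p += dt * (k_s * mrna - k_d * p)
    return p


def steady_state(mrna, k_s, k_d):
    """Closed-form steady state: P = k_s * mRNA / k_d."""
    return k_s * mrna / k_d


# Hypothetical example: two genes with the same mRNA level and synthesis
# rate but different degradation rates end up at different protein levels.
mrna = 10.0
fast_turnover = simulate_protein(mrna, k_s=2.0, k_d=1.0)
slow_turnover = simulate_protein(mrna, k_s=2.0, k_d=0.2)
print(fast_turnover, steady_state(mrna, 2.0, 1.0))
print(slow_turnover, steady_state(mrna, 2.0, 0.2))
```

Under this model, identical mRNA levels can give very different protein levels whenever the rate constants differ between genes.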
Methods for determining mRNA expression
Each has strengths and weaknesses
Methods for determining protein abundance
2DE Gel Electrophoresis (Klose, 1975; O’Farrell, 1975)
• Multiple staining options
• Small dynamic range
• Limited in what it can detect
Methods for determining protein abundance
ICAT
– ICAT reagent: relative levels
– VB dynamic range
– Cannot detect post-translational modifications
– Requires proteins to contain cysteine residues, and these residues must be in the region of a peptide that is produced during proteolytic cleavage
MudPit
Really the only HT method that can detect PT modifications
Other Methods for determining protein abundance
DIGE
– e.g. Cy3 vs. Cy5 labeling
– Very big dynamic range
2D-electrophoresis
Tap Tagging, Weissman & O’Shea (Oct 2003)
Other Methods for determining protein abundance
[Figure: two bar charts comparing methods (2DE, DIGE, ICAT, MudPit, Tap) against a maximum; axes 0 to 80000 (Affy Max) and 0 to 4000 (Max Prot)]
Same mRNA levels yet protein data varied > 20X
N ~100, r = 0.9
Protein Quantification via measurement of radioactivity
Gygi et al Molecular and Cellular Biology, 1999
Same mRNA levels yet protein data varied > 20X
Do some ORFs bias the results?
73 proteins (69%) R = 0.356
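A small sketch of how a few abundant ORFs can dominate a Pearson correlation. The numbers are made up for illustration and are not the Gygi et al data; `pearson` is a helper defined here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: eight low-abundance ORFs with scattered protein levels,
# plus two very abundant ORFs that lie on a common trend.
mrna = [1, 2, 3, 4, 5, 6, 7, 8, 200, 400]
protein = [40, 5, 60, 10, 55, 8, 30, 20, 9000, 20000]

r_all = pearson(mrna, protein)            # all ORFs
r_low = pearson(mrna[:8], protein[:8])    # low-abundance ORFs only
r_log = pearson([math.log10(m) for m in mrna],
                [math.log10(p) for p in protein])  # log-transformed
print(r_all, r_low, r_log)
```

In this toy set the overall r is close to 1 while the low-abundance subset is essentially uncorrelated, which is the kind of bias the slide asks about.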
mRNA vs Protein: r = 0.74
Protein Quantification via image analysis
Futcher et al Molecular and Cellular Biology, 1999
Jury is out…
Gygi et al: “This study revealed that transcript levels provide little predictive value with respect to the extent of protein expression.”
Futcher et al: “there is a good correlation between protein abundance and mRNA abundance for the proteins that we have studied”.
mRNA vs Protein
r =0.67
Greenbaum et al Bioinformatics 2001
3 Genes in Lung Adenocarcinomas
Op18, Annexin IV, and GAPD: r = 0.025
Chen et al Molecular & Cellular Proteomics, 2002.
Murine hematopoietic precursor MPRO: change in expression, 0 - 72 hr
R = 0.58
~80% of the genes are located in the first and third quadrants
Ratios of wt+gal to wt-gal: ICAT vs microarray
N ~ 290, r = 0.6
Ideker et al Science, 2001
Yeast growth under two different media
r = 0.45, but almost 1.0 for same loci in same pathway
Washburn et al PNAS 2003
Integrating multiple sources of Information
The challenge for computational biology is to provide methodologies for transforming high-throughput heterogeneous data sets into biological insights about the underlying mechanisms. Although high-throughput assays provide a global picture, the details are often noisy, hence conclusions should be supported by several types of observations. Integration of data from assays that examine cellular systems from different viewpoints (for instance, gene expression and protein-protein interactions) can lead to a more coherent reconstruction and reduce the effects of noise. Nir Friedman, Science 2004
Sources of Data
Data set | Description | Size [ORFs] | Reference
mRNA expression
Young | Gene chip profiles of yeast cells with mutations that affect transcription | 5455 | Holstege et al. (1998)
Church | Gene chip profiles of yeast cells under four different conditions | 6263 | Roth et al. (1998)
Samson | Comparing gene chip profiles for yeast cells subjected to alkylating agent | 6090 | Jelinsky et al. (1998)
SAGE | Yeast cells during vegetative growth | 3778 | Velculescu et al. (1997)
Reference expression | Scaling and integrating the mRNA expression sets into one data source | 6249 | -
Protein abundance
2-DE #1 | Measurement of yeast protein abundance by two-dimensional (2D) gel electrophoresis and mass spectrometry | 156 | Gygi et al. (1999)
2-DE #2 | Similar to 2-DE set #1 | 71 | Futcher et al. (1999)
Transposon | Large-scale fusions of yeast genes with lacZ by transposon insertion | 1410 | Ross-Macdonald et al. (1999)
Reference abundance | Scaling and integrating the 2-DE data sets into one data source | 181 | -
Annotation
Annotated localization | Subcellular localizations of yeast proteins | 2133 (6280) | Drawid et al. (2000)
Transmembrane segments | Predicted transmembrane and soluble proteins in yeast | 2710 (6280) | Gerstein (1998)
MIPS functions | Functional categories for yeast ORFs | 3519 (6194) | Mewes et al. (2000)
GOR secondary structure | Predicted secondary structure for yeast ORFs | 6280 | Gerstein (1998)
Reference mRNA Sets
Young, Church, Samson, SAGE
Fitting Protein Data
Original Set
mRNA vs Protein
r = 0.67
Greenbaum et al Bioinformatics 2001
mRNA expression Reference Set: 3 Affy Chip sets and SAGE, 6249 ORFs
Outliers (2 STDEV from the mean)
ORF | FUNCTION | MIPS
YBR118W | translation elongation factor eEF1 alpha-A chain | 5, 30
YER065C | Isocitrate lyase | 1, 2, 30
YMR303C | Alcohol dehydrogenase II | 1, 2, 30
YOL086C | Alcohol dehydrogenase I | 1, 2, 30
YJR009C | Glyceraldehyde-3-phosphate dehydrogenase 2 | 1, 2, 30
YGR192C | Glyceraldehyde-3-phosphate dehydrogenase 3 | 1, 2, 30
YJR104C | Copper-zinc superoxide dismutase | 11, 30
YML054C | Lactate dehydrogenase cytochrome b2 | 1, 2, 30
YJL052W | Glyceraldehyde-3-phosphate dehydrogenase 1 | 1, 2, 30
YKR059W | Translation initiation factor | 5, 30
YML008C | S-adenosyl-methionine delta-24-sterol-c-methyltransferase | 1, 30
YFL022C | Phenylalanine--tRNA ligase beta chain | 5, 30
YJL008C | Component of chaperonin-containing T-complex | 6, 30
YPL160W | Leucine--tRNA ligase | 5, 30
YOR361C | Translation initiation factor eIF3 subunit | 3, 5, 30
YCL030C | Phosphoribosyl-AMP cyclohydrolase | 1
YNL209W | Heat shock protein of HSP70 family | 5, 30
Row groups on the slide: above trendline, below trendline
High Protein: Metabolism (1), Energy (2)
Low Protein: Prot. Syn. (5), Prot. Fate (6)
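One way to flag outliers of this kind is sketched below, assuming a least-squares trendline in log-log space and a cutoff of 2 standard deviations of the residuals. The ORF names and values are made up, and `fit_line` and `trendline_outliers` are hypothetical helper names, not the original analysis code.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def trendline_outliers(orfs, mrna, protein, n_sd=2.0):
    """Return ORFs lying more than n_sd residual SDs above/below the trendline."""
    lx = [math.log10(m) for m in mrna]
    ly = [math.log10(p) for p in protein]
    slope, intercept = fit_line(lx, ly)
    resid = [y - (slope * x + intercept) for x, y in zip(lx, ly)]
    sd = math.sqrt(sum(r * r for r in resid) / (len(resid) - 2))
    above = [o for o, r in zip(orfs, resid) if r > n_sd * sd]
    below = [o for o, r in zip(orfs, resid) if r < -n_sd * sd]
    return above, below


# Made-up data: protein roughly proportional to mRNA, with one ORF pushed
# 30-fold up (high protein) and one pushed 30-fold down (low protein).
orfs = [f"ORF{i:02d}" for i in range(1, 21)]
mrna = [float(i) for i in range(1, 21)]
protein = [1000.0 * i * (1.1 if i % 2 else 0.9) for i in range(1, 21)]
protein[4] *= 30.0
protein[14] /= 30.0

above, below = trendline_outliers(orfs, mrna, protein)
print(above, below)
```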
Later larger datasets concurred with these results in that Generally…
[Figure: protein vs. mRNA scatter plot, log-log axes]
Alcohol dehydrogenase is also a stress-induced protein in many organisms (Matton et al. 1990; An et al. 1991; Millar et al. 1994). Faster ramp up?
AA metabolism & Energy are 2X as likely to have high protein vs mRNA than the general population
Protein synthesis (~35% of all protein synthesis genes) and Protein fate (folding, modification, destination) are more likely to have low protein vs mRNA than the general population
Non-Outliers Generally…
Tight Regulation by the cell
Only 3% of transcription associated genes (n = 441) have significantly uncorrelated mRNA and protein levels (2STDEV from trendline)
Transcription Assoc. genes are 25% of the essential genes in yeast.
Essential Genes as a group have higher correlations than the general yeast population
7% of Cell Cycle associated genes (n = 432) have significant non-correlation
Quick Summary
• Why correlate mRNA and protein levels?
• Merged Disparate Data Sets
– Distinct but complementary
• Global Correlations
• Outliers are interesting:
– Metabolism & Energy: relatively high protein levels
– Protein Synthesis & Protein Fate: low protein levels
Data Set Size
~6,000 ORFs
~6,000 ORFs: 5 Affymetrix GeneChips + SAGE data
~170 ORFs: 2 DE-gel datasets
Enrichments
$$\Delta(\mathrm{Feature},[v,S],[w,G]) = \frac{\mu(F,[v,S]) - \mu(F,[w,G])}{\mu(F,[w,G])}$$
$v$ and $w$ are the weights (expression levels) of sets $S$ and $G$
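A minimal sketch of the enrichment formalism, assuming each term is the weighted average of a feature F over a gene set and that enrichment is the relative difference between the two weighted averages. The feature values and weights are made up; `weighted_mean` and `enrichment` are hypothetical helper names.

```python
def weighted_mean(feature, weights):
    """Weighted average of a per-ORF feature (assumed meaning of each term)."""
    return sum(f * w for f, w in zip(feature, weights)) / sum(weights)


def enrichment(feature_s, v, feature_g, w):
    """Relative difference of the feature in set S (weights v) vs set G (weights w)."""
    mu_s = weighted_mean(feature_s, v)
    mu_g = weighted_mean(feature_g, w)
    return (mu_s - mu_g) / mu_g


# Made-up example: fraction of random coil in four ORFs.
coil_fraction = [0.50, 0.40, 0.30, 0.20]
genome_weights = [1, 1, 1, 1]   # genome: every ORF counted once
mrna_weights = [1, 1, 4, 4]     # transcriptome: weighted by expression level

delta = enrichment(coil_fraction, mrna_weights, coil_fraction, genome_weights)
print(delta)  # negative: random coil is depleted in the weighted population
```

Here the same ORFs appear in both sets and only the weights differ, so a negative value means highly expressed ORFs carry less of the feature.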
Visual Formalism
~170 ORFs ~6,000 ORFs
Depletion of Random Coil Secondary Structure STABILITY
Concurrence with data from Perczel et al Chemistry 2003 regarding stability of specific secondary structures
Alanines, glycines and valines result in more compact structures. More compact = more stable (e.g. thermophilic enzymes tend to be very compact)
Enrichment of Amino Acids STABILITY
Enrichment of Amino Acids
Simple story: translatome is enriched in same way as transcriptome
Enrichment of Molecular Weights/Biomass
Abundant proteins are smaller = reduces cost
Yeast cell favors the expression of shorter ORFs over longer ones (as opposed to long lightweight ORFs – see MW of aa)
This selection is happening, for the most part, at the transcriptome level
Neg. correlation between ORF length and mRNA expression, Jansen & Gerstein 2000 (and to a lesser degree with protein abundance)
Effect of transcription
Enrichment of Molecular Weights/Biomass
Abundant proteins are smaller = reduces cost
Concurs with experimental results from Akashi, Genetics 2003
See also: Akashi, Genetics 1996 & Moriyama and Powell, NAR 1998
Hypothesize that this trend exists in S. cerevisiae, D. melanogaster and E. coli (although probably not in C. elegans)
Effect of transcription
Enrichment of Functional Categories
[Figure: protein vs. mRNA scatter plot, log-log axes]
Depletion Functional Categories
Transcription & Cell Growth: molecular switches
Require only minimal expression
Enrichment of localization - BIAS?
(Drawid & Gerstein, 2000)
Review
Formalism
Different gene sets b/c of limited data
Enrichments concur with experimental results
Fitting Protein Data
Newer Set
Mudpit is first fit into mRNA space, then inverse fit back into protein space; then each of the data sets is fit via least squares onto the Aebersold data set
Data set | Aebersold | Futcher | Reference | Yates | Gygi | mRNA
Aebersold | 125 | 29 | 113 | 102 | 116 | 125
Futcher | | 73 | 61 | 56 | 64 | 69
Reference | | | 150 | 143 | 128 | 150
Yates | | | | 1436 | 785 | 1346
Gygi | | | | | 1504 | 1480
mRNA | | | | | | 6250
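A rough sketch of fitting one data set onto a reference set by least squares over the ORFs they share. Doing the fit on log-transformed values is an assumption here (the slide only says least squares), and the ORF names, values and the `fit_onto_reference` helper are hypothetical.

```python
import math


def fit_onto_reference(dataset, reference):
    """Rescale `dataset` onto `reference` using a least-squares line
    fitted in log space over the ORFs present in both."""
    shared = sorted(set(dataset) & set(reference))
    xs = [math.log10(dataset[o]) for o in shared]
    ys = [math.log10(reference[o]) for o in shared]
    n = len(shared)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    # Apply the fitted transform to every ORF, including unshared ones.
    return {o: 10 ** (slope * math.log10(v) + intercept)
            for o, v in dataset.items()}


# Made-up example: the second set is on a 100-fold smaller scale and
# contains one ORF ("orf_d") that the reference set lacks.
reference = {"orf_a": 100.0, "orf_b": 1000.0, "orf_c": 10000.0}
other = {"orf_a": 1.0, "orf_b": 10.0, "orf_c": 100.0, "orf_d": 1000.0}

rescaled = fit_onto_reference(other, reference)
print(rescaled)
```

Only the shared ORFs determine the fit, which is why the overlap counts in the table above matter: a pair of sets with few shared ORFs gives a poorly constrained transform.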
Global Correlation
[Figure: protein abundance vs. mRNA expression, log-log axes; series: MudPit (1), MudPit (2), 2DE (1), 2DE (2)]
R = 0.66
mRNA Set: 6249 ORFs
Protein Set #2: 2 2DE sets & 2 Mudpit sets, ~2000 ORFs
Functional Categories
[Figure: protein abundance vs. mRNA expression by functional category, log-log axes]
Cell Cycle (R=0.71)
Reference Data (R=0.66)
Cell Rescue (R=0.45)
Co-regulated proteins
High: ion transport, interaction with the cellular environment, cell fate
Low: metabolism, fate, cellular communication/signal transduction mechanism
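A sketch of computing the mRNA-protein correlation separately for each broad category, assuming log-transformed values and a Pearson coefficient. The category labels are borrowed from the slide but the data points are invented for illustration.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_by_category(records, min_size=3):
    """records: iterable of (category, mRNA level, protein level)."""
    groups = defaultdict(list)
    for category, mrna, protein in records:
        groups[category].append((math.log10(mrna), math.log10(protein)))
    result = {}
    for category, points in groups.items():
        if len(points) >= min_size:  # skip categories too small to correlate
            xs, ys = zip(*points)
            result[category] = pearson(xs, ys)
    return result


# Invented values: one tightly coupled category, one loosely coupled one.
records = [
    ("cell cycle", 1, 10), ("cell cycle", 10, 100),
    ("cell cycle", 100, 1000), ("cell cycle", 1000, 10000),
    ("cell rescue", 1, 100), ("cell rescue", 10, 10),
    ("cell rescue", 100, 1000), ("cell rescue", 1000, 100),
]
r_by_category = correlation_by_category(records)
print(r_by_category)
```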
Subcellular Localization
[Figure: protein abundance vs. mRNA expression by subcellular localization, log-log axes]
Nucleolus (R=0.8)
Cell Periphery (R=0.74)
Reference Data (R=0.66)
Mitochondria (R=0.42)
Subcellular Localization
Mudpit does not have the 2DE biases
Lack of correlation in mitochondria concurs with experimental results from Ohlmeier S et al. JBC 2004
Bud: r = 0.76
Golgi: r = 0.28
Extracellular: r = 0.33
Nucleus: r = 0.49
Cytoplasm: r = 0.50
Mitochondria: r = 0.50
Cell Wall: r = 0.52
Endosome: r = 0.87
ER: r = 0.61
Membrane: r = 0.73
P M
r global = 0.46
Expression as a function of localization is well correlated with protein levels (latest data)
Why would we not find strong correlations?
Post translational modifications
Protein degradation
Error and Bias
[Figure: correlation (0 to 1) for the top and bottom fractions of genes ranked by Occupancy, CAI and Coefficient of Variation]
Ribosomal Occupancy
Arava et al. (2003) Proc. Natl. Acad. Sci. USA
Top Frac. 0.78
Bot. Frac. 0.30
Our results concurred with experimental findings by Brown and Herschlag’s groups:
Moreover: mRNAs not associated with any polysomes have even less of a correlation, r = 0.2 v. strong translational control
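A sketch of the top-fraction vs. bottom-fraction comparison: genes are ranked by a feature (here ribosomal occupancy) and the mRNA-protein correlation is computed separately in each tail. The gene records, the 50% split and the `top_bottom_correlation` helper are illustrative assumptions, not the values or cutoffs used in the talk.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def top_bottom_correlation(genes, feature, frac=0.5):
    """Correlate mRNA with protein in the top and bottom `frac` of genes by `feature`."""
    ranked = sorted(genes, key=lambda g: g[feature])
    k = int(len(ranked) * frac)

    def r(group):
        return pearson([g["mrna"] for g in group],
                       [g["protein"] for g in group])

    return r(ranked[-k:]), r(ranked[:k])


# Made-up genes: high-occupancy genes track mRNA, low-occupancy genes do not.
genes = [
    {"mrna": 1, "protein": 30, "occupancy": 0.1},
    {"mrna": 2, "protein": 10, "occupancy": 0.2},
    {"mrna": 3, "protein": 40, "occupancy": 0.3},
    {"mrna": 4, "protein": 20, "occupancy": 0.4},
    {"mrna": 1, "protein": 10, "occupancy": 0.6},
    {"mrna": 2, "protein": 20, "occupancy": 0.7},
    {"mrna": 3, "protein": 30, "occupancy": 0.8},
    {"mrna": 4, "protein": 40, "occupancy": 0.9},
]
top_r, bottom_r = top_bottom_correlation(genes, "occupancy")
print(top_r, bottom_r)
```

The same function can be reused with any other per-gene feature key, for example a CAI or coefficient-of-variation value stored in each record.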
Variability of mRNA expression
[Figure: correlation (0 to 1) for the top and bottom fractions by Coefficient of Variation]
mRNA Expression Variability
Top Frac. 0.89
Bot. Frac. 0.20
[Figure: mRNA expression over time]
Codon Adaptation Index
[Figure: correlation (0 to 0.6) for the top and bottom fractions by CAI]
Codon Usage
Top Frac. 0.48
Bot. Frac. 0.02
Concurs with experimental data: CAI does not predict mRNA and protein the same way, shown to be the result of different levels of degradation
Another summary
Newer, larger data set
Looking at Broad Categories
I. Post translational modifications? Where we expect PT control --> low r. Where we don’t expect it --> high r
Occupancy, Variability
II. Protein Degradation? CAI
III. Experimental Error? Next section
Expression and interactions
Types of protein-protein interactions
– Protein complexes
• For example: proteasome, ribosome
– Aggregated interactions
• Yeast two-hybrid (Y2H)
• Genetic/physical interactions from MIPS
Relationship of P-P-interactions to abs. expression level
$$D_{ij} = \frac{|E_i - E_j|}{E_i + E_j}$$
similar protein results
Protein-Protein Interactions & Expression Correlations
[Figure: distributions of correlations between selected expression timecourses; Cell Cycle CDC28 expt. (Davis)]
Sets of interactions shown: all pairs (control); pairwise interactions (from MIPS; Uetz et al.); strong interactions in permanent complexes (clearly diff.)
Permanent vs. Transient Complexes
[Figure: scatter plot of complexes, axes CC and Rosetta (-0.2 to 1.2); transient vs. permanent complexes; labeled points: L Ribosome, S Ribosome, SAGA]
Representing Expression Correlations within a Large Complex in a Matrix
[Figure: matrix of pairwise expression correlations; rows and columns: MCM3, MCM6, CDC47, MCM2, CDC46, CDC54, DPB3, CDC45, DPB2, CDC2, CDC7, POL2, HYS2, POL32, DBF4, ORC2, ORC6, ORC5, ORC4, ORC3, ORC1; color scale: correlation]
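A minimal sketch of building such a matrix: every pair of subunits is scored by the Pearson correlation of their expression time courses. The subunit names and profiles are made up for illustration and are not the replication-complex data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(profiles):
    """profiles: dict mapping subunit name -> expression time course."""
    names = list(profiles)
    matrix = [[pearson(profiles[a], profiles[b]) for b in names]
              for a in names]
    return names, matrix


# Made-up time courses: two co-expressed pairs with opposite trends.
profiles = {
    "subunit_a": [1, 2, 3, 4, 5],
    "subunit_b": [2, 4, 6, 8, 10],
    "subunit_c": [5, 4, 3, 2, 1],
    "subunit_d": [10, 8, 6, 4, 2],
}
names, matrix = correlation_matrix(profiles)
for name, row in zip(names, matrix):
    print(name, [round(v, 2) for v in row])
```

Blocks of high values along the diagonal correspond to groups of subunits that are co-expressed with each other.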
Permanent? Transient?
[Figure: correlation panels; L7/L12 highlighted]
Cell degrades all excess ribosomal proteins, except L7 & L12
Expression Correlations Segment Large Replication Complex into Component Parts
[Figure: correlation matrix over MCM3, MCM6, CDC47, MCM2, CDC46, CDC54, DPB3, CDC45, DPB2, CDC2, CDC7, POL2, HYS2, POL32, DBF4, ORC2, ORC6, ORC5, ORC4, ORC3, ORC1; blocks labeled MCM prots., ORC, Polym. &]
Temporally transient
No distinction visible between components: indicative of the possibility that the two components are really one?
Division is an artifact of their discovery (M. Hochstrasser)
Proteasome: Overall .43, 20S .50, 19S .51
Proteasome
%ORFs in complexes with significant correlation
Complex (> 2 ORFS, P < 0.001) n alpha Cdc15 Cdc28 Rosetta
Alpha, al-treh. anchor (50) 4 75% 75%
Calcineurin B (100) 3 67% 67%
Chaperone containing T-complex TRiC (130) 8 50% 25%
Pho85p (133.20) 6 33%
Glycine decarboxylase (200) 3 67%
ATPase (210) 4 100% 50%
TRAPP (260.60) 10 40%
Vps4p ATPase (260.70) 3 67%
Nucleosome protein (320) 8 100% 87% 37% 75%
Cytochrome bc1 complex (420.30) 9 44% 78% 78%
Cytochrome c oxidase (420.40) 8 50% 38% 88% 50%
F0/F1 ATP synthase (complex V)(420.5) 15 60%
Ribonucleoside reductase (430) 4 50%
Nuclear processing (440.10.10) 5 40%
RNA polymerase I (510.10) 8 38% 38% 50%
RNA polymerase II (510.40.10) 9 44%
Tornow & Mewes NAR 2003
Average Expression of all subunits in a complex
$$y = 3028.4\,x^{1.0635}$$
$$R^2 = 0.6076$$
[Figure: protein abundance vs. mRNA expression (x10^3), log-log axes]
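A power-law trendline of the form y = a x^b can be obtained with a straight-line least-squares fit in log-log space. The sketch below generates synthetic points that lie exactly on the trendline quoted on the slide and recovers its coefficients; it does not use the original complex data, and `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on log10-transformed values.

    Returns (a, b, r_squared), with r_squared computed in log space.
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    log_a = my - b * mx
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - my) ** 2 for y in ly)
    return 10 ** log_a, b, 1.0 - ss_res / ss_tot


# Synthetic, noise-free points on the slide's trendline (illustration only).
xs = [0.1, 1.0, 10.0, 100.0]
ys = [3028.4 * x ** 1.0635 for x in xs]

a, b, r2 = fit_power_law(xs, ys)
print(a, b, r2)
```

An exponent close to 1, as on the slide, means average protein abundance of a complex scales almost proportionally with the average mRNA expression of its subunits.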
PP INT Summary
Complexes: broad categories minimize noise
– Permanent complexes show strong co-expression
Posttranscriptional regulation functions at a whole-complex level (Washburn et al PNAS 2003)
– Transient complexes have weaker co-expression
Aggregated BINARY interactions (Y2H, physical, genetic)
– Weak co-expression, similar to transient complexes: noisy data?
ERROR? Minimized in larger groups
Global Summary
mRNA expression is related to protein abundance
Broad categories minimize noise that prevents us from seeing this correlation
Integrating various genomic data is integral to an analysis
Biologically relevant results can be seen when looking at mRNA and protein populations
Future Research
Further in-depth analysis of protein degradation
Integrate new Tap Tagging data into protein abundance ref set
More intensive modeling of the relationship between mRNA and protein
Relationship between mRNA and Protein levels
$$\frac{dP_i}{dt} = k_{s,i}\,\mathrm{mRNA}_i - k_{d,i}\,P_i$$
where $k_{s,i}$ and $k_{d,i}$ are the protein synthesis and degradation rate constants, respectively, and is the growth rate
At steady state:
$$P_i = \frac{k_{s,i}\,\mathrm{mRNA}_i}{k_{d,i}}$$
N end rule PEST?
N End Rule in Yeast
[Figure: in vivo half life (min, log scale 1 to 10000) by N-terminal amino acid: Arg, Lys, Phe, Leu, Trp, Asn, His, Asp, Gln, Tyr, Ile, Glu, Cys, Ala, Ser, Thr, Gly, Val, Pro, Met; grouped as Fast Decay and Slow Decay]
Results of protein degradation
Significantly higher correlation for fast decaying proteins
Not for slow decay
A high decay rate is indicative of greater cellular control over level, e.g. proteins with half lives of days: cell can’t tightly control
Results are the same for mRNA degradation (half lives have been quantified)
Acknowledgments
Gerstein Lab
This work: Ronald Jansen (MSKCC), Yuval Kluger (NYU)
Other Projects: Haiyuan Yu, Hedi Hegyi, Jimmy Lin, Rajdeep Das, Jiang Qian, Nick Luscombe
Entire Gerstein Lab
Weissman Lab: Zheng Lian
Keck (HHMI Biopolymer Laboratory and W. M. Keck Foundation Biotechnology Resource Laboratory): Christopher Colangelo, Ken Williams
Thesis Committee: Mark Gerstein, Sherman Weissman, Kevin White
Genetics Department
SABRINA
Liana