Rethinking Experimental Design in High-Dimensional Biology
John D. Storey, University of Washington
Experimental Design
Process of detailing the manner in which data should be collected to maximize the amount of information and minimize the uncertainty about some phenomenon
Randomization
In a well designed randomized study…
– Independent factors average out over many studies to have no overall effect
– Sometimes we can even infer causation from association
High-Dimensional Biology
• Rather than studying a few biological variables at a time, instead focus on systems level behavior according to many variables
• Example: Measure RNA and protein expression levels from many genes to build a predictive regulatory network
Gene Regulatory Networks
• “Wiring Diagram” -- the way in which the genes regulate one another and interact
• Most existing methods are based on correlation:
– Clustering with correlation based distance
– Bayes networks from partial-correlation based model selection
Existing Literature
• Weaver (1999) Proc Pac Symp Biocomp
• D'haeseleer et al. (2000) Bioinformatics
• Milo et al. (2002) Science
• Friedman (2004) Science
• Kelly and Ideker (2005) Nat Biotech
• Lee et al. (2006) PNAS
• Many more…
Causal Modeling
• Randomization is the gold standard for inferring causality
• Rich literature on this topic
• Partial correlation approaches are too vulnerable to confounding
• Model selection doesn’t directly test each individual causal relationship
• Neither offers a natural measure of significance for causality
Statistical Challenges
• High-throughput molecular profiling data are essentially observational
• Small sample size, large number of genes
• Factors subjected to one randomization (if any) may affect many variables
• Highly complex regulation and interaction of genes
Randomization in Systems Biology
• What do we get out of classical definitions of a well designed randomized study?
• Is association likely to imply causation in high-throughput biology?
• Is there a better way of thinking about randomization and inferring causality?
Expression in Humans
• Age
• Weight
• Health state
• Environment
• Genetics
• Batch
• Clinic
• Many others…
Illustrative Simulation
• 1,000 genes
• 20 arrays
• Gaussian noise
[Figure: simulated expression data; axes: Arrays, Genes]
Illustrative Simulation
• 1,000 genes
• 20 arrays
• Gaussian noise
• Genes 1-300 differentially expressed
[Figure: simulated expression data; axes: Arrays, Genes]
Illustrative Simulation
• 1,000 genes
• 20 arrays
• Gaussian noise
• Genes 1-300 differentially expressed
• Genes 201-500 affected by an independent hidden variable
[Figure: simulated expression data; axes: Arrays, Genes]
Expression Heterogeneity
[Figure: P-value histograms; panels: All P-values, Null P-values]
Impact of EH on Testing
• Can massively distort signal
– Add spurious signal
– Remove real signal
• FDR estimates become highly variable
• Significance ranking of tests becomes highly variable
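The simulation and its effect on testing can be sketched in plain Python. This is only an illustration under stated assumptions: the effect sizes (2.0 for both the primary and the hidden variable), the balanced 10/10 group split, the pooled-variance t-test, and all function names are hypothetical choices, since the slides give only the gene ranges, the number of arrays, and Gaussian noise. The sketch reports, for each simulated study, the fraction of true-null genes rejected at the 5% level, separately for null genes touched by the hidden variable (301-500) and null genes that are not (501-1000).

```python
import math
import random

N_GENES, N_ARRAYS = 1000, 20
# Two-sided 5% critical value of Student's t with 18 degrees of freedom
# (valid only for the 10 vs. 10 design assumed here).
T_CRIT = 2.101

def t_statistic(x, y):
    """Pooled-variance two-sample t-statistic."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    pooled_var = ss / (nx + ny - 2)
    return (my - mx) / math.sqrt(pooled_var * (1.0 / nx + 1.0 / ny))

def simulate_study(rng, effect=2.0, hidden_effect=2.0):
    """One study: returns a t-statistic per gene (gene 1 is index 0)."""
    half = N_ARRAYS // 2
    group = [0] * half + [1] * half                         # primary variable
    hidden = [rng.randint(0, 1) for _ in range(N_ARRAYS)]   # independent hidden variable
    tstats = []
    for gene in range(1, N_GENES + 1):
        row = []
        for a in range(N_ARRAYS):
            value = rng.gauss(0.0, 1.0)                     # Gaussian noise
            if gene <= 300:                                 # genes 1-300: differentially expressed
                value += effect * group[a]
            if 201 <= gene <= 500:                          # genes 201-500: hidden variable
                value += hidden_effect * hidden[a]
            row.append(value)
        tstats.append(t_statistic(row[:half], row[half:]))
    return tstats

def rejection_rate(tstats, first_gene, last_gene):
    """Fraction of genes first_gene..last_gene (1-based, inclusive) with |t| > T_CRIT."""
    block = tstats[first_gene - 1:last_gene]
    return sum(1 for t in block if abs(t) > T_CRIT) / len(block)

if __name__ == "__main__":
    rng = random.Random(0)
    for study in range(5):
        t = simulate_study(rng)
        print(study,
              "null + hidden:", rejection_rate(t, 301, 500),
              "null only:", rejection_rate(t, 501, 1000))
```

Because every gene in 201-500 shares the same hidden variable within a study, a chance imbalance of that variable between the two groups shifts all of their test statistics together, which is one way to see why the null behavior can differ so much from one study to the next.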
P-values for a Single True Null
Paradox?
• Randomization works for interrogating a single variable when considering lots of repeated experiments
• Randomization does not always work for interrogating many variables at once, even when considering lots of repeated experiments
Main Point
• In a traditional well designed study a single randomization is applied to thousands of variables
• Unmodeled or unmeasured factors should be taken into account
• We have a lot of data… so this is feasible
• A mechanism for performing many independent randomizations is needed
Overcoming Dependence
Goal: Estimate factors causing EH and incorporate them into analysis
Problem: Many causes of EH are unknown or unmeasured
Solution: Estimate signatures of EH directly from expression data… “Surrogate Variable Analysis”
SVA Algorithm
1. Regress out the primary variable (Observed Data → Residual Data)
2. Calculate residual eigengenes
3. Identify the genes that are significantly associated with the residual eigengene
4. Return to the original data for that subset
5. Calculate eigengenes of the subset
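The flow of the algorithm can be illustrated with a heavily simplified sketch in plain Python. It is not the published SVA procedure: it estimates a single surrogate variable, finds only the leading eigengene (by power iteration), treats the primary variable as a set of group labels, centers each gene of the subset before the final step, and replaces the significance test for gene selection with a fixed correlation cutoff. The cutoff `min_abs_corr` and all function names are hypothetical.

```python
import math

def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    cx, cy = center(x), center(y)
    num = sum(a * b for a, b in zip(cx, cy))
    return num / math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))

def regress_out_group(row, group):
    """Residuals after subtracting each primary-variable group's mean."""
    means = {}
    for level in set(group):
        vals = [v for v, g in zip(row, group) if g == level]
        means[level] = sum(vals) / len(vals)
    return [v - means[g] for v, g in zip(row, group)]

def top_eigengene(rows, n_iter=100):
    """Leading right singular vector of a genes x arrays matrix (power iteration)."""
    n = len(rows[0])
    gram = [[0.0] * n for _ in range(n)]        # arrays x arrays Gram matrix
    for r in rows:
        for a in range(n):
            for b in range(a, n):
                gram[a][b] += r[a] * r[b]
    for a in range(n):
        for b in range(a):
            gram[a][b] = gram[b][a]
    v = [math.sin(a + 1.0) for a in range(n)]   # arbitrary non-constant start vector
    for _ in range(n_iter):
        w = [sum(gram[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def surrogate_variable(data, group, min_abs_corr=0.6):
    # Step 1: regress out the primary variable.
    residual = [regress_out_group(row, group) for row in data]
    # Step 2: leading residual eigengene.
    eigengene = top_eigengene(residual)
    # Step 3: genes associated with it (correlation cutoff stands in for a test).
    # Step 4: go back to the original data for that subset.
    subset = [row for row, res in zip(data, residual)
              if abs(correlation(res, eigengene)) > min_abs_corr]
    if not subset:
        return eigengene
    # Step 5: leading eigengene of the (gene-centered) subset.
    return top_eigengene([center(row) for row in subset])
```

The returned vector has one entry per array and could then be included as a covariate when testing each gene against the primary variable.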
[Figure: P-value histograms; panels: Unadjusted P-values, SV-adjusted P-values, Unadjusted Null P-values, SV-adjusted Null P-values]
Experimental Crosses
• Randomized genetic backgrounds -- they provide a natural randomization of the “hardwired” component of the cell
• If DNA variation at a locus causes transcriptional variation for some gene, then randomizing the genotype of the locus in turn randomizes the transcriptional levels
RIL Experiments
Linkage ⇒ Randomized Expression
[Figure: fold change (log2) for all segregants, segregants inheriting the BY marker, and segregants inheriting the RM marker]
Genetic Analysis of Gene Expression
• Segregants: Seg 1, Seg 2, ..., Seg 112
• mRNA expression: 6,000 phenotypes
• Oligo genotyping: 3,000 markers
• Brem & Kruglyak (2005)
[Figure: Trait Position vs. Marker Position linkage plots; annotation: “trans-linkage”]
Large-Scale Perturbations
[Figure: Trait Position vs. Marker Position; annotation: “trans-linkage”]
After Surrogate Variable Analysis
[Figure: Trait Position vs. Marker Position]
~55% show cis-linkage
Local Perturbations
L = a locus, TA = transcript A, TB = transcript B
Causality Equivalence Theorem: L → TA → TB with no hidden causal relationships if and only if:
1. L → TA
2. L → TB
3. L ⊥ TB | TA
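The three conditions can be illustrated on simulated data. The sketch below is an assumption-laden toy, not the testing procedure behind the theorem: it uses a linear Gaussian model with made-up effect sizes, and it uses sample correlation and partial correlation as stand-ins for the association and conditional independence statements. It contrasts a true chain (L → TA → TB) with a "fork" in which L affects TA and TB separately; the function names and the two scenario labels are hypothetical.

```python
import math
import random

def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """Correlation of x and y after linearly adjusting both for z."""
    rxy, rxz, ryz = correlation(x, y), correlation(x, z), correlation(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def simulate(kind, n, rng):
    """Simulate (L, TA, TB) under a 'chain' or a 'fork' (both hypothetical models)."""
    L = [rng.randint(0, 1) for _ in range(n)]          # randomized genotype at the locus
    TA = [2.0 * l + rng.gauss(0.0, 1.0) for l in L]    # L -> TA in both scenarios
    if kind == "chain":                                # L -> TA -> TB
        TB = [a + rng.gauss(0.0, 1.0) for a in TA]
    elif kind == "fork":                               # L -> TA and L -> TB, no TA -> TB
        TB = [2.0 * l + rng.gauss(0.0, 1.0) for l in L]
    else:
        raise ValueError("kind must be 'chain' or 'fork'")
    return L, TA, TB

def check_conditions(L, TA, TB):
    return {
        "L_TA": correlation(L, TA),                      # condition 1: L -> TA
        "L_TB": correlation(L, TB),                      # condition 2: L -> TB
        "L_TB_given_TA": partial_correlation(L, TB, TA)  # condition 3: L independent of TB given TA
    }

if __name__ == "__main__":
    rng = random.Random(0)
    for kind in ("chain", "fork"):
        print(kind, check_conditions(*simulate(kind, 2000, rng)))
```

In both scenarios L is associated with TA and with TB, so conditions 1 and 2 alone cannot separate them. Only in the chain does the association between L and TB vanish once TA is accounted for.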
[Figure: Relative Expression Levels; panels: TA or TB Expression (All Segregants), TA Expression (L=BY, L=RM), TB Expression (L=BY, L=RM), TB | TA Expression (L=BY, L=RM)]
Regulatory Probabilities
\[
P_{ij} = \Pr(T_i \to T_j) \ge \Pr(L_i \to T_i \to T_j)
= \Pr(L_i \to T_i) \times \Pr(L_i \to T_j \mid L_i \to T_i) \times \Pr\big([L_i \perp T_j \mid T_i] \mid L_i \to T_i,\ L_i \to T_j\big)
\]
For each pair of genes i and j [i, j = 1, 2, …, 6216]
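As a small worked illustration, the regulatory probability for a pair is the product of three probabilities: that the locus is linked to transcript i, that it is linked to transcript j given the first link, and that the locus is independent of transcript j given transcript i, conditional on both links. The sketch below only multiplies the three terms and applies a probability threshold; how each term is estimated is not shown, and the function names, gene labels, and numbers are hypothetical.

```python
def regulatory_probability(p_li_ti, p_li_tj_given_li_ti, p_indep_given_links):
    """Product of the three component probabilities for one (i, j) pair."""
    for p in (p_li_ti, p_li_tj_given_li_ti, p_indep_given_links):
        if not 0.0 <= p <= 1.0:
            raise ValueError("each component must be a probability in [0, 1]")
    return p_li_ti * p_li_tj_given_li_ti * p_indep_given_links

def call_edges(probabilities, threshold=0.90):
    """Return the (regulator, target) pairs whose probability meets the threshold."""
    return {pair for pair, p in probabilities.items() if p >= threshold}

if __name__ == "__main__":
    # Made-up component probabilities for two hypothetical gene pairs.
    probs = {
        ("geneA", "geneB"): regulatory_probability(0.99, 0.98, 0.95),
        ("geneA", "geneC"): regulatory_probability(0.90, 0.90, 0.90),
    }
    print(call_edges(probs, threshold=0.90))
```

Because the three terms multiply, a pair clears a high threshold only when every component is itself close to 1.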
• At 90% probability threshold (FDR=6%):
– CNS1 significantly regulates 144 genes
– ILV6 significantly regulates 51 genes
– SAL1 significantly regulates 36 genes
• At 80% probability threshold (FDR=11%):
– NAM9 significantly regulates 14 genes
Probability   No. Regulators   Total No. Genes   No. Edges   FDR
95%           76               1075              1499        2.7%
90%           127              2145              4394        6.0%
80%           255              4044              15,448      13%
[Figure: regulatory network; labeled regulators: CNS1, ILV6, SAL1, NAM9]
FDR=6%, All Prob > 90%
NAM9 GO Analysis
• ribosome
• translation
• mitochondrion
• mitochondrial ribosome
Concluding Remarks
• Benefits of classical well-designed randomized experiments are limited in the high-throughput setting
• Surrogate Variable Analysis captures and models large-scale signatures of unmodeled factors
• RIL experiments offer a natural randomization of the hardwired components of the cell
• This randomization can be used to infer causal regulatory relationships among genes -- Trigger
Acknowledgements
Surrogate Variables:
• Jeffrey Leek
Causal Networks:
• Lin Chen
NIH grants
R01 HG002913
U54 GM2119
Collaborators
Josh Akey
Ron Davis
Leonid Kruglyak
Eric Schadt
Ronald Tompkins
Wenzhong Xiao
Glue grant
Manuscripts
• Leek JT and Storey JD (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics.
• Chen L, Emmert-Streib F, and Storey JD (2007) Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology.