
Rethinking Experimental Design in High-Dimensional Biology

John D. Storey
University of Washington

Experimental Design

Process of detailing the manner in which data should be collected to maximize the amount of information and minimize the uncertainty about some phenomenon

Randomization

In a well designed randomized study…

– Independent factors average out over many studies to have no overall effect

– Sometimes we can even infer causation from association

High-Dimensional Biology

• Rather than studying a few biological variables at a time, instead focus on systems-level behavior according to many variables

• Example: Measure RNA and protein expression levels from many genes to build a predictive regulatory network

Gene Regulatory Networks

• “Wiring Diagram” -- the way in which the genes regulate one another and interact

• Most existing methods are based on correlation:
– Clustering with correlation-based distance
– Bayes networks from partial-correlation-based model selection

Existing Literature

• Weaver (1999) Proc Pac Symp Biocomp

• D'haeseleer et al. (2000) Bioinformatics

• Milo et al. (2002) Science

• Friedman (2004) Science

• Kelly and Ideker (2005) Nat Biotech

• Lee et al. (2006) PNAS

• Many more…


Causal Modeling

• Randomization is the gold standard for inferring causality

• Rich literature on this topic

• Partial correlation approaches are too vulnerable to confounding

• Model selection doesn’t directly test each individual causal relationship

• Neither offers a natural measure of significance for causality

Statistical Challenges

• High-throughput molecular profiling data are essentially observational

• Small sample size, large number of genes

• Factors subjected to one randomization (if any) may affect many variables

• Highly complex regulation and interaction of genes

Randomization in Systems Biology

• What do we get out of classical definitions of a well designed randomized study?

• Is association likely to imply causation in high-throughput biology?

• Is there a better way of thinking about randomization and inferring causality?

Expression in Humans

• Age

• Weight

• Health state

• Environment

• Genetics

• Batch

• Clinic

• Many others…

Illustrative Simulation

• 1,000 genes
• 20 arrays
• Gaussian noise
• Genes 1-300 differentially expressed
• Genes 201-500 affected by an independent hidden variable

[Figure: simulated data; axes: Arrays, Genes]

[Figure panels: All P-values; Null P-values]
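The simulation setup listed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the slides give the dimensions and the gene ranges, but the effect sizes (1.0), the two-group layout of the primary variable, the binary form of the hidden variable, and the function names `simulate_expression` and `t_statistics` are all hypothetical choices made here.

```python
import numpy as np

def simulate_expression(n_genes=1000, n_arrays=20, seed=0):
    """Simulate the slide's setup; effect sizes and group layout are assumptions."""
    rng = np.random.default_rng(seed)
    # Primary variable (assumed): first half of the arrays in group 0, rest in group 1.
    group = np.repeat([0.0, 1.0], n_arrays // 2)
    # Hidden variable (assumed binary): assigned independently of the primary variable.
    hidden = rng.permutation(group)
    data = rng.normal(size=(n_genes, n_arrays))   # Gaussian noise
    data[:300] += 1.0 * group                     # genes 1-300: differentially expressed
    data[200:500] += 1.0 * hidden                 # genes 201-500: hidden variable
    return data, group, hidden

def t_statistics(data, group):
    """Per-gene Welch t-statistic comparing group 1 against group 0."""
    a, b = data[:, group == 0], data[:, group == 1]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return (b.mean(axis=1) - a.mean(axis=1)) / se

data, group, hidden = simulate_expression()
t = t_statistics(data, group)

# In any single realization the hidden variable can be correlated with the
# primary variable by chance, which shifts the statistics of genes 301-500.
print("hidden/primary correlation:", np.corrcoef(group, hidden)[0, 1])
print("mean |t|, genes 1-300 (true signal):", np.abs(t[:300]).mean())
print("mean |t|, genes 301-500 (null, hidden variable):", np.abs(t[300:500]).mean())
print("mean |t|, genes 501-1000 (null, untouched):", np.abs(t[500:]).mean())
```

Rerunning with different `seed` values gives a feel for how much the statistics of the null genes 301-500 vary from one realization to the next, even though the hidden variable is independent of the primary variable on average.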

Expression Heterogeneity (EH)

Impact of EH on Testing

• Can massively distort signal
– Add spurious signal
– Remove real signal

• FDR estimates become highly variable

• Significance rankings of tests become highly variable


P-values for a Single True Null

Paradox?

• Randomization works for interrogating a single variable when considering lots of repeated experiments

• Randomization does not always work for interrogating many variables at once, even when considering lots of repeated experiments

Main Point

• In a traditional well designed study a single randomization is applied to thousands of variables

• Unmodeled or unmeasured factors should be taken into account

• We have a lot of data… so this is feasible

• A mechanism for performing many independent randomizations is needed

Overcoming Dependence

Goal: Estimate factors causing EH and incorporate them into analysis

Problem: Many causes of EH are unknown or unmeasured

Solution: Estimate signatures of EH directly from expression data… “Surrogate Variable Analysis”

SVA Algorithm

Inputs: Observed Data and the primary variable

1. Regress out the primary variable from the Observed Data to obtain the Residual Data
2. Calculate residual eigengenes
3. Identify the genes that are significantly associated with the residual eigengene
4. Return to the original data for that subset
5. Calculate eigengenes of the subset
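The flow of the SVA algorithm can be illustrated with a minimal NumPy sketch. It is a simplification and not the published procedure: it handles only one residual eigengene, it ranks genes by absolute correlation rather than running a significance test, and the function name `surrogate_variable_sketch`, the `n_top` parameter, and the simulated data are assumptions introduced here for illustration.

```python
import numpy as np

def surrogate_variable_sketch(data, primary, n_top=None):
    """Simplified single-surrogate-variable sketch (not the full SVA algorithm).

    data: genes x arrays matrix; primary: vector with one value per array.
    n_top: hypothetical tuning parameter, the number of genes kept in the subset.
    """
    n_genes, n_arrays = data.shape
    if n_top is None:
        n_top = n_genes // 4
    # Regress out the primary variable (with intercept) from every gene.
    design = np.column_stack([np.ones(n_arrays), primary])
    coef, *_ = np.linalg.lstsq(design, data.T, rcond=None)
    residuals = data - (design @ coef).T
    # Residual eigengene: leading right singular vector of the residual data.
    _, _, vt = np.linalg.svd(residuals, full_matrices=False)
    eigengene = vt[0]
    # Genes most associated with the residual eigengene. Ranking by absolute
    # correlation stands in for the significance test named on the slide.
    centered = residuals - residuals.mean(axis=1, keepdims=True)
    e = eigengene - eigengene.mean()
    corr = centered @ e / (np.linalg.norm(centered, axis=1) * np.linalg.norm(e))
    subset = np.argsort(-np.abs(corr))[:n_top]
    # Return to the original data for that subset and take its leading eigengene.
    sub = data[subset] - data[subset].mean(axis=1, keepdims=True)
    _, _, vt_sub = np.linalg.svd(sub, full_matrices=False)
    return vt_sub[0], subset

# Simulated example (all values assumed): the hidden variable is balanced
# against the primary variable by construction.
rng = np.random.default_rng(0)
primary = np.repeat([0.0, 1.0], 10)
hidden = np.tile([0.0, 1.0], 10)
data = rng.normal(size=(1000, 20))
data[:300] += primary
data[200:500] += 1.5 * hidden

sv, subset = surrogate_variable_sketch(data, primary)
print("|corr(surrogate variable, hidden)| =", abs(np.corrcoef(sv, hidden)[0, 1]))
```

The estimated surrogate variable `sv` could then be added as a covariate alongside the primary variable when testing each gene. The sign of `sv` is arbitrary, which is why the example reports an absolute correlation.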

[Figure panels: Unadjusted P-values; SV-adjusted P-values; Unadjusted Null P-values; SV-adjusted Null P-values]

Experimental Crosses

• Randomized genetic backgrounds -- they provide a natural randomization of the “hardwired” component of the cell

• If DNA variation at a locus causes transcriptional variation for some gene, then randomizing the genotype of the locus in turn randomizes the transcriptional levels

RIL Experiments

Linkage ⇒ Randomized Expression

[Figure: fold change (log2) for all Seg, Seg that inherit the BY marker, and Seg that inherit the RM marker]

Genetic Analysis of Gene Expression

• Segregants: Seg 1, Seg 2, ..., Seg 112
• mRNA expression: 6,000 phenotypes
• Oligo genotyping: 3,000 markers
• Brem & Kruglyak (2005)

[Figure: Trait Position vs. Marker Position, two panels, with “trans-linkage” annotated]

Large-Scale Perturbations

[Figure: Trait Position vs. Marker Position, two panels, with “trans-linkage” annotated]

After Surrogate Variable Analysis

[Figure: Trait Position vs. Marker Position]

~55% show cis-linkage

Local Perturbations

L = a locus, TA = transcript A, TB = transcript B

Causality Equivalence Theorem: L → TA → TB with no hidden causal relationships if and only if:

1. L → TA
2. L → TB
3. L ⊥ TB | TA
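The three conditions can be checked on simulated data to see what they distinguish. The sketch below rests on assumptions: a linear-Gaussian model, linear partial correlation as a stand-in for conditional independence, a sample size far larger than the 112 segregants, and a hypothetical helper named `partial_corr`. It compares a true chain L → TA → TB against an alternative in which L affects TA and TB separately.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing z (plus intercept) out of both."""
    design = np.column_stack([np.ones_like(z), z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(1)
n = 5000  # assumed; chosen large so the sample values are stable
L = rng.integers(0, 2, size=n).astype(float)  # randomized genotype at the locus

# Chain: L -> TA -> TB
TA = L + rng.normal(size=n)
TB = TA + rng.normal(size=n)

# Alternative: L -> TA and L -> TB, but TA does not regulate TB
TB_alt = L + rng.normal(size=n)

print("Chain L -> TA -> TB")
print("  corr(L, TA)      =", np.corrcoef(L, TA)[0, 1])   # condition 1
print("  corr(L, TB)      =", np.corrcoef(L, TB)[0, 1])   # condition 2
print("  pcorr(L, TB|TA)  =", partial_corr(L, TB, TA))    # condition 3: near zero

print("Alternative without TA -> TB")
print("  corr(L, TB)      =", np.corrcoef(L, TB_alt)[0, 1])
print("  pcorr(L, TB|TA)  =", partial_corr(L, TB_alt, TA))  # clearly nonzero
```

In both models the locus is associated with both transcripts, so conditions 1 and 2 alone cannot separate them. Only the chain makes the locus uninformative about TB once TA is accounted for.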

[Figure: Relative Expression Levels; panels: TA or TB Expression (All Segregants), TA Expression (L=BY, L=RM), TB Expression (L=BY, L=RM), TB | TA Expression (L=BY, L=RM)]

Regulatory Probabilities

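\[
\begin{aligned}
P_{ij} &= \Pr(T_i \to T_j) \\
       &\geq \Pr(L_i \to T_i \to T_j) \\
       &= \Pr(L_i \to T_i) \times \Pr(L_i \to T_j \mid L_i \to T_i) \times \Pr\big([L_i \perp T_j \mid T_i] \mid L_i \to T_i,\ L_i \to T_j\big)
\end{aligned}
\]

For each pair of genes i and j [i, j = 1, 2, …, 6216]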
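The arithmetic of the factorization is a product of three probabilities, one for each condition of the theorem. The sketch below only illustrates that product. The function name `regulatory_probability` and the input values are hypothetical, and how each factor is estimated from data is not covered here.

```python
def regulatory_probability(p_linkage_i, p_linkage_j_given_i, p_cond_indep):
    """Product of the three factors in the factorization of Pr(L_i -> T_i -> T_j).

    p_linkage_i:          Pr(L_i -> T_i)
    p_linkage_j_given_i:  Pr(L_i -> T_j | L_i -> T_i)
    p_cond_indep:         Pr([L_i independent of T_j given T_i] | both linkages)
    """
    factors = (p_linkage_i, p_linkage_j_given_i, p_cond_indep)
    if not all(0.0 <= p <= 1.0 for p in factors):
        raise ValueError("each factor must be a probability in [0, 1]")
    return p_linkage_i * p_linkage_j_given_i * p_cond_indep

# Hypothetical factor values, chosen only to show the arithmetic.
p_ij = regulatory_probability(0.99, 0.98, 0.95)
print(f"P_ij lower bound: {p_ij:.3f}")
print("passes a 90% probability threshold:", p_ij >= 0.90)
```

Because the result is a product, a pair of genes clears a high threshold only when all three factors are individually close to 1.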

• At 90% probability threshold (FDR=6%):
– CNS1 significantly regulates 144 genes
– ILV6 significantly regulates 51 genes
– SAL1 significantly regulates 36 genes

• At 80% probability threshold (FDR=11%):
– NAM9 significantly regulates 14 genes

Probability   No. Regulators   Total No. Genes   No. Edges   FDR
95%           76               1075              1499        2.7%
90%           127              2145              4394        6.0%
80%           255              4044              15,448      13%

[Figure: networks for CNS1, ILV6, SAL1, and NAM9]

NAM9 GO Analysis

• ribosome
• translation
• mitochondrion
• mitochondrial ribosome

FDR=6%, All Prob > 90%

Concluding Remarks

• Benefits of classical well-designed randomized experiments are limited in the high-throughput setting

• Surrogate Variable Analysis captures and models large-scale signatures of unmodeled factors

• RIL experiments offer a natural randomization of the hardwired components of the cell

• This randomization can be used to infer causal regulatory relationships among genes -- Trigger

Acknowledgements

Surrogate Variables:
• Jeffrey Leek

Causal Networks:
• Lin Chen

NIH grants: R01 HG002913, U54 GM2119

Collaborators: Josh Akey, Ron Davis, Leonid Kruglyak, Eric Schadt, Ronald Tompkins, Wenzhong Xiao, Glue grant

Manuscripts

• Leek JT and Storey JD (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics.

• Chen L, Emmert-Streib F, and Storey JD (2007) Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology.
