Lecture 10: Experimental designs, resampling techniques
Chapters 9 & 10
Geir Storvik, Matematisk institutt, Universitetet i Oslo
19 March 2014



Page 2:

Content

- Last time:
  - Statistical models and inference (chap 8)
    - Some examples, but very general discussion
    - Many models and methods, many important details
  - Experimental design (chap 9)
    - Randomization, replication, sample size
- Today:
  - Continue on experimental design (chap 9)
    - Sample size, pooling, blocking
  - Evaluation of performance (chap 10)
    - Training, validation, test
    - Bootstrapping
- Main aim: Understand the general ideas!

Page 3:

Sample size

Last time:

- General rule: need about 5-10 observations for each parameter
- More specific: power/performance calculation
- Example: X1,...,Xn ~ N(µC, σ) independent, Y1,...,Yn ~ N(µT, σ) independent
  - 95% confidence interval for ∆ = µC − µT: x̄ − ȳ ± 1.96 σ √(2/n)
  - Want the width to be δ, implying 2 · 1.96 σ √(2/n) = δ ⇒ n = 8 · 1.96² σ² / δ²
  - Testing H0: ∆ = 0: reject H0 if |x̄ − ȳ| / (σ √(2/n)) > 1.96
  - Want Pr(reject H0 | ∆ = 1) = 0.8, implying Pr(|x̄ − ȳ| / (σ √(2/n)) > 1.96 | ∆ = 1) = 0.8 ⇒ n = 25.98 σ²
- Need preliminary experiments to get a number for σ.
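The two calculations above are easy to sanity-check numerically. A minimal Python sketch (not from the lecture; the normal quantiles 1.96 and 0.8416 for level 0.05 and power 0.8 are hard-coded constants):

```python
import math

def n_for_ci_width(sigma, delta, z=1.96):
    """Per-group n so the 95% CI x-bar - y-bar +/- z*sigma*sqrt(2/n)
    has total width delta: 2*z*sigma*sqrt(2/n) = delta
    => n = 8 * z^2 * sigma^2 / delta^2."""
    return math.ceil(8 * z**2 * sigma**2 / delta**2)

def n_for_power(sigma, effect, z_alpha=1.96, z_beta=0.8416):
    """Standard two-sample normal approximation:
    n = 2 * (z_alpha + z_beta)^2 * sigma^2 / effect^2 per group."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma**2 / effect**2)

print(n_for_ci_width(sigma=1.0, delta=1.0))  # 31
print(n_for_power(sigma=1.0, effect=1.0))    # 16
```

Note that `n_for_power` uses the generic textbook approximation, not the slide's constant 25.98 σ².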

Page 4:

Testing multiple genes

- Assume now Xi,j ~ N(µC,i, σi), Yi,j ~ N(µT,i, σi)
- Want to test Hi: ∆i = µC,i − µT,i = 0, i = 1,...,G
  Reject Hi if |x̄i − ȳi| / (σi √(2/n)) > zαG
- How to specify zαG? How to specify sample size?
- Lecture 8:
  - Family-wise error rate: Pr(FP ≥ 1) < α
  - False discovery rate: E(FDR) = E(# false positive predictions / # total positive predictions)
  - Control E(FDR) at the desired significance level [2]

Page 5:

Family-wise error rate

- Specify/estimate σi for all genes
- Bonferroni correction: αG = α/G
- Test procedure: reject those Hi for which |x̄i − ȳi| / (σi √(2/n)) > zαG
- Example: assume α = 0.05. For G = 1: zαG = 1.96; for G = 12625: zαG = 4.61
- Ensures Pr(any type I error) ≤ α
- Sample size:
  - Compute the sample size separately for each gene (as before, but using αG)
  - Summarize using appropriate plots
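Bonferroni cutoffs like the zαG values above come straight from the standard normal quantile function; a small Python illustration (not from the lecture):

```python
from statistics import NormalDist

def bonferroni_z(alpha, G):
    """Two-sided critical value for the per-test level alpha_G = alpha / G:
    the z with 2 * (1 - Phi(z)) = alpha / G."""
    return NormalDist().inv_cdf(1 - alpha / (2 * G))

print(round(bonferroni_z(0.05, 1), 2))      # 1.96
print(round(bonferroni_z(0.05, 12625), 2))  # 4.61
```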

Page 6:

Example: baseline variability for a gene expression experiment

Vector of 12,625 standard deviations of gene expression data normalized via the RMA method (i.e., on log2 scale), with names from Affymetrix probe set IDs.

library(ssize)
data(exp.sd)
hist(exp.sd, n=20)

Page 7:

ssize library

fold.change = 2.0; sig.level = 0.05; power = 0.8
all.size <- ssize(sd=exp.sd, delta=log2(fold.change),
                  sig.level=sig.level, power=power)
max(all.size)
# 559.0834
plot(exp.sd, sqrt(all.size))
ssize.plot(all.size, lwd=2, col="magenta", xlim=c(1,20))

Page 8:

False discovery rate (FDR)

                 Test result
True       non-DE     DE    Total
non-DE     A=9025   B=475    9500
DE          C=100   D=400     500
Total        9125     875   10000

- False positive rate (type I error): FPR = B/(A+B) = 5%
- False negative rate (type II error): FNR = C/(C+D) = 20%
- Sensitivity: D/(C+D) = 80%
- False discovery rate: FDR = B/(B+D) = 54%
- False non-discovery rate: FNDR = C/(A+C) = 1%
- Problem: high proportion of non-DE genes
- Can control FDR by reducing FPR (adjusting α to αG); an indirect method
- Better to consider FDR and FNDR directly?
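The rates follow directly from the four counts in the table; a short Python check (counts taken from the table above):

```python
# Confusion-matrix counts from the table: rows = truth, columns = test result.
A, B, C, D = 9025, 475, 100, 400

fpr  = B / (A + B)   # false positive rate (type I error)
fnr  = C / (C + D)   # false negative rate (type II error)
sens = D / (C + D)   # sensitivity
fdr  = B / (B + D)   # false discovery rate
fndr = C / (A + C)   # false non-discovery rate

print(f"FPR={fpr:.0%} FNR={fnr:.0%} sens={sens:.0%} FDR={fdr:.0%} FNDR={fndr:.0%}")
# FPR=5% FNR=20% sens=80% FDR=54% FNDR=1%
```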

Page 9:

FDR cont.

- Consider two populations, σi = σ for all i
- General approach for identifying DE genes:
  1. Compute the test statistic for all genes
  2. Sort the statistics by order
  3. Determine a cutoff point beyond which all genes are assumed to be DE
- Bonferroni: cutoff defined through control of the type I error
- FDR: cutoff defined through control of the false discovery rate
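The sort-and-cutoff recipe is exactly how FDR-controlling procedures work in practice. As an illustration (this is the Benjamini-Hochberg step-up rule; the slide itself does not name a specific procedure), a small Python sketch:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up rule: sort the p-values, find the largest
    rank k with p_(k) <= k*q/G, and reject the k smallest. Returns the
    indices (into pvals) of the rejected hypotheses."""
    G = len(pvals)
    order = sorted(range(G), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / G:
            k = rank
    return sorted(order[:k])

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27, 0.60]))  # [0, 1]
```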

Page 10:

OCplus library

- Input:
  - p0: proportion of non-differentially expressed genes
  - σ: standard deviation of the log expression values
  - D: assumed average log fold change
  - n: sample size
- Output: FDR

Page 11:

OCplus library - samplesize

library(OCplus)
fold.change = 2
res = samplesize(n=seq(5,40,by=5), p0=0.95, D=log2(fold.change),
                 crit=0.01, sigma=1)
round(res, 4)

#    FDR_0.01 fdr_0.01
# 5    0.6383   0.6993
# 10   0.2525   0.3800
# 15   0.0729   0.1467
# 20   0.0179   0.0431
# 25   0.0043   0.0114
# 30   0.0010   0.0030
# 35   0.0003   0.0008
# 40   0.0001   0.0002

Page 12:

OCplus library - cont.

library(OCplus)
par(mfrow=c(1,2))
TOC(p0=c(0.9,0.95,0.99), D=1, n=5)
legend(4, 0.3, c("FDR","alpha","sensitivity"), lty=1:3)
TOC(p0=c(0.9,0.95,0.99), D=1, n=30)
legend(3.5, 1, c("FDR","alpha","sensitivity"), lty=1:3)

Page 13:

Sample size and classification

- More complicated: depends on the number of genes and the differences between classes
- General rule: need 20-30 samples per class
- http://linus.nci.nih.gov/brb/samplesize/samplesize4GE.html

Page 14:

Pooling

- Comparing two groups: T = |x̄ − ȳ| / (σ √(2/n))
- Sufficient statistics: x̄ and ȳ
- Possible to observe x̄ and ȳ directly?
- Genomics data: yes, by pooling individuals!
- In addition to reducing the number of samples, pooling can reduce variability!

Page 15:

Pooling

Page 16:

Pooling - possible problems

- Uncertainty due to different amounts from each individual
- Pooling at the original scale, average calculated on the log2 scale

When beneficial?

- When biological variability is larger than "technical" variability
- When only a small number of samples is possible
- When individuals are difficult to obtain (small organisms)

Page 17:

Blocking

- Assume it is of interest to compare two treatments of a disease
- Assume all individuals given treatment 1 are older than 50
- Assume all individuals given treatment 2 are younger than 50
- A significant difference between the groups is found
- Question: is the difference due to treatment or to age?
- Problem: bad design!!!
  - Note: this could happen using randomization
- Better: ensure that both age groups are similar in both treatment groups

Page 18:

Blocking

- Aim: reduce variability
- Approach: group units into homogeneous groups, e.g. by sex, size, age, etc.
- Comparison between populations within groups
- ANOVA-type analysis for simultaneous comparison
- Book: many special designs for microarrays (sec 9.4.2, 9.5, 9.6); not that applicable for new technologies
- General blocking strategies still useful!!
- Possibilities in combining blocking and pooling [1]

Page 19:

Complete block design

- Assume B blocks, P populations
- Assume B individuals from each population are available
- Allocate 1 individual from each population within each block
- Allocate by randomization!!!

Page 20:

Incomplete block design

- Assume B blocks, P populations
- Assume n < B individuals from each population are available
- Balance:
  - Allocate an equal number of individuals to each block
  - Allocate such that pairwise comparisons of populations are possible in an equal number of blocks
- Allocate by randomization!!!
- Example: B = 10, P = 6, n = 5
  B1(1,2,5), B2(1,2,6), B3(1,3,4), B4(1,3,6), B5(1,4,5),
  B6(2,3,4), B7(2,3,5), B8(2,4,6), B9(3,5,6), B10(4,5,6)
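The balance properties of the example design can be verified mechanically; a short Python check of the block layout above:

```python
from itertools import combinations
from collections import Counter

# The incomplete block design from the slide: 10 blocks, P = 6 populations,
# 3 populations per block, each population replicated in n = 5 blocks.
blocks = [(1,2,5), (1,2,6), (1,3,4), (1,3,6), (1,4,5),
          (2,3,4), (2,3,5), (2,4,6), (3,5,6), (4,5,6)]

replicates = Counter(p for b in blocks for p in b)
pair_counts = Counter(pair for b in blocks for pair in combinations(sorted(b), 2))

print(set(replicates.values()))   # {5}: every population appears in 5 blocks
print(set(pair_counts.values()))  # {2}: every pair meets in exactly 2 blocks
```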

Page 21:

Summary

- Three main design issues:
  - Replication
  - Randomization
  - Blocking
- Genomics: pooling possible in addition
- Sample size calculation: also possible with blocking, but more complicated

Page 22:

Chapter 10

- Main aim: evaluating the performance of a statistical method
  - Selection of method/model
  - Performance measure for the final method/model
- Two main approaches:
  - Evaluation on test/validation sets: most useful for prediction
  - Bootstrapping: most useful for uncertainty measures on parameter estimates
- No clear difference

Page 23:

Reproducibility and resampling methods

- Discovery-based research can be severely limited by the analysis and study design, or lack thereof, resulting in non-reproducible findings
- The most significant reason for non-reproducibility is overfitting
- Estimating performance on the "training data" severely underestimates the prediction error
- Best: demonstrate reproducible results on a completely independent test set
  - Often not feasible
- "Resampling" methods:
  - Tools for avoiding overfitting
  - Estimation of prediction error / performance assessment
  - Model selection

Page 24:

Prediction error

- Assume we have collected D = {(yi, zi), i = 1,...,n}, zi = (zi1,...,zip)
- Prediction: ŷ(z*) = r(z*; D)
  - Linear regression: ŷ = β̂ᵀz*, β̂ = (ZᵀZ)⁻¹ZᵀY
  - Classification: ŷ = arg maxk f̂k(z*|θ̂k), θ̂k = θ̂k(D)
- Performance measures:
  - Regression: L(y*, ŷ(z*)) = (y* − ŷ(z*))²
  - Classification: L(y*, ŷ(z*)) = I(y* ≠ ŷ(z*))
- Problem: y* is unknown
- Look instead at E[L(y*, ŷ(z*))]
- Random: y* and D (possibly also z*)

Page 25:

Within-training error rate

- Ê[L(y*, ŷ(z*))] = (1/n) Σ_{i=1}^{n} L(yi, ŷ(zi))
- Underestimates E[L(y*, ŷ(z*))]
- Linear regression: worse when p increases!

From the book: Frequently, however, we are faced with inadequate sample sizes to split the data into three sets, and one must use the observed sample for model building, selection, and performance assessment. Again, the determination of how to allocate the data for each step is based on the aforementioned factors. The simplest method for estimating the prediction error rate is with the training set, known as the resubstitution or apparent error. Here all n observations are used for constructing, selecting, and, subsequently, evaluating the prediction error of the rule r_x. For example, with a continuous outcome y and a squared-error loss function, the error rate of r_x is given as n⁻¹ Σ_{i=1}^{n} [yi − r_x(zi)]². From the notation, we can observe that the prediction rule is built on the n observations.

Figure 10.2: Effect of model complexity (as measured by the number of variables in the model) on two different estimates of prediction error.

- Goal: find the model performing best wrt E[L(y*, ŷ(z*))]
- Unbiased estimate: independent test data, typically not available
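The optimism of the resubstitution error is easy to see in simulation. A minimal Python sketch (a toy setup of my own: the "rule" just predicts the training mean of N(0,1) data), averaging over repeated training sets:

```python
import random
import statistics

random.seed(1)

def apparent_and_true(n_train=20, n_fresh=2000):
    """Toy rule 'predict the training mean' for N(0,1) data: return its
    resubstitution (apparent) error and its error on fresh data."""
    train = [random.gauss(0, 1) for _ in range(n_train)]
    fresh = [random.gauss(0, 1) for _ in range(n_fresh)]
    mu = statistics.fmean(train)
    apparent = statistics.fmean([(y - mu) ** 2 for y in train])
    true_err = statistics.fmean([(y - mu) ** 2 for y in fresh])
    return apparent, true_err

pairs = [apparent_and_true() for _ in range(200)]
mean_apparent = statistics.fmean(a for a, _ in pairs)
mean_true = statistics.fmean(t for _, t in pairs)
print(mean_apparent < mean_true)  # True: the apparent error is optimistic
```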

Page 26:

Dividing the data set

Figure 10.3: A large observed sample is split into three sets: training for model building, validation for model selection, and test for assessment of prediction accuracy.

- Requires large datasets; typically not feasible

Page 27:

Cross-validation

- Want the training set as large as possible
- Want many observations to test on
- Cross-validation:
  - Divide the dataset into v partitions P1,...,Pv
  - Select one partition Pv for testing and the v − 1 remaining for training; calculate the performance measure Lv = Σ_{i∈Pv} L(yi, ŷ₋Pv(zi))
  - Repeat for all v; put L = Σv Lv
- Advantage: utilizes the data in an efficient way
- Disadvantage: computationally costly (if v is large)
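The partition/train/test loop above takes only a few lines of code; a generic Python sketch (my own illustration, with a trivial "model" that predicts the training mean):

```python
import random
import statistics

def v_fold_cv(data, v, fit, loss):
    """v-fold CV: partition data into v folds; for each fold, fit on the
    other v-1 folds and accumulate the loss on the held-out fold."""
    data = list(data)
    random.shuffle(data)
    folds = [data[k::v] for k in range(v)]
    total = 0.0
    for k in range(v):
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        rule = fit(train)
        total += sum(loss(y, rule) for y in folds[k])
    return total / len(data)

# Toy use: the 'model' is the training mean, with squared-error loss.
random.seed(0)
data = [random.gauss(5, 2) for _ in range(100)]
cv_err = v_fold_cv(data, v=10, fit=statistics.fmean,
                   loss=lambda y, m: (y - m) ** 2)
print(round(cv_err, 2))  # should be roughly the noise variance, sigma^2 = 4
```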

Page 28:

Example: diffuse large-B-cell lymphoma

- 240 patients, 7399 genes
- y = 0: activated B-cell; y = 1: germinal-center B-cell
- Aim: classify individuals based on genes
- Construction of the classification rule:
  - Feature selection: identify the 10 most differentially expressed genes
  - k nearest neighbors (k-NN) to build the rule based on these 10 genes
  - k specified on a validation set within the training set
- Performance evaluation: the whole construction of the classification rule needs to be replicated for each new training set
  - Using the same 10 genes on all training sets does not take into account the uncertainty in feature selection

Page 29:

Example: diffuse large-B-cell lymphoma

From the book: The results, displayed in Table 10.2, indicate that with a sample size of 240 the prediction error rates are very similar between the variations of v-fold CV, although a reduction in variability is achieved by using more folds. The number of neighbors k selected in two-fold CV is similar to that in a split sample with p = 1/2; additionally, it is smaller than for the other folds of CV as well as repeated CV, for which k ranges from 6 to 7. The estimates for sensitivity, specificity, PPV, and NPV are similar across the different folds of CV, although a reduction in variability is achieved by using repeated CV.

10.6.2.2 Leave-One-Out Cross-Validation. Leave-one-out cross-validation (LOOCV) is the most extreme case of v-fold CV. In this method each observation is individually assigned to the test set, that is, v = n and p = 1/n (Lachenbruch and Mickey, 1968; Geisser, 1975; Stone, 1974, 1977). This method is depicted in Figure 10.7. LOOCV and the corresponding p = 1/n represent the best example of a bias-variance trade-off in that it tends toward a small bias with an elevated variance. The reason for the high variance is the fact that the n test sets are so similar to one another. In model selection, LOOCV has performed poorly compared to ...

Table 10.2: Lymphoma data results using the v-fold CV approach for model selection and performance assessment.

Folds           Prediction error  Sensitivity   Specificity   PPV           NPV           k-value
v = 2           0.157 (0.02)      0.900 (0.02)  0.900 (0.02)  0.893 (0.02)  0.908 (0.02)  5.6 (2.0)
v = 5           0.147 (0.02)      0.884 (0.01)  0.894 (0.01)  0.885 (0.01)  0.894 (0.01)  7.2 (0.9)
v = 10          0.149 (0.01)      0.885 (0.01)  0.898 (0.01)  0.888 (0.01)  0.894 (0.01)  6.8 (0.8)
v = 10, V = 5   0.152 (0.01)      0.884 (0.01)  0.900 (0.01)  0.890 (0.01)  0.894 (0.01)  7.1 (0.5)
v = 10, V = 10  0.153 (0.01)      0.888 (0.00)  0.902 (0.01)  0.893 (0.01)  0.897 (0.00)  7.1 (0.2)

Note: numbers in parentheses are standard deviations.

Figure 10.7: Depiction of training and test set splits for LOOCV.

Page 30:

Versions of CV

- v = n: leave-one-out cross-validation: low bias, large variance
- Generalized cross-validation (p is the number of parameters):

  GCV = (1/n) Σ_{i=1}^{n} [ (yi − ŷ(zi)) / (1 − p/n) ]²

- Monte Carlo cross-validation: replicate the division into v partitions many times

Page 31:

Variability

- Var(θ̂): a measure of variability in repeated experiments
  - Collect data Db = {(yi^b, zi^b), i = 1,...,n}, b = 1,...,B
  - Calculate θ̂b based on Db
  - Look at how θ̂b varies over b
- Problem: not possible to repeat the experiment.

Page 32:

Bootstrapping for quantifying uncertainty

- Data D = {(yi, zi), i = 1,...,n} generated from a distribution G
- Simulation experiments:
  - Generate Db = {(yi^b, zi^b), i = 1,...,n} on the computer
  - Calculate θ̂b based on Db
  - Look at how θ̂b varies over b
- Problem: need to know G, which is unknown
- Bootstrap idea: replace G by an estimate Ĝ
- Non-parametric bootstrapping: P_Ĝ((Y, Z) = (yi, zi)) = 1/n
  - Equal to resampling with replacement
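Resampling with replacement is one line in Python; a sketch of the nonparametric bootstrap for the standard error of a mean (simulated data, my own example):

```python
import random
import statistics

def nonparametric_bootstrap(data, stat, B=1000):
    """Draw B samples of size n with replacement from the observed data and
    return the B bootstrap replicates of the statistic."""
    n = len(data)
    return [stat(random.choices(data, k=n)) for _ in range(B)]

random.seed(42)
data = [random.gauss(10, 3) for _ in range(50)]
reps = nonparametric_bootstrap(data, statistics.fmean, B=1000)

se_boot = statistics.stdev(reps)                # bootstrap SE of the mean
se_theory = statistics.stdev(data) / 50**0.5    # classical s / sqrt(n)
print(round(se_boot, 2), round(se_theory, 2))   # the two agree closely
```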

Page 33:

Example

- Data {(yi, zi), i = 1,...,n}
- Model: yi = α + βI(zi > 0.5) + εi,  εi ~ N(0, σ²)

From the book: We observe a random sample X = {x1,...,xn}, where xi = (yi, zi) contains an outcome yi and a collection of p measured explanatory variables, or features, zi = (zi1,...,zip)′, for the ith subject. For the sake of explanation, we will focus on the observed data shown in the left-most plot of Figure 10.8. As can be seen in the plot, the data contain a single, continuous feature, that is zi = (zi1), and a continuous outcome, yi.

One approach to exploring the relationship between the explanatory variable, z, and the outcome, y, is to fit a simple linear regression model using an indicator function. This model can be written as E(Y | Z = z) = α + βI(z > 0.5), where E(Y | z) is the expected value (i.e., mean) of the outcome y conditional on the feature value of z, I(z > 0.5) is the indicator function which equals 1 when z > 0.5 and 0 otherwise, α is the intercept, and β is the slope. The fit of this model to the data is shown in solid lines in the second plot of Figure 10.8. In this example, the intercept α is the line on the left, which is the expected value of y when z < 0.5, and the line on the right is the expected value of y when z > 0.5, which can be written as α + β.

10.10.1 Nonparametric Bootstrap. In the first bootstrap method, the actual observed data are sampled, as opposed to being derived from a specified parametric model. Due to the model-free nature of this resampling, it is referred to as the nonparametric bootstrap. As such, B samples of size n are drawn with replacement from the original data set, also of size n, where the unit drawn is the observed xi = (yi, zi). For each of the B bootstrap samples, in our example, we fit the simple linear regression model using the indicator function I(z > 0.5), resulting in a slightly different estimate of the slope and intercept for each sample. Ten such estimates derived from B = 10 bootstrap samples are shown in the third plot of Figure 10.8.

As B increases, the bootstrap samples capture the variability of the estimates for α and α + β. For example, if we set B = 400, we can extract a 95% confidence interval for the estimates by selecting the 0.025 and 0.975 quantiles of the bootstrap estimates for each z-value. The 95% confidence band estimated from B = 400 bootstrap samples is illustrated by dashed lines in the fourth plot of Figure 10.8, where the solid lines represent the simple linear regression fit. The difference in variability for the estimation of α compared to α + β is clearly visible from the plot.

Figure 10.8: (a) Plot of observed data. (b) Simple linear regression fit to the data. (c) Linear regression fit to 10 bootstrap samples. (d) Linear regression fit in solid lines with a 95% confidence band based on 400 bootstrap samples in dashed lines.
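The figure's computation can be mimicked in a few lines. A Python sketch of the nonparametric bootstrap for this indicator-function regression (simulated data with made-up values α = 1, β = 2, σ = 0.5; least squares here reduces to the two group means):

```python
import random
import statistics

random.seed(7)
# Simulated data from the model above, with made-up parameter values
# alpha = 1, beta = 2, sigma = 0.5 (illustration only).
n = 100
zs = [random.random() for _ in range(n)]
ys = [1.0 + 2.0 * (z > 0.5) + random.gauss(0, 0.5) for z in zs]
data = list(zip(ys, zs))

def fit(sample):
    """Least squares for E(Y|z) = alpha + beta*I(z > 0.5) reduces to the
    two group means; returns (alpha-hat, beta-hat)."""
    left = [y for y, z in sample if z <= 0.5]
    right = [y for y, z in sample if z > 0.5]
    a = statistics.fmean(left)
    return a, statistics.fmean(right) - a

# Nonparametric bootstrap: resample (y, z) pairs with replacement, refit.
boots = [fit(random.choices(data, k=n)) for _ in range(400)]
betas = sorted(b for _, b in boots)
ci = (betas[9], betas[389])  # 0.025 and 0.975 quantiles of 400 replicates
print(round(ci[0], 2), round(ci[1], 2))  # 95% bootstrap CI for beta
```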

Page 34:

Example - non-parametric bootstrapping

- 10 bootstrap samples and quantiles (see Figure 10.8)

Page 35:

Bootstrapping - alternatives

- Non-parametric bootstrapping:
  - Makes few assumptions on how the data are generated (but does assume that the data are independent!)
  - Difficult to generalize to more complex situations
- Parametric bootstrapping:
  - Assume G = Gθ
  - Linear regression: yi = βᵀzi + σεi, θ = (β, σ)
  - Use the data to estimate θ̂
  - When simulating: simulate from G_θ̂
  - Depends more on the assumptions
  - Easier to generalize
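For contrast with the nonparametric version, a Python sketch of the parametric bootstrap for the regression case on the slide (made-up data; θ̂ = (b̂, σ̂) is estimated once, then new data sets are simulated from the fitted model):

```python
import random
import statistics

random.seed(3)
# Made-up data from y = b*z + sigma*eps with b = 1.5, sigma = 0.4
# (regression through the origin, to keep theta-hat = (b-hat, sigma-hat) small).
n = 60
zs = [random.uniform(0, 1) for _ in range(n)]
ys = [1.5 * z + 0.4 * random.gauss(0, 1) for z in zs]

def fit_b(y, z):
    """Least-squares slope through the origin: b = sum(z*y) / sum(z^2)."""
    return sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * zi for zi in z)

# Step 1: estimate theta-hat from the observed data.
b_hat = fit_b(ys, zs)
resid = [yi - b_hat * zi for yi, zi in zip(ys, zs)]
sigma_hat = statistics.stdev(resid)

# Step 2: simulate new data sets from the fitted model G_theta-hat and refit.
b_reps = [fit_b([b_hat * z + sigma_hat * random.gauss(0, 1) for z in zs], zs)
          for _ in range(500)]
print(round(statistics.stdev(b_reps), 3))  # parametric-bootstrap SE of b-hat
```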

Page 36:

Sources - sample size

[1] Paul L. Auer and R. W. Doerge. Statistical design and analysis of RNA sequencing data. Genetics, 185(2):405-416, 2010.

[2] Yudi Pawitan, Stefan Michiels, Serge Koscielny, Arief Gusnanto, and Alexander Ploner. False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics, 21(13):3017-3024, 2005.