Lecture 10: Experimental designs, resampling techniques
Chapters 9 & 10
Geir Storvik, Matematisk institutt, Universitetet i Oslo
19 March 2014



Page 2:

Content

- Last time:
  - Statistical models and inference (chap 8)
    - Some examples, but very general discussion
    - Many models and methods, many important details
  - Experimental design (chap 9)
    - Randomization, replication, sample size
- Today:
  - Continue on experimental design (chap 9)
    - Sample size, pooling, blocking
  - Evaluation of performance (chap 10)
    - Training, validation, test
    - Bootstrapping
- Main aim: Understand the general ideas!

Page 3:

Sample size

Last time:

- General rule: need about 5-10 observations for each parameter
- More specific: power/performance calculation
- Example: X1,...,Xn ~ N(µC, σ) independent, Y1,...,Yn ~ N(µT, σ) independent
  - 95% confidence interval for ∆ = µC − µT: x̄ − ȳ ± 1.96 σ √(2/n)
  - Want the width to be δ, implying 2 · 1.96 σ √(2/n) = δ ⇒ n = 8 · 1.96² σ² / δ²
  - Testing H0: ∆ = 0: reject H0 if |x̄ − ȳ| / (σ √(2/n)) > 1.96
  - Want Pr(reject H0 | ∆ = 1) = 0.8, implying Pr(|x̄ − ȳ| / (σ √(2/n)) > 1.96 | ∆ = 1) = 0.8 ⇒ n = 25.98 σ²
- Need preliminary experiments to get a number for σ.
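The two calculations above are easy to sanity-check numerically. A minimal Python sketch (not from the lecture; the normal quantiles 1.96 and 0.8416 for level 0.05 and power 0.8 are hard-coded constants):

```python
import math

def n_for_ci_width(sigma, delta, z=1.96):
    """Per-group n so the 95% CI x-bar - y-bar +/- z*sigma*sqrt(2/n)
    has total width delta: 2*z*sigma*sqrt(2/n) = delta
    => n = 8 * z^2 * sigma^2 / delta^2."""
    return math.ceil(8 * z**2 * sigma**2 / delta**2)

def n_for_power(sigma, effect, z_alpha=1.96, z_beta=0.8416):
    """Standard two-sample normal approximation:
    n = 2 * (z_alpha + z_beta)^2 * sigma^2 / effect^2 per group."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma**2 / effect**2)

print(n_for_ci_width(sigma=1.0, delta=1.0))  # 31
print(n_for_power(sigma=1.0, effect=1.0))    # 16
```

Note that `n_for_power` uses the generic textbook approximation, not the slide's constant 25.98 σ².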

Page 4:

Testing multiple genes

- Assume now Xi,j ~ N(µC,i, σi), Yi,j ~ N(µT,i, σi)
- Want to test Hi: ∆i = µC,i − µT,i = 0, i = 1,...,G
  Reject Hi if |x̄i − ȳi| / (σi √(2/n)) > zαG
- How to specify zαG? How to specify sample size?
- Lecture 8:
  - Family-wise error rate: Pr(FP ≥ 1) < α
  - False discovery rate: E(FDR) = E(# false positive predictions / # total positive predictions)
  - Control E(FDR) at the desired significance level [2]

Page 5:

Family-wise error rate

- Specify/estimate σi for all genes
- Bonferroni correction: αG = α/G
- Test procedure: reject those Hi for which |x̄i − ȳi| / (σi √(2/n)) > zαG
- Example: assume α = 0.05. For G = 1: zαG = 1.96; for G = 12625: zαG = 4.61
- Ensures Pr(any type I error) ≤ α
- Sample size:
  - Compute the sample size separately for each gene (as before, but using αG)
  - Summarize using appropriate plots
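Bonferroni cutoffs like the zαG values above come straight from the standard normal quantile function; a small Python illustration (not from the lecture):

```python
from statistics import NormalDist

def bonferroni_z(alpha, G):
    """Two-sided critical value for the per-test level alpha_G = alpha / G:
    the z with 2 * (1 - Phi(z)) = alpha / G."""
    return NormalDist().inv_cdf(1 - alpha / (2 * G))

print(round(bonferroni_z(0.05, 1), 2))      # 1.96
print(round(bonferroni_z(0.05, 12625), 2))  # 4.61
```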

Page 6:

Example: baseline variability for a gene expression experiment

Vector of 12,625 standard deviations of gene expression data normalized via the RMA method (i.e., on log2 scale), with names from Affymetrix probe set IDs.

library(ssize)
data(exp.sd)
hist(exp.sd, n=20)

Page 7:

ssize library

fold.change = 2.0; sig.level = 0.05; power = 0.8
all.size <- ssize(sd=exp.sd, delta=log2(fold.change),
                  sig.level=sig.level, power=power)
max(all.size)
# 559.0834
plot(exp.sd, sqrt(all.size))
ssize.plot(all.size, lwd=2, col="magenta", xlim=c(1,20))

Page 8:

False discovery rate (FDR)

                 Test result
True       non-DE     DE    Total
non-DE     A=9025   B=475    9500
DE          C=100   D=400     500
Total        9125     875   10000

- False positive rate (type I error): FPR = B/(A+B) = 5%
- False negative rate (type II error): FNR = C/(C+D) = 20%
- Sensitivity: D/(C+D) = 80%
- False discovery rate: FDR = B/(B+D) = 54%
- False non-discovery rate: FNDR = C/(A+C) = 1%
- Problem: high proportion of non-DE genes
- Can control FDR by reducing FPR (adjusting α to αG); an indirect method
- Better to consider FDR and FNDR directly?
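The rates follow directly from the four counts in the table; a short Python check (counts taken from the table above):

```python
# Confusion-matrix counts from the table: rows = truth, columns = test result.
A, B, C, D = 9025, 475, 100, 400

fpr  = B / (A + B)   # false positive rate (type I error)
fnr  = C / (C + D)   # false negative rate (type II error)
sens = D / (C + D)   # sensitivity
fdr  = B / (B + D)   # false discovery rate
fndr = C / (A + C)   # false non-discovery rate

print(f"FPR={fpr:.0%} FNR={fnr:.0%} sens={sens:.0%} FDR={fdr:.0%} FNDR={fndr:.0%}")
# FPR=5% FNR=20% sens=80% FDR=54% FNDR=1%
```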

Page 9:

FDR cont.

- Consider two populations, σi = σ for all i
- General approach for identifying DE genes:
  1. Compute the test statistic for all genes
  2. Sort the statistics by order
  3. Determine a cutoff point beyond which all genes are assumed to be DE
- Bonferroni: cutoff defined through control of the type I error
- FDR: cutoff defined through control of the false discovery rate
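The sort-and-cutoff recipe is exactly how FDR-controlling procedures work in practice. As an illustration (this is the Benjamini-Hochberg step-up rule; the slide itself does not name a specific procedure), a small Python sketch:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up rule: sort the p-values, find the largest
    rank k with p_(k) <= k*q/G, and reject the k smallest. Returns the
    indices (into pvals) of the rejected hypotheses."""
    G = len(pvals)
    order = sorted(range(G), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / G:
            k = rank
    return sorted(order[:k])

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27, 0.60]))  # [0, 1]
```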

Page 10:

OCplus library

- Input:
  - p0: proportion of non-differentially expressed genes
  - σ: standard deviation of the log expression values
  - D: assumed average log fold change
  - n: sample size
- Output: FDR

Page 11:

OCplus library - samplesize

library(OCplus)
fold.change = 2
res = samplesize(n=seq(5,40,by=5), p0=0.95, D=log2(fold.change),
                 crit=0.01, sigma=1)
round(res, 4)

#    FDR_0.01 fdr_0.01
# 5    0.6383   0.6993
# 10   0.2525   0.3800
# 15   0.0729   0.1467
# 20   0.0179   0.0431
# 25   0.0043   0.0114
# 30   0.0010   0.0030
# 35   0.0003   0.0008
# 40   0.0001   0.0002

Page 12:

OCplus library - cont.

library(OCplus)
par(mfrow=c(1,2))
TOC(p0=c(0.9,0.95,0.99), D=1, n=5)
legend(4, 0.3, c("FDR","alpha","sensitivity"), lty=1:3)
TOC(p0=c(0.9,0.95,0.99), D=1, n=30)
legend(3.5, 1, c("FDR","alpha","sensitivity"), lty=1:3)

Page 13:

Sample size and classification

- More complicated: depends on the number of genes and the differences between classes
- General rule: need 20-30 samples per class
- http://linus.nci.nih.gov/brb/samplesize/samplesize4GE.html

Page 14:

Pooling

- Comparing two groups: T = |x̄ − ȳ| / (σ √(2/n))
- Sufficient statistics: x̄ and ȳ
- Possible to observe x̄ and ȳ directly?
- Genomics data: yes, by pooling individuals!
- In addition to reducing the number of samples, pooling can reduce variability!

Page 15:

Pooling

Page 16:

Pooling - possible problems

- Uncertainty due to different amounts from each individual
- Pooling at the original scale, average calculated on the log2 scale

When beneficial?

- When biological variability is larger than "technical" variability
- When only a small number of samples is possible
- When individuals are difficult to obtain (small organisms)

Page 17:

Blocking

- Assume it is of interest to compare two treatments of a disease
- Assume all individuals given treatment 1 are older than 50
- Assume all individuals given treatment 2 are younger than 50
- A significant difference between the groups is found
- Question: is the difference due to treatment or to age?
- Problem: bad design!!!
  - Note: this could happen using randomization
- Better: ensure that both age groups are similar in both treatment groups

Page 18:

Blocking

- Aim: reduce variability
- Approach: group units into homogeneous groups, e.g. by sex, size, age, etc.
- Comparison between populations within groups
- ANOVA-type analysis for simultaneous comparison
- Book: many special designs for microarrays (sec 9.4.2, 9.5, 9.6); not that applicable for new technologies
- General blocking strategies still useful!!
- Possibilities in combining blocking and pooling [1]

Page 19:

Complete block design

- Assume B blocks, P populations
- Assume B individuals from each population are available
- Allocate 1 individual from each population within each block
- Allocate by randomization!!!

Page 20:

Incomplete block design

- Assume B blocks, P populations
- Assume n < B individuals from each population are available
- Balance:
  - Allocate an equal number of individuals to each block
  - Allocate such that pairwise comparisons of populations are possible in an equal number of blocks
- Allocate by randomization!!!
- Example: B = 10, P = 6, n = 5
  B1(1,2,5), B2(1,2,6), B3(1,3,4), B4(1,3,6), B5(1,4,5),
  B6(2,3,4), B7(2,3,5), B8(2,4,6), B9(3,5,6), B10(4,5,6)
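The balance properties of the example design can be verified mechanically; a short Python check of the block layout above:

```python
from itertools import combinations
from collections import Counter

# The incomplete block design from the slide: 10 blocks, P = 6 populations,
# 3 populations per block, each population replicated in n = 5 blocks.
blocks = [(1,2,5), (1,2,6), (1,3,4), (1,3,6), (1,4,5),
          (2,3,4), (2,3,5), (2,4,6), (3,5,6), (4,5,6)]

replicates = Counter(p for b in blocks for p in b)
pair_counts = Counter(pair for b in blocks for pair in combinations(sorted(b), 2))

print(set(replicates.values()))   # {5}: every population appears in 5 blocks
print(set(pair_counts.values()))  # {2}: every pair meets in exactly 2 blocks
```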

Page 21:

Summary

- Three main design issues:
  - Replication
  - Randomization
  - Blocking
- Genomics: pooling possible in addition
- Sample size calculation: also possible with blocking, but more complicated

Page 22:

Chapter 10

- Main aim: evaluating the performance of a statistical method
  - Selection of method/model
  - Performance measure for the final method/model
- Two main approaches:
  - Evaluation on test/validation sets: most useful for prediction
  - Bootstrapping: most useful for uncertainty measures on parameter estimates
- No clear difference

Page 23:

Reproducibility and resampling methods

- Discovery-based research can be severely limited by the analysis and study design, or lack thereof, resulting in non-reproducible findings
- The most significant reason for non-reproducibility is overfitting
- Estimating performance on the "training data" severely underestimates the prediction error
- Best: demonstrate reproducible results on a completely independent test set
  - Often not feasible
- "Resampling" methods:
  - Tools for avoiding overfitting
  - Estimation of prediction error / performance assessment
  - Model selection

Page 24:

Prediction error

- Assume we have collected D = {(yi, zi), i = 1,...,n}, zi = (zi1,...,zip)
- Prediction: ŷ(z*) = r(z*; D)
  - Linear regression: ŷ = β̂ᵀz*, β̂ = (ZᵀZ)⁻¹ZᵀY
  - Classification: ŷ = arg maxk f̂k(z*|θ̂k), θ̂k = θ̂k(D)
- Performance measures:
  - Regression: L(y*, ŷ(z*)) = (y* − ŷ(z*))²
  - Classification: L(y*, ŷ(z*)) = I(y* ≠ ŷ(z*))
- Problem: y* is unknown
- Look instead at E[L(y*, ŷ(z*))]
- Random: y* and D (possibly also z*)

Page 25:

Within-training error rate

- Ê[L(y*, ŷ(z*))] = (1/n) Σ_{i=1}^{n} L(yi, ŷ(zi))
- Underestimates E[L(y*, ŷ(z*))]
- Linear regression: worse when p increases!

From the book: Frequently, however, we are faced with inadequate sample sizes to split the data into three sets, and one must use the observed sample for model building, selection, and performance assessment. Again, the determination of how to allocate the data for each step is based on the aforementioned factors. The simplest method for estimating the prediction error rate is with the training set, known as the resubstitution or apparent error. Here all n observations are used for constructing, selecting, and, subsequently, evaluating the prediction error of the rule r_x. For example, with a continuous outcome y and a squared-error loss function, the error rate of r_x is given as n⁻¹ Σ_{i=1}^{n} [yi − r_x(zi)]². From the notation, we can observe that the prediction rule is built on the n observations.

Figure 10.2: Effect of model complexity (as measured by the number of variables in the model) on two different estimates of prediction error.

- Goal: find the model performing best wrt E[L(y*, ŷ(z*))]
- Unbiased estimate: independent test data, typically not available
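The optimism of the resubstitution error is easy to see in simulation. A minimal Python sketch (a toy setup of my own: the "rule" just predicts the training mean of N(0,1) data), averaging over repeated training sets:

```python
import random
import statistics

random.seed(1)

def apparent_and_true(n_train=20, n_fresh=2000):
    """Toy rule 'predict the training mean' for N(0,1) data: return its
    resubstitution (apparent) error and its error on fresh data."""
    train = [random.gauss(0, 1) for _ in range(n_train)]
    fresh = [random.gauss(0, 1) for _ in range(n_fresh)]
    mu = statistics.fmean(train)
    apparent = statistics.fmean([(y - mu) ** 2 for y in train])
    true_err = statistics.fmean([(y - mu) ** 2 for y in fresh])
    return apparent, true_err

pairs = [apparent_and_true() for _ in range(200)]
mean_apparent = statistics.fmean(a for a, _ in pairs)
mean_true = statistics.fmean(t for _, t in pairs)
print(mean_apparent < mean_true)  # True: the apparent error is optimistic
```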

Page 26:

Dividing the data set

Figure 10.3: A large observed sample is split into three sets: training for model building, validation for model selection, and test for assessment of prediction accuracy.

- Requires large datasets; typically not feasible

Page 27:

Cross-validation

- Want the training set as large as possible
- Want many observations to test on
- Cross-validation:
  - Divide the dataset into v partitions P1,...,Pv
  - Select one partition Pv for testing and the v − 1 remaining for training; calculate the performance measure Lv = Σ_{i∈Pv} L(yi, ŷ₋Pv(zi))
  - Repeat for all v; put L = Σv Lv
- Advantage: utilizes the data in an efficient way
- Disadvantage: computationally costly (if v is large)
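The partition/train/test loop above takes only a few lines of code; a generic Python sketch (my own illustration, with a trivial "model" that predicts the training mean):

```python
import random
import statistics

def v_fold_cv(data, v, fit, loss):
    """v-fold CV: partition data into v folds; for each fold, fit on the
    other v-1 folds and accumulate the loss on the held-out fold."""
    data = list(data)
    random.shuffle(data)
    folds = [data[k::v] for k in range(v)]
    total = 0.0
    for k in range(v):
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        rule = fit(train)
        total += sum(loss(y, rule) for y in folds[k])
    return total / len(data)

# Toy use: the 'model' is the training mean, with squared-error loss.
random.seed(0)
data = [random.gauss(5, 2) for _ in range(100)]
cv_err = v_fold_cv(data, v=10, fit=statistics.fmean,
                   loss=lambda y, m: (y - m) ** 2)
print(round(cv_err, 2))  # should be roughly the noise variance, sigma^2 = 4
```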

Page 28:

Example: diffuse large-B-cell lymphoma

- 240 patients, 7399 genes
- y = 0: activated B-cell; y = 1: germinal-center B-cell
- Aim: classify individuals based on genes
- Construction of the classification rule:
  - Feature selection: identify the 10 most differentially expressed genes
  - k nearest neighbors (k-NN) to build the rule based on these 10 genes
  - k specified on a validation set within the training set
- Performance evaluation: the whole construction of the classification rule needs to be replicated for each new training set
  - Using the same 10 genes on all training sets does not take into account the uncertainty in feature selection

Page 29:

Example: diffuse large-B-cell lymphoma

From the book: The results, displayed in Table 10.2, indicate that with a sample size of 240 the prediction error rates are very similar between the variations of v-fold CV, although a reduction in variability is achieved by using more folds. The number of neighbors k selected in two-fold CV is similar to that in a split sample with p = 1/2; additionally, it is smaller than for the other folds of CV as well as repeated CV, for which k ranges from 6 to 7. The estimates for sensitivity, specificity, PPV, and NPV are similar across the different folds of CV, although a reduction in variability is achieved by using repeated CV.

10.6.2.2 Leave-One-Out Cross-Validation. Leave-one-out cross-validation (LOOCV) is the most extreme case of v-fold CV. In this method each observation is individually assigned to the test set, that is, v = n and p = 1/n (Lachenbruch and Mickey, 1968; Geisser, 1975; Stone, 1974, 1977). This method is depicted in Figure 10.7. LOOCV and the corresponding p = 1/n represent the best example of a bias-variance trade-off in that it tends toward a small bias with an elevated variance. The reason for the high variance is the fact that the n test sets are so similar to one another. In model selection, LOOCV has performed poorly compared to ...

Table 10.2: Lymphoma data results using the v-fold CV approach for model selection and performance assessment.

Folds           Prediction error  Sensitivity   Specificity   PPV           NPV           k-value
v = 2           0.157 (0.02)      0.900 (0.02)  0.900 (0.02)  0.893 (0.02)  0.908 (0.02)  5.6 (2.0)
v = 5           0.147 (0.02)      0.884 (0.01)  0.894 (0.01)  0.885 (0.01)  0.894 (0.01)  7.2 (0.9)
v = 10          0.149 (0.01)      0.885 (0.01)  0.898 (0.01)  0.888 (0.01)  0.894 (0.01)  6.8 (0.8)
v = 10, V = 5   0.152 (0.01)      0.884 (0.01)  0.900 (0.01)  0.890 (0.01)  0.894 (0.01)  7.1 (0.5)
v = 10, V = 10  0.153 (0.01)      0.888 (0.00)  0.902 (0.01)  0.893 (0.01)  0.897 (0.00)  7.1 (0.2)

Note: numbers in parentheses are standard deviations.

Figure 10.7: Depiction of training and test set splits for LOOCV.

Page 30:

Versions of CV

- v = n: leave-one-out cross-validation: low bias, large variance
- Generalized cross-validation (p is the number of parameters):

  GCV = (1/n) Σ_{i=1}^{n} [ (yi − ŷ(zi)) / (1 − p/n) ]²

- Monte Carlo cross-validation: replicate the division into v partitions many times

Page 31:

Variability

- Var(θ̂): a measure of variability in repeated experiments
  - Collect data Db = {(yi^b, zi^b), i = 1,...,n}, b = 1,...,B
  - Calculate θ̂b based on Db
  - Look at how θ̂b varies over b
- Problem: not possible to repeat the experiment.

Page 32:

Bootstrapping for quantifying uncertainty

- Data D = {(yi, zi), i = 1,...,n} generated from a distribution G
- Simulation experiments:
  - Generate Db = {(yi^b, zi^b), i = 1,...,n} on the computer
  - Calculate θ̂b based on Db
  - Look at how θ̂b varies over b
- Problem: need to know G, which is unknown
- Bootstrap idea: replace G by an estimate Ĝ
- Non-parametric bootstrapping: P_Ĝ((Y, Z) = (yi, zi)) = 1/n
  - Equal to resampling with replacement
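Resampling with replacement is one line in Python; a sketch of the nonparametric bootstrap for the standard error of a mean (simulated data, my own example):

```python
import random
import statistics

def nonparametric_bootstrap(data, stat, B=1000):
    """Draw B samples of size n with replacement from the observed data and
    return the B bootstrap replicates of the statistic."""
    n = len(data)
    return [stat(random.choices(data, k=n)) for _ in range(B)]

random.seed(42)
data = [random.gauss(10, 3) for _ in range(50)]
reps = nonparametric_bootstrap(data, statistics.fmean, B=1000)

se_boot = statistics.stdev(reps)                # bootstrap SE of the mean
se_theory = statistics.stdev(data) / 50**0.5    # classical s / sqrt(n)
print(round(se_boot, 2), round(se_theory, 2))   # the two agree closely
```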

Page 33:

Example

- Data {(yi, zi), i = 1,...,n}
- Model: yi = α + βI(zi > 0.5) + εi,  εi ~ N(0, σ²)

From the book: We observe a random sample X = {x1,...,xn}, where xi = (yi, zi) contains an outcome yi and a collection of p measured explanatory variables, or features, zi = (zi1,...,zip)′, for the ith subject. For the sake of explanation, we will focus on the observed data shown in the left-most plot of Figure 10.8. As can be seen in the plot, the data contain a single, continuous feature, that is zi = (zi1), and a continuous outcome, yi.

One approach to exploring the relationship between the explanatory variable, z, and the outcome, y, is to fit a simple linear regression model using an indicator function. This model can be written as E(Y | Z = z) = α + βI(z > 0.5), where E(Y | z) is the expected value (i.e., mean) of the outcome y conditional on the feature value of z, I(z > 0.5) is the indicator function which equals 1 when z > 0.5 and 0 otherwise, α is the intercept, and β is the slope. The fit of this model to the data is shown in solid lines in the second plot of Figure 10.8. In this example, the intercept α is the line on the left, which is the expected value of y when z < 0.5, and the line on the right is the expected value of y when z > 0.5, which can be written as α + β.

10.10.1 Nonparametric Bootstrap. In the first bootstrap method, the actual observed data are sampled, as opposed to being derived from a specified parametric model. Due to the model-free nature of this resampling, it is referred to as the nonparametric bootstrap. As such, B samples of size n are drawn with replacement from the original data set, also of size n, where the unit drawn is the observed xi = (yi, zi). For each of the B bootstrap samples, in our example, we fit the simple linear regression model using the indicator function I(z > 0.5), resulting in a slightly different estimate of the slope and intercept for each sample. Ten such estimates derived from B = 10 bootstrap samples are shown in the third plot of Figure 10.8.

As B increases, the bootstrap samples capture the variability of the estimates for α and α + β. For example, if we set B = 400, we can extract a 95% confidence interval for the estimates by selecting the 0.025 and 0.975 quantiles of the bootstrap estimates for each z-value. The 95% confidence band estimated from B = 400 bootstrap samples is illustrated by dashed lines in the fourth plot of Figure 10.8, where the solid lines represent the simple linear regression fit. The difference in variability for the estimation of α compared to α + β is clearly visible from the plot.

Figure 10.8: (a) Plot of observed data. (b) Simple linear regression fit to the data. (c) Linear regression fit to 10 bootstrap samples. (d) Linear regression fit in solid lines with a 95% confidence band based on 400 bootstrap samples in dashed lines.
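The figure's computation can be mimicked in a few lines. A Python sketch of the nonparametric bootstrap for this indicator-function regression (simulated data with made-up values α = 1, β = 2, σ = 0.5; least squares here reduces to the two group means):

```python
import random
import statistics

random.seed(7)
# Simulated data from the model above, with made-up parameter values
# alpha = 1, beta = 2, sigma = 0.5 (illustration only).
n = 100
zs = [random.random() for _ in range(n)]
ys = [1.0 + 2.0 * (z > 0.5) + random.gauss(0, 0.5) for z in zs]
data = list(zip(ys, zs))

def fit(sample):
    """Least squares for E(Y|z) = alpha + beta*I(z > 0.5) reduces to the
    two group means; returns (alpha-hat, beta-hat)."""
    left = [y for y, z in sample if z <= 0.5]
    right = [y for y, z in sample if z > 0.5]
    a = statistics.fmean(left)
    return a, statistics.fmean(right) - a

# Nonparametric bootstrap: resample (y, z) pairs with replacement, refit.
boots = [fit(random.choices(data, k=n)) for _ in range(400)]
betas = sorted(b for _, b in boots)
ci = (betas[9], betas[389])  # 0.025 and 0.975 quantiles of 400 replicates
print(round(ci[0], 2), round(ci[1], 2))  # 95% bootstrap CI for beta
```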

Page 34:

Example - non-parametric bootstrapping

- 10 bootstrap samples and quantiles (see Figure 10.8)

Page 35:

Bootstrapping - alternatives

- Non-parametric bootstrapping:
  - Makes few assumptions on how the data are generated (but does assume that the data are independent!)
  - Difficult to generalize to more complex situations
- Parametric bootstrapping:
  - Assume G = Gθ
  - Linear regression: yi = βᵀzi + σεi, θ = (β, σ)
  - Use the data to estimate θ̂
  - When simulating: simulate from G_θ̂
  - Depends more on the assumptions
  - Easier to generalize
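For contrast with the nonparametric version, a Python sketch of the parametric bootstrap for the regression case on the slide (made-up data; θ̂ = (b̂, σ̂) is estimated once, then new data sets are simulated from the fitted model):

```python
import random
import statistics

random.seed(3)
# Made-up data from y = b*z + sigma*eps with b = 1.5, sigma = 0.4
# (regression through the origin, to keep theta-hat = (b-hat, sigma-hat) small).
n = 60
zs = [random.uniform(0, 1) for _ in range(n)]
ys = [1.5 * z + 0.4 * random.gauss(0, 1) for z in zs]

def fit_b(y, z):
    """Least-squares slope through the origin: b = sum(z*y) / sum(z^2)."""
    return sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * zi for zi in z)

# Step 1: estimate theta-hat from the observed data.
b_hat = fit_b(ys, zs)
resid = [yi - b_hat * zi for yi, zi in zip(ys, zs)]
sigma_hat = statistics.stdev(resid)

# Step 2: simulate new data sets from the fitted model G_theta-hat and refit.
b_reps = [fit_b([b_hat * z + sigma_hat * random.gauss(0, 1) for z in zs], zs)
          for _ in range(500)]
print(round(statistics.stdev(b_reps), 3))  # parametric-bootstrap SE of b-hat
```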

Page 36:

Sources - sample size

[1] Paul L. Auer and R. W. Doerge. Statistical design and analysis of RNA sequencing data. Genetics, 185(2):405-416, 2010.

[2] Yudi Pawitan, Stefan Michiels, Serge Koscielny, Arief Gusnanto, and Alexander Ploner. False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics, 21(13):3017-3024, 2005.