advanced statistical topics a · 11/4/2019 · advanced statistical topics a day 1 - high...

Advanced Statistical Topics AAdvanced Statistical Topics A

Day 1 - High dimensional dataDay 1 - High dimensional data

Claus Thorn EkstrømClaus Thorn EkstrømUCPH BiostatisticsUCPH [email protected]@sund.ku.dk

November 4th, 2019November 4th, 2019Slides @ Slides @ course websitecourse website

11

mailto:[email protected]

https://biostatistics.dk/teaching/advtopicsA/

Practicalities

Every day from 8.15 to 15 in room 2.1.12Internet access: eduroam or KU-guestRBreaks as we go alongPlease read before classes

2

1. High-dimensional data,penalized regression,bootstrap, cross-validation

2. Multiple imputationtechniques

3. Classification/regression tress,random forests

4. Dimension reduction,randomization tests

web.stanford.edu/~hastie/CASI/

Course overview

3

https://web.stanford.edu/~hastie/CASI/

Quick refresher on statistical modelsQuick refresher on statistical models

44

Multiple regression / linear models refresher

Model as a function of predictors . For individual

or in matrix notation

�e overall trend or mean e�fect is

y x1, … , xP i

yi = β0 + β1x1i + β2x2i + ⋯ + βP xPi + ϵi

Y = Xβ + ϵ

E(Y ) = Xβ

5

Interpretation of parameters

Each coe�ficient, , expresses the expected change in the response when changes one unit and the remaining explanatory variablesare held fixed.Might be that we have a significant e�fect of in the univariatemodel, while its e�fect in the mult. lin. model is non-significant. Itse�fect is explained by the other explanatory variables.

βj y

xj

x1

6

Generalized linear models

Extends the linear more to other outcomes and distributions. Still focuson the mean e�fect but for the transformed data.

or equivalently

where the link function in principle could be any function that mapsthe population mean into the linear predictor.

g(E[Yi]) = β0 + β1X1i + ⋯ + βP XPi

E[Yi] = g−1(β0 + β1X1i + ⋯ + βP XPi)

g

7

Example: FEV in children

E�fect of smoking on the pulmonary function of children. . Infoon age, height, sex, smoking, and fev (forced expiratory volume).

N = 654

8

library("broom")res <- lm(FEV ~ Age + Ht + I(Ht^2) + Gender + Smoke, data=fev)tidy(res)

# A tibble: 6 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl>1 (Intercept) 6.83 1.50 4.54 6.68e- 62 Age 0.0700 0.00913 7.67 6.31e-143 Ht -10.7 1.96 -5.46 6.85e- 84 I(Ht^2) 4.81 0.635 7.58 1.21e-135 GenderBoy 0.0939 0.0330 2.85 4.51e- 36 SmokeYes -0.136 0.0573 -2.38 1.76e- 2

Each unit change in an produces an average e�fect in .x y

9

Example: Who survived the Titanic disaster?

individuals on the Titanic. Who survived? Info on Class (1st,2nd, 3rd, Crew), Sex, Age group, and survival status.

and then

N = 2201

log( ) = Xβ = β0 + β1x1 + ⋯ + β5x5P(Y = 1)

1 − P(Y = 1)

P(Y = 1) =exp(β0 + β1x1 + ⋯ + β5x5)

1 + exp(β0 + β1x1 + ⋯ + β5x5)

10

res <- glm(Survived ~ Class + Sex + Age, data=titanic, family=binomial)tidy(res)

# A tibble: 6 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl>1 (Intercept) 0.685 0.273 2.51 1.21e- 22 Class2nd -1.02 0.196 -5.19 2.05e- 73 Class3rd -1.78 0.172 -10.4 3.69e-254 ClassCrew -0.858 0.157 -5.45 5.00e- 85 SexFemale 2.42 0.140 17.2 1.43e-666 AgeAdult -1.06 0.244 -4.35 1.36e- 5

Estimates are log odds ratios. Use exp() to transform back to oddsratios.

11

The joy of big data

12

Today's topicsToday's topics

BootstrapBootstrapCross validationCross validationPenalized regressionPenalized regression

Handle complicated modeling aspects with many potential predictors.Handle complicated modeling aspects with many potential predictors.

1414

Evaluating complex methods and data

When we have complex data (og perhaps maybe just big data combinedwith simple methods) and non-parametric methods then we still with toevaluate them.

How stable are the results?

15

The bootstrap/jackknife procedures

Whenever we provide an estimate (mean, proportion, ...) we also want toinfer its precision!

We may or may not be able to formulate a full parametric (or semi-parametric model).

�e bootstrap procedure allows us to estimate the standard error even incomplicated situations or for non-standard statistics.

16

Statistics 101: populations and sample

17

Statistics 101: multiple samples

Di�ferent samples will result in di�ferent outcomes.

If we had the means to produce several samples we would know thesampling distribution.

18

Sample variation

19

Sample variation

20

Sample variation

21

Sample variation

22

Resampling

Have observations and an estimate

for some estimation algorithm.

Want the SE of .

xi ∼ F

θ = s(x)

θ

23

The jackknife estimate

Estimate times - once with each observation removed.

�en the jackknife estimator is

Reduces to the standard error if is a sample average.

θ N

θ (−i) = s(x(−i))

SE(θ) =

⎷

N

∑i=1

(θ (−i) − ^θ (−))2N − 1

N

θ

24


Non-parametric, no assumptions on , samples of size In general: the jackknife standard error is upwardly biased.Only to be used with smooth, di�ferentiable statistic

x <-c(8.26, 6.33, 10.4, 5.27, 5.35, 5.61, 6.12, 6.19, 5.2, 7.01, 8.74, 7.78, 7.02, 6, 6.5, 5.8, 5.12, 7.41, 6.52, 6.21, 12.28, 5.6, 5.38, 6.6, 8.74)CV <- function(x) sqrt(var(x))/mean(x)CV(x)

[1] 0.2524712

library("bootstrap") ; res <- jackknife(x, CV)

F N − 1

25


res

$jack.se[1] 0.05389943

$jack.bias[1] -0.009266436

$jack.values [1] 0.2563873 0.2565586 0.2384298 0.2507329 0.2513200 [6] 0.2530603 0.2557374 0.2560293 0.2501992 0.2580969[11] 0.2541045 0.2577524 0.2581067 0.2551946 0.2571038[16] 0.2541711 0.2495662 0.2581975 0.2571609 0.2561093[21] 0.2020978 0.2529980 0.2515338 0.2573745 0.2541045

$calljackknife(x = x, theta = CV)

26

Estimating bias

When is unbiased then

But if the procedure has bias

then we can estimate the size of the bias from jackknife results.

θ

E(^θ) =N

∑i=1

E(θ (−i)) = θ1

N

E(θ) = θ + + + resta

N

b

N 2

27

Estimating bias

�us,

Furthermore, the bias-corrected jackknife estimate,

is an unbiased estimate of up to second order.

E(^θ − θ) = + resta

N(N − 1)

biasjack = (N − 1)(^θ − θ) = .a

N

θ jack = θ − biasjack

θ 28

Nonparametric bootstrap

Use the sample as the population and generate "fake samples"

Population -> Sample -> "Fake sample"

29

Nonparametric bootstrap

Get a random bootstrap sample from the sample with replacement

�en we can get

Do that times and get information about the full distribution.

x∗ = (x∗1, x∗

2, … , x∗N

)

θ∗

= s(x∗)

B

30

Jackknife vs bootstrap

Jackknife provides stable results (will always get the same result)whereas bootstrap varies.Jackknife only estimates the variance of the point estimator whereasthe bootstrap provides information on the distribution.

SE = √∑(θ∗b

− θ∗−

)2

B − 1

31

Nonparametric bootstrap in R

results <- bootstrap(x, 200, CV)

32

Parametric bootstrap

�e nonparametric bootstrap made no assumptions about thedistribution.

If distribution is known then use that information.

Fit model to dataDraw samples of random numbers from the fitted modeUse those for bootstrap

Useful for small sample sizes (assuming the model holds), di�ficultevaluations, ...

B

33

Parametric bootstrap - resample residuals

Fit model to data, keep predictions and compute a vector ofresiduals, .Create new sets of observations using a randomresidual.Refit the model using the new set of response variables, and computethe statisticDo that times

y i

ϵ i = yi − y i

y∗ = y i + ϵ j

B

34

Bootstrap con�dence intervals

Standard 95% confidence intervals

Could get that directly from the bootstrap results.

mean(results$thetastar) + c(-1.96, 1.96)*sd(results$thetastar)

[1] 0.1497948 0.3262128

Only really works if the distribution is symmetric

θ ± 1.96SE

35

Bootstrap percentile con�dence intervals

Generate the "full" distribution. Cut o�f 2.5% at each end. Use animprovement that depends on the precision of the percentiles.

bcanon(x, 2000, CV)

$confpoints alpha bca point[1,] 0.025 0.1833005[2,] 0.050 0.1957387[3,] 0.100 0.2111287[4,] 0.160 0.2241942[5,] 0.840 0.3100553[6,] 0.900 0.3267094[7,] 0.950 0.3398162[8,] 0.975 0.3699479

$z036

Bootstrap percentile con�dence intervals

Can also use the distribution

boott(x, CV)

$confpoints 0.001 0.01 0.025 0.05 0.1[1,] 0.1103237 0.1438075 0.1640236 0.1723388 0.189396 0.5 0.9 0.95 0.975 0.99[1,] 0.2633288 0.4086446 0.4722277 0.54459 0.6026483 0.999[1,] 0.6582787

$thetaNULL

$gNULL

t

37

Cross-validation

Evaluating a model on the same data results in overfitting

�e production of an analysis that corresponds too closely or exactly to aparticular set of data, and may therefore fail to fit additional data orpredict future observations reliably

38

Prediction error

Let be the response and the prediction based on a model andpredictors .

Two common choices for quantifying prediction error (should useproper scoring rule)

yi y i

xi

D(y, y) = (y − y)2

D(y, y) = {1 if y ≠ y

0 if y = y

40

Error rates

True error rate is (for new observation and fixed prediction method)

�e apparent error rate is

y0

E(D(y0, y0))

∑D(yi, y i)1

N

41

Cross validation

�e problem with the apparent error rate, , is that theprediction model is estimated from the same data as the evaluation. �eapparent error becomes too small

We would like to evaluate on an independent validation set to getunbiased estimate.

Cross-valiation tries to obtain validation data from the original data.

∑D(yi, y i)1N

42

Leave-one-out (LOO) Cross validation

1. Drop data , refit the model and get a prediction model2. Compute 3. Do that for each .

Compute the CV error rate

(yi, xi)

D(yi, y i)

i = 1, … , N

ErrCV = ∑D(yi, y i)1

N

43

Leave-one-out (LOO) Cross validation

library(mlbench)data("PimaIndiansDiabetes2")pima <- PimaIndiansDiabetes2[complete.cases(PimaIndiansDiabetes2),]res2 <- glm(diabetes ~ age + mass + insulin + pregnant, family=binomial, data=pima)library(boot)cv <- cv.glm(pima, res2) ## Default LOO CVcv$delta ## 1st is raw CV error est, 2nd is bias-corrected

[1] 0.1797992 0.1797928

44

-fold Cross-validation

cv <- cv.glm(pima, res2, K=10) ## 10-fold CVcv$delta ## 1st is raw CV error est

[1] 0.1796179 0.1793682

Smaller results in fewer builds of prediction modelSmaller results in groups of data that potentially vary more -greater variation between prediction model.

K

K

K

45

Penalized regressionPenalized regression

4646

Penalized regression models

Recall the generalized linear model

or equivalently

where the function is in principle any function that maps thepopulation mean into the linear predictor. is called the link function

g(E[Yi]) = β0 + β1X1i + ⋯ + βP XPi = Xβ

E[Yi] = g−1(β0 + β1X1i + ⋯ + βP XPi) = g−1(Xβ)

g

g

47

Estimator in linear regression

�e ordinary least squares estimator is

Written as least squares, so minimizes

What do we do then is large?

β = (X tX)−1X ty

β

(y − Xβ)t(y − Xβ) = ∥y − Xβ∥22

P

48

Penalized regression - the lasso

Assume linear mean e�fect:

�e Lasso estimates by minimizing the penalized LS function

so

Y = Xβ + ϵ

β

Zn(β) = (Y − Xβ)t(Y − Xβ) + λN ∥β∥

= ∥(Y − Xβ)∥22 + λN ∥β∥1

1

N

β = arg minβ∈RP Zn(β)

49

Properties of the lasso

Even for or regularized regression methods can:

select a sparse modellead to accurate prediction

Limitations of lasso type regularization:

not consistent in variable selectionnon-standard limiting distributionno oracle propertymultiple testing problem

P > N P ≫ N

50

Example: Biopsy Data on Breast Cancer Patients

Biopsies of breast tumours for 699 patients up to 15 July 1992; each ofnine attributes has been scored on a scale of 1 to 10, and the outcome isalso known.

Predictors: clump thickness, uniformity of cell size, uniformity of cellshape, marginal adhesion, single epithelial cell size, bare nuclei (16values are missing), bland chromatin, normal nucleoli, mitoses.

Output is "benign" or "malignant".

51

Example: The biopsy data

library("MASS") # Get datadata(biopsy)# Remove missing variablesccb <- complete.cases(biopsy)predictors <- as.matrix(scale(biopsy[ccb,2:10])) y <- as.numeric(biopsy$class[ccb])-1

library("glmnet")res <- glmnet(predictors, y, family="binomial")

52

plot(res, lwd=plot(res, lwd=33, xvar=, xvar="lambda""lambda"))

5353


coef(res, s=.3)

10 x 1 sparse Matrix of class "dgCMatrix" 1(Intercept) -0.6425224V1 . V2 0.1394930V3 0.1102769V4 . V5 . V6 0.2221975V7 . V8 . V9 .

54

res2 <- cv.glmnet(predictors, y, family=res2 <- cv.glmnet(predictors, y, family="binomial""binomial"))plot(res2, lwd=plot(res2, lwd=33, cex=, cex=1.81.8))

5555


coef(res, s=exp(-3.3))

10 x 1 sparse Matrix of class "dgCMatrix" 1(Intercept) -0.93594535V1 0.61727404V2 0.44331361V3 0.48275683V4 0.14338234V5 0.05694629V6 0.94445772V7 0.36293816V8 0.28012376V9 .

56

Delassoing

We know the lasso estimates are shrunk towards zero.

What if we want optimized prediction?

Ad hoc approach: delassoing

1. Use lasso for predictor selection2. Refit model with only selected predictors to obtain unbiased

estimates

57

predictors <- scale(predictors)res <- glm(y ~ predictors, family="binomial")tidy(res)

# A tibble: 10 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) -1.09 0.323 -3.38 0.000714 2 predictorsV1 1.51 0.401 3.77 0.000165 3 predictorsV2 -0.0192 0.641 -0.0300 0.976 4 predictorsV3 0.964 0.689 1.40 0.162 5 predictorsV4 0.947 0.354 2.68 0.00740 6 predictorsV5 0.215 0.348 0.617 0.537 7 predictorsV6 1.40 0.342 4.08 0.0000447 8 predictorsV7 1.10 0.420 2.61 0.00907 9 predictorsV8 0.650 0.345 1.89 0.0591 10 predictorsV9 0.927 0.570 1.63 0.104

58

Other penalties

�e Ridge regression estimates by minimizing the penalized LSfunction

this corresponds toadding a small value to the diagonal elements of thecorrelation matrix of the parameters. Introduces bias but reducesvariation. No selection but all predictors scaled down.

β

Zn(β) = ∥(Y − Xβ)∥22 + λN ∥β∥2

1

59

res <- glmnet(predictors, y, alpha = res <- glmnet(predictors, y, alpha = 00))plot(res, lwd=plot(res, lwd=33))

6060

Elastic net

Combine the sparsity from lasso and the �lexibility of ridge regression.

Zn(β) = ∥(Y − Xβ)∥22 + αλN ∥β∥1 + (1 − α)λN ∥β∥2

1

61

res <- glmnet(predictors, y, alpha = res <- glmnet(predictors, y, alpha = .3.3))plot(res, lwd=plot(res, lwd=33))

6262

Regularization and statistical inference

library("selectiveInference")res <- glmnet(predictors, y, family="binomial")lambda <- exp(-4.5)beta <- coef(res, x=predictors, y=y, s=lambda/683, exact=T)res <- fixedLassoInf(predictors, y, beta, lambda, family="binomial", alpha=0.05)

Warning in fixedLogitLassoInf(x, y, beta, lambda, alpha= alpha, type = type, : Solution beta does not satisfythe KKT conditions (to within specified tolerances)

63

res

Call:fixedLassoInf(x = predictors, y = y, beta = beta, lambda = lambda, family = "binomial", alpha = 0.05)

Testing results at lambda = 0.011, with alpha = 0.050

Var Coef Z-score P-value LowConfPt UpConfPt 1 1.509 3.770 0.596 -9.749 2.005 2 -0.019 -0.030 0.989 0.857 Inf 3 0.964 1.399 0.962 -Inf 0.570 4 0.947 2.680 0.808 -17.165 1.237 5 0.215 0.617 0.863 -10.079 0.635 6 1.396 4.084 0.000 0.743 3.114 7 1.095 2.611 0.749 -14.329 1.537 8 0.650 1.889 0.805 -12.461 1.008 9 0.927 1.629 0.705 -10.952 1.724 LowTailArea UpTailArea 64

advanced statistical topics a · 11/4/2019 · advanced statistical topics a day 1 - high...

Documents