Advanced Statistical Topics A · Day 1 - High dimensional data · Claus Thorn Ekstrøm, UCPH Biostatistics, ekstrom@sund.ku.dk · November 4th, 2019 · Slides @ course website


Page 1: Advanced Statistical Topics A · 11/4/2019  · Advanced Statistical Topics A Day 1 - High dimensional data Claus Thorn Ekstrøm UCPH Biostatistics ekstrom@sund.ku.dk November 4th,

Advanced Statistical Topics A

Day 1 - High dimensional data

Claus Thorn Ekstrøm
UCPH Biostatistics
ekstrom@sund.ku.dk

November 4th, 2019
Slides @ course website

1

Page 2:

Practicalities

Every day from 8.15 to 15 in room 2.1.12
Internet access: eduroam or KU-guest
R
Breaks as we go along
Please read before classes

2

Page 3:

Course overview

1. High-dimensional data, penalized regression, bootstrap, cross-validation

2. Multiple imputation techniques

3. Classification/regression trees, random forests

4. Dimension reduction, randomization tests

web.stanford.edu/~hastie/CASI/

3

Page 4:

Quick refresher on statistical models

4

Page 5:

Multiple regression / linear models refresher

Model y as a function of predictors x_1, …, x_P. For individual i

y_i = β_0 + β_1 x_{1i} + β_2 x_{2i} + ⋯ + β_P x_{Pi} + ε_i

or in matrix notation

Y = Xβ + ε

The overall trend or mean effect is

E(Y) = Xβ

5

Page 6:

Interpretation of parameters

Each coefficient, β_j, expresses the expected change in the response y when x_j changes one unit and the remaining explanatory variables are held fixed.
It might be that we have a significant effect of x_1 in the univariate model, while its effect in the multiple linear model is non-significant. Its effect is explained by the other explanatory variables.

6

Page 7:

Generalized linear models

Extends the linear model to other outcomes and distributions. Still focus on the mean effect but for the transformed data

g(E[Y_i]) = β_0 + β_1 X_{1i} + ⋯ + β_P X_{Pi}

or equivalently

E[Y_i] = g^{-1}(β_0 + β_1 X_{1i} + ⋯ + β_P X_{Pi})

where the link function g in principle could be any function that maps the population mean into the linear predictor.
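As a concrete illustration (my own, not from the slides): one common choice of g is the logit, which gives logistic regression for binary outcomes.

```r
# A minimal sketch: the logit link g and its inverse (assumed example choice of g)
g    <- function(mu)  log(mu / (1 - mu))         # mean scale -> linear predictor
ginv <- function(eta) exp(eta) / (1 + exp(eta))  # linear predictor -> mean scale

ginv(0)        # a linear predictor of 0 corresponds to a mean of 0.5
g(ginv(1.2))   # g and its inverse undo each other: returns 1.2
```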

7

Page 8:

Example: FEV in children

Effect of smoking on the pulmonary function of children. N = 654. Info on age, height, sex, smoking, and fev (forced expiratory volume).

8

Page 9:

library("broom")
res <- lm(FEV ~ Age + Ht + I(Ht^2) + Gender + Smoke, data=fev)
tidy(res)

# A tibble: 6 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   6.83     1.50        4.54  6.68e- 6
2 Age           0.0700   0.00913     7.67  6.31e-14
3 Ht          -10.7      1.96       -5.46  6.85e- 8
4 I(Ht^2)       4.81     0.635       7.58  1.21e-13
5 GenderBoy     0.0939   0.0330      2.85  4.51e- 3
6 SmokeYes     -0.136    0.0573     -2.38  1.76e- 2

Each unit change in an x produces an average effect in y.

9

Page 10:

Example: Who survived the Titanic disaster?

N = 2201 individuals on the Titanic. Who survived? Info on Class (1st, 2nd, 3rd, Crew), Sex, Age group, and survival status.

log( P(Y = 1) / (1 − P(Y = 1)) ) = Xβ = β_0 + β_1 x_1 + ⋯ + β_5 x_5

and then

P(Y = 1) = exp(β_0 + β_1 x_1 + ⋯ + β_5 x_5) / (1 + exp(β_0 + β_1 x_1 + ⋯ + β_5 x_5))

10

Page 11:

res <- glm(Survived ~ Class + Sex + Age, data=titanic, family=binomial)
tidy(res)

# A tibble: 6 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.685     0.273      2.51 1.21e- 2
2 Class2nd      -1.02      0.196     -5.19 2.05e- 7
3 Class3rd      -1.78      0.172    -10.4  3.69e-25
4 ClassCrew     -0.858     0.157     -5.45 5.00e- 8
5 SexFemale      2.42      0.140     17.2  1.43e-66
6 AgeAdult      -1.06      0.244     -4.35 1.36e- 5

Estimates are log odds ratios. Use exp() to transform back to odds ratios.

11

Page 12:

The joy of big data

12

Page 13:

13

Page 14:

Today's topics

Bootstrap
Cross validation
Penalized regression

Handle complicated modeling aspects with many potential predictors.

14

Page 15:

Evaluating complex methods and data

When we have complex data (or perhaps just big data combined with simple methods) and non-parametric methods, we still wish to evaluate them.

How stable are the results?

15

Page 16:

The bootstrap/jackknife procedures

Whenever we provide an estimate (mean, proportion, ...) we also want toinfer its precision!

We may or may not be able to formulate a full parametric (or semi-parametric) model.

The bootstrap procedure allows us to estimate the standard error even in complicated situations or for non-standard statistics.

16

Page 17:

Statistics 101: populations and samples

17

Page 18:

Statistics 101: multiple samples

Different samples will result in different outcomes.

If we had the means to produce several samples we would know the sampling distribution.

18

Page 19:

Sample variation

19

Page 20:

Sample variation

20

Page 21:

Sample variation

21

Page 22:

Sample variation

22

Page 23:

Resampling

Have observations x_i ∼ F and an estimate

θ̂ = s(x)

for some estimation algorithm.

Want the SE of θ̂.

23

Page 24:

The jackknife estimate

Estimate θ N times - once with each observation removed:

θ̂(−i) = s(x(−i))

Then the jackknife estimator of the standard error is

SE(θ̂) = sqrt( (N − 1)/N · Σ_{i=1}^N ( θ̂(−i) − θ̄(−·) )² )

where θ̄(−·) is the average of the N leave-one-out estimates.

Reduces to the standard error if θ̂ is a sample average.
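The jackknife formula can be sketched in a few lines of base R (my own illustration, not the slides' code); for the sample mean it reduces to the familiar sd(x)/sqrt(n).

```r
# Sketch of the jackknife SE: leave-one-out estimates plugged into the formula
jack_se <- function(x, theta) {
  n  <- length(x)
  th <- sapply(seq_len(n), function(i) theta(x[-i]))  # leave-one-out estimates
  sqrt((n - 1) / n * sum((th - mean(th))^2))
}

x <- c(1, 4, 2, 8, 5)
# For a sample average the jackknife SE equals the usual sd(x)/sqrt(n):
all.equal(jack_se(x, mean), sd(x) / sqrt(length(x)))
```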

24

Page 25:

The jackknife estimate

Non-parametric, no assumptions on F; uses samples of size N − 1.
In general: the jackknife standard error is upwardly biased.
Only to be used with a smooth, differentiable statistic.

x <- c(8.26, 6.33, 10.4, 5.27, 5.35, 5.61, 6.12, 6.19, 5.2, 7.01,
       8.74, 7.78, 7.02, 6, 6.5, 5.8, 5.12, 7.41, 6.52, 6.21,
       12.28, 5.6, 5.38, 6.6, 8.74)
CV <- function(x) sqrt(var(x))/mean(x)
CV(x)

[1] 0.2524712

library("bootstrap")
res <- jackknife(x, CV)

25

Page 26:

The jackknife estimate

res

$jack.se
[1] 0.05389943

$jack.bias
[1] -0.009266436

$jack.values
 [1] 0.2563873 0.2565586 0.2384298 0.2507329 0.2513200
 [6] 0.2530603 0.2557374 0.2560293 0.2501992 0.2580969
[11] 0.2541045 0.2577524 0.2581067 0.2551946 0.2571038
[16] 0.2541711 0.2495662 0.2581975 0.2571609 0.2561093
[21] 0.2020978 0.2529980 0.2515338 0.2573745 0.2541045

$call
jackknife(x = x, theta = CV)

26

Page 27:

Estimating bias

When θ̂ is unbiased then

E(θ̂) = (1/N) Σ_{i=1}^N E( θ̂(−i) ) = θ

But if the procedure has bias

E(θ̂) = θ + a/N + b/N² + rest

then we can estimate the size of the bias from the jackknife results.

27

Page 28:

Estimating bias

Thus,

E( θ̄(−·) − θ̂ ) = a/(N(N − 1)) + rest

and

bias_jack = (N − 1)( θ̄(−·) − θ̂ ) = a/N.

Furthermore, the bias-corrected jackknife estimate,

θ̂_jack = θ̂ − bias_jack

is an unbiased estimate of θ up to second order.

28

Page 29:

Nonparametric bootstrap

Use the sample as the population and generate "fake samples"

Population -> Sample -> "Fake sample"

29

Page 30:

Nonparametric bootstrap

Get a random bootstrap sample from the sample with replacement

x* = (x*_1, x*_2, …, x*_N)

Then we can get

θ̂* = s(x*)

Do that B times and get information about the full distribution.
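These steps can be sketched in base R (my own illustration with simulated data; the sample x and B = 1000 are assumptions, not the slides' example):

```r
# Nonparametric bootstrap: B resamples of the data, statistic recomputed each time
set.seed(1)
x  <- rexp(30)                     # some observed sample
CV <- function(x) sd(x) / mean(x)  # the statistic s(.)

B <- 1000
theta_star <- replicate(B, CV(sample(x, replace = TRUE)))  # B "fake samples"

sd(theta_star)                          # bootstrap estimate of SE(theta-hat)
quantile(theta_star, c(0.025, 0.975))   # a look at the full distribution
```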

30

Page 31:

Jackknife vs bootstrap

Jackknife provides stable results (will always get the same result) whereas the bootstrap varies.
Jackknife only estimates the variance of the point estimator whereas the bootstrap provides information on the distribution.

SE = sqrt( Σ_b ( θ̂*_b − θ̄* )² / (B − 1) )

31

Page 32:

Nonparametric bootstrap in R

results <- bootstrap(x, 200, CV)

32

Page 33:

Parametric bootstrap

The nonparametric bootstrap made no assumptions about the distribution.

If the distribution is known then use that information.

Fit model to data
Draw B samples of random numbers from the fitted model
Use those for bootstrap

Useful for small sample sizes (assuming the model holds), difficult evaluations, ...

33

Page 34:

Parametric bootstrap - resample residuals

Fit model to data, keep predictions ŷ_i and compute a vector of residuals, ε̂_i = y_i − ŷ_i.
Create new sets of observations y*_i = ŷ_i + ε̂_j using a randomly drawn residual.
Refit the model using the new set of response variables, and compute the statistic.
Do that B times.

34

Page 35:

Bootstrap con�dence intervals

Standard 95% confidence intervals

θ̂ ± 1.96 · SE

Could get that directly from the bootstrap results.

mean(results$thetastar) + c(-1.96, 1.96)*sd(results$thetastar)

[1] 0.1497948 0.3262128

Only really works if the distribution is symmetric.

35

Page 36:

Bootstrap percentile con�dence intervals

Generate the "full" distribution. Cut off 2.5% at each end. Use an improvement that depends on the precision of the percentiles.

bcanon(x, 2000, CV)

$confpoints
     alpha bca point
[1,] 0.025 0.1833005
[2,] 0.050 0.1957387
[3,] 0.100 0.2111287
[4,] 0.160 0.2241942
[5,] 0.840 0.3100553
[6,] 0.900 0.3267094
[7,] 0.950 0.3398162
[8,] 0.975 0.3699479

$z0

36

Page 37:

Bootstrap percentile con�dence intervals

Can also use the t distribution

boott(x, CV)

$confpoints
     0.001     0.01      0.025     0.05      0.1
[1,] 0.1103237 0.1438075 0.1640236 0.1723388 0.189396
     0.5       0.9       0.95      0.975     0.99
[1,] 0.2633288 0.4086446 0.4722277 0.54459   0.6026483
     0.999
[1,] 0.6582787

$theta
NULL

$g
NULL

37

Page 38:

Cross-validation

Evaluating a model on the same data results in overfitting

The production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.

38

Page 39:

39

Page 40:

Prediction error

Let y_i be the response and ŷ_i the prediction based on a model and predictors x_i.

Two common choices for quantifying prediction error (should use a proper scoring rule):

D(y, ŷ) = (y − ŷ)²

D(y, ŷ) = 1 if y ≠ ŷ, 0 if y = ŷ
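The two choices above written out as R functions (my own sketch; the names D_sq and D_01 are not from the slides):

```r
# Squared-error loss for continuous outcomes, 0/1 loss for classification
D_sq <- function(y, yhat) (y - yhat)^2            # squared error
D_01 <- function(y, yhat) as.numeric(y != yhat)   # 0/1 misclassification

D_sq(3, 2.5)        # 0.25
D_01("yes", "no")   # 1
D_01("yes", "yes")  # 0
```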

40

Page 41:

Error rates

True error rate is (for a new observation y_0 and a fixed prediction method)

E( D(y_0, ŷ_0) )

The apparent error rate is

(1/N) Σ D(y_i, ŷ_i)

41

Page 42:

Cross validation

The problem with the apparent error rate, (1/N) Σ D(y_i, ŷ_i), is that the prediction model is estimated from the same data as the evaluation. The apparent error becomes too small.

We would like to evaluate on an independent validation set to get an unbiased estimate.

Cross-validation tries to obtain validation data from the original data.

42

Page 43:

Leave-one-out (LOO) Cross validation

1. Drop data (y_i, x_i), refit the model and get a prediction model
2. Compute D(y_i, ŷ_i)
3. Do that for each i = 1, …, N.

Compute the CV error rate

Err_CV = (1/N) Σ D(y_i, ŷ_i)

43

Page 44:

Leave-one-out (LOO) Cross validation

library(mlbench)
data("PimaIndiansDiabetes2")
pima <- PimaIndiansDiabetes2[complete.cases(PimaIndiansDiabetes2),]
res2 <- glm(diabetes ~ age + mass + insulin + pregnant, family=binomial, data=pima)
library(boot)
cv <- cv.glm(pima, res2) ## Default LOO CV
cv$delta ## 1st is raw CV error est, 2nd is bias-corrected

[1] 0.1797992 0.1797928

44

Page 45:

K-fold Cross-validation

cv <- cv.glm(pima, res2, K=10) ## 10-fold CV
cv$delta ## 1st is raw CV error est

[1] 0.1796179 0.1793682

Smaller K results in fewer builds of the prediction model.
Smaller K results in groups of data that potentially vary more - greater variation between prediction models.

45

Page 46:

Penalized regression

46

Page 47:

Penalized regression models

Recall the generalized linear model

g(E[Y_i]) = β_0 + β_1 X_{1i} + ⋯ + β_P X_{Pi} = Xβ

or equivalently

E[Y_i] = g^{-1}(β_0 + β_1 X_{1i} + ⋯ + β_P X_{Pi}) = g^{-1}(Xβ)

where the function g is in principle any function that maps the population mean into the linear predictor. g is called the link function.

47

Page 48:

Estimator in linear regression

The ordinary least squares estimator is

β̂ = (X^t X)^{-1} X^t y

Written as least squares, so β̂ minimizes

(y − Xβ)^t (y − Xβ) = ‖y − Xβ‖²₂

What do we do when P is large?
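The estimator can be computed directly from the formula and checked against lm() (my own sketch with simulated data; N and the true β are assumptions):

```r
# OLS via (X'X)^{-1} X'y, compared with R's lm() fit
set.seed(3)
N <- 40
X <- cbind(1, rnorm(N), rnorm(N))   # design matrix with an intercept column
beta <- c(1, 2, -1)
y <- as.numeric(X %*% beta + rnorm(N))

beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # the closed-form estimator
fit <- lm(y ~ X - 1)                       # same model; X already has an intercept

all.equal(as.numeric(beta_hat), unname(coef(fit)))  # the two agree
```

solve(A, b) is preferred over explicitly inverting X'X, for numerical stability.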

48

Page 49:

Penalized regression - the lasso

Assume linear mean effect:

Y = Xβ + ε

The lasso estimates β by minimizing the penalized LS function

Z_n(β) = (Y − Xβ)^t (Y − Xβ) + λ_N ‖β‖₁
       = ‖Y − Xβ‖²₂ + λ_N ‖β‖₁

so

β̂ = argmin_{β ∈ R^P} Z_n(β)

49

Page 50:

Properties of the lasso

Even for P > N or P ≫ N regularized regression methods can:

select a sparse model
lead to accurate prediction

Limitations of lasso-type regularization:

not consistent in variable selection
non-standard limiting distribution
no oracle property
multiple testing problem

50

Page 51:

Example: Biopsy Data on Breast Cancer Patients

Biopsies of breast tumours for 699 patients up to 15 July 1992; each ofnine attributes has been scored on a scale of 1 to 10, and the outcome isalso known.

Predictors: clump thickness, uniformity of cell size, uniformity of cellshape, marginal adhesion, single epithelial cell size, bare nuclei (16values are missing), bland chromatin, normal nucleoli, mitoses.

Output is "benign" or "malignant".

51

Page 52:

Example: The biopsy data

library("MASS") # Get data
data(biopsy)
# Remove missing variables
ccb <- complete.cases(biopsy)
predictors <- as.matrix(scale(biopsy[ccb,2:10]))
y <- as.numeric(biopsy$class[ccb])-1

library("glmnet")
res <- glmnet(predictors, y, family="binomial")

52

Page 53:

plot(res, lwd=3, xvar="lambda")

53

Page 54:

Example: The biopsy data

coef(res, s=.3)

10 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) -0.6425224
V1           .
V2           0.1394930
V3           0.1102769
V4           .
V5           .
V6           0.2221975
V7           .
V8           .
V9           .

54

Page 55:

res2 <- cv.glmnet(predictors, y, family="binomial")
plot(res2, lwd=3, cex=1.8)

55

Page 56:

Example: The biopsy data

coef(res, s=exp(-3.3))

10 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept) -0.93594535
V1           0.61727404
V2           0.44331361
V3           0.48275683
V4           0.14338234
V5           0.05694629
V6           0.94445772
V7           0.36293816
V8           0.28012376
V9           .

56

Page 57:

Delassoing

We know the lasso estimates are shrunk towards zero.

What if we want optimized prediction?

Ad hoc approach: delassoing

1. Use lasso for predictor selection
2. Refit model with only selected predictors to obtain unbiased estimates
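The two steps can be sketched with simulated data (my own illustration; it requires glmnet, and the data and the penalty s = 0.05 are assumptions, not the slides' biopsy example):

```r
# Delassoing sketch: lasso for selection, ordinary glm for the final estimates
library("glmnet")
set.seed(4)
N <- 200; P <- 10
X <- matrix(rnorm(N * P), N, P)
colnames(X) <- paste0("V", 1:P)
y <- rbinom(N, 1, plogis(X[, 1] - X[, 2]))  # only V1 and V2 matter

fit <- glmnet(X, y, family = "binomial")
b   <- as.numeric(coef(fit, s = 0.05))[-1]  # lasso coefficients, intercept dropped
sel <- which(b != 0)                        # step 1: lasso selects predictors

refit <- glm(y ~ X[, sel, drop = FALSE], family = "binomial")  # step 2: unpenalized refit
coef(refit)  # estimates no longer shrunk toward zero
```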

57

Page 58:

predictors <- scale(predictors)
res <- glm(y ~ predictors, family="binomial")
tidy(res)

# A tibble: 10 x 5
   term         estimate std.error statistic   p.value
   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)   -1.09      0.323    -3.38   0.000714
 2 predictorsV1   1.51      0.401     3.77   0.000165
 3 predictorsV2  -0.0192    0.641    -0.0300 0.976
 4 predictorsV3   0.964     0.689     1.40   0.162
 5 predictorsV4   0.947     0.354     2.68   0.00740
 6 predictorsV5   0.215     0.348     0.617  0.537
 7 predictorsV6   1.40      0.342     4.08   0.0000447
 8 predictorsV7   1.10      0.420     2.61   0.00907
 9 predictorsV8   0.650     0.345     1.89   0.0591
10 predictorsV9   0.927     0.570     1.63   0.104

58

Page 59:

Other penalties

The ridge regression estimates β by minimizing the penalized LS function

Z_n(β) = ‖Y − Xβ‖²₂ + λ_N ‖β‖²₂

This corresponds to adding a small value to the diagonal elements of the correlation matrix of the parameters. Introduces bias but reduces variation. No selection but all predictors are scaled down.

59

Page 60:

res <- glmnet(predictors, y, alpha = 0)
plot(res, lwd=3)

60

Page 61:

Elastic net

Combine the sparsity from lasso and the flexibility of ridge regression.

Z_n(β) = ‖Y − Xβ‖²₂ + αλ_N ‖β‖₁ + (1 − α)λ_N ‖β‖²₂

61

Page 62:

res <- glmnet(predictors, y, alpha = .3)
plot(res, lwd=3)

62

Page 63:

Regularization and statistical inference

library("selectiveInference")
res <- glmnet(predictors, y, family="binomial")
lambda <- exp(-4.5)
beta <- coef(res, x=predictors, y=y, s=lambda/683, exact=T)
res <- fixedLassoInf(predictors, y, beta, lambda, family="binomial", alpha=0.05)

Warning in fixedLogitLassoInf(x, y, beta, lambda, alpha = alpha, type = type, :
  Solution beta does not satisfy the KKT conditions (to within specified tolerances)

63

Page 64:

res

Call:
fixedLassoInf(x = predictors, y = y, beta = beta, lambda = lambda,
    family = "binomial", alpha = 0.05)

Testing results at lambda = 0.011, with alpha = 0.050

 Var   Coef Z-score P-value LowConfPt UpConfPt
   1  1.509   3.770   0.596    -9.749    2.005
   2 -0.019  -0.030   0.989     0.857      Inf
   3  0.964   1.399   0.962      -Inf    0.570
   4  0.947   2.680   0.808   -17.165    1.237
   5  0.215   0.617   0.863   -10.079    0.635
   6  1.396   4.084   0.000     0.743    3.114
   7  1.095   2.611   0.749   -14.329    1.537
   8  0.650   1.889   0.805   -12.461    1.008
   9  0.927   1.629   0.705   -10.952    1.724
 LowTailArea UpTailArea

64