advanced statistical topics a · 11/4/2019 · advanced statistical topics a day 1 - high...
TRANSCRIPT
Advanced Statistical Topics AAdvanced Statistical Topics A
Day 1 - High dimensional dataDay 1 - High dimensional data
Claus Thorn EkstrømClaus Thorn EkstrømUCPH BiostatisticsUCPH [email protected]@sund.ku.dk
November 4th, 2019November 4th, 2019Slides @ Slides @ course websitecourse website
11
Practicalities
Every day from 8.15 to 15 in room 2.1.12Internet access: eduroam or KU-guestRBreaks as we go alongPlease read before classes
2
1. High-dimensional data,penalized regression,bootstrap, cross-validation
2. Multiple imputationtechniques
3. Classification/regression tress,random forests
4. Dimension reduction,randomization tests
web.stanford.edu/~hastie/CASI/
Course overview
3
Quick refresher on statistical modelsQuick refresher on statistical models
44
Multiple regression / linear models refresher
Model as a function of predictors . For individual
or in matrix notation
�e overall trend or mean e�fect is
y x1, … , xP i
yi = β0 + β1x1i + β2x2i + ⋯ + βP xPi + ϵi
Y = Xβ + ϵ
E(Y ) = Xβ
5
Interpretation of parameters
Each coe�ficient, , expresses the expected change in the response when changes one unit and the remaining explanatory variablesare held fixed.Might be that we have a significant e�fect of in the univariatemodel, while its e�fect in the mult. lin. model is non-significant. Itse�fect is explained by the other explanatory variables.
βj y
xj
x1
6
Generalized linear models
Extends the linear more to other outcomes and distributions. Still focuson the mean e�fect but for the transformed data.
or equivalently
where the link function in principle could be any function that mapsthe population mean into the linear predictor.
g(E[Yi]) = β0 + β1X1i + ⋯ + βP XPi
E[Yi] = g−1(β0 + β1X1i + ⋯ + βP XPi)
g
7
Example: FEV in children
E�fect of smoking on the pulmonary function of children. . Infoon age, height, sex, smoking, and fev (forced expiratory volume).
N = 654
8
library("broom")res <- lm(FEV ~ Age + Ht + I(Ht^2) + Gender + Smoke, data=fev)tidy(res)
# A tibble: 6 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl>1 (Intercept) 6.83 1.50 4.54 6.68e- 62 Age 0.0700 0.00913 7.67 6.31e-143 Ht -10.7 1.96 -5.46 6.85e- 84 I(Ht^2) 4.81 0.635 7.58 1.21e-135 GenderBoy 0.0939 0.0330 2.85 4.51e- 36 SmokeYes -0.136 0.0573 -2.38 1.76e- 2
Each unit change in an produces an average e�fect in .x y
9
Example: Who survived the Titanic disaster?
individuals on the Titanic. Who survived? Info on Class (1st,2nd, 3rd, Crew), Sex, Age group, and survival status.
and then
N = 2201
log( ) = Xβ = β0 + β1x1 + ⋯ + β5x5P(Y = 1)
1 − P(Y = 1)
P(Y = 1) =exp(β0 + β1x1 + ⋯ + β5x5)
1 + exp(β0 + β1x1 + ⋯ + β5x5)
10
res <- glm(Survived ~ Class + Sex + Age, data=titanic, family=binomial)tidy(res)
# A tibble: 6 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl>1 (Intercept) 0.685 0.273 2.51 1.21e- 22 Class2nd -1.02 0.196 -5.19 2.05e- 73 Class3rd -1.78 0.172 -10.4 3.69e-254 ClassCrew -0.858 0.157 -5.45 5.00e- 85 SexFemale 2.42 0.140 17.2 1.43e-666 AgeAdult -1.06 0.244 -4.35 1.36e- 5
Estimates are log odds ratios. Use exp() to transform back to oddsratios.
11
The joy of big data
12
13
Today's topicsToday's topics
BootstrapBootstrapCross validationCross validationPenalized regressionPenalized regression
Handle complicated modeling aspects with many potential predictors.Handle complicated modeling aspects with many potential predictors.
1414
Evaluating complex methods and data
When we have complex data (og perhaps maybe just big data combinedwith simple methods) and non-parametric methods then we still with toevaluate them.
How stable are the results?
15
The bootstrap/jackknife procedures
Whenever we provide an estimate (mean, proportion, ...) we also want toinfer its precision!
We may or may not be able to formulate a full parametric (or semi-parametric model).
�e bootstrap procedure allows us to estimate the standard error even incomplicated situations or for non-standard statistics.
16
Statistics 101: populations and sample
17
Statistics 101: multiple samples
Di�ferent samples will result in di�ferent outcomes.
If we had the means to produce several samples we would know thesampling distribution.
18
Sample variation
19
Sample variation
20
Sample variation
21
Sample variation
22
Resampling
Have observations and an estimate
for some estimation algorithm.
Want the SE of .
xi ∼ F
θ = s(x)
θ
23
The jackknife estimate
Estimate times - once with each observation removed.
�en the jackknife estimator is
Reduces to the standard error if is a sample average.
θ N
θ (−i) = s(x(−i))
SE(θ) =
⎷
N
∑i=1
(θ (−i) − ^θ (−))2N − 1
N
θ
24
The jackknife estimate
Non-parametric, no assumptions on , samples of size In general: the jackknife standard error is upwardly biased.Only to be used with smooth, di�ferentiable statistic
x <-c(8.26, 6.33, 10.4, 5.27, 5.35, 5.61, 6.12, 6.19, 5.2, 7.01, 8.74, 7.78, 7.02, 6, 6.5, 5.8, 5.12, 7.41, 6.52, 6.21, 12.28, 5.6, 5.38, 6.6, 8.74)CV <- function(x) sqrt(var(x))/mean(x)CV(x)
[1] 0.2524712
library("bootstrap") ; res <- jackknife(x, CV)
F N − 1
25
The jackknife estimate
res
$jack.se[1] 0.05389943
$jack.bias[1] -0.009266436
$jack.values [1] 0.2563873 0.2565586 0.2384298 0.2507329 0.2513200 [6] 0.2530603 0.2557374 0.2560293 0.2501992 0.2580969[11] 0.2541045 0.2577524 0.2581067 0.2551946 0.2571038[16] 0.2541711 0.2495662 0.2581975 0.2571609 0.2561093[21] 0.2020978 0.2529980 0.2515338 0.2573745 0.2541045
$calljackknife(x = x, theta = CV)
26
Estimating bias
When is unbiased then
But if the procedure has bias
then we can estimate the size of the bias from jackknife results.
θ
E(^θ) =N
∑i=1
E(θ (−i)) = θ1
N
E(θ) = θ + + + resta
N
b
N 2
27
Estimating bias
�us,
Furthermore, the bias-corrected jackknife estimate,
is an unbiased estimate of up to second order.
E(^θ − θ) = + resta
N(N − 1)
biasjack = (N − 1)(^θ − θ) = .a
N
θ jack = θ − biasjack
θ 28
Nonparametric bootstrap
Use the sample as the population and generate "fake samples"
Population -> Sample -> "Fake sample"
29
Nonparametric bootstrap
Get a random bootstrap sample from the sample with replacement
�en we can get
Do that times and get information about the full distribution.
x∗ = (x∗1, x∗
2, … , x∗N
)
θ∗
= s(x∗)
B
30
Jackknife vs bootstrap
Jackknife provides stable results (will always get the same result)whereas bootstrap varies.Jackknife only estimates the variance of the point estimator whereasthe bootstrap provides information on the distribution.
SE = √∑(θ∗b
− θ∗−
)2
B − 1
31
Nonparametric bootstrap in R
results <- bootstrap(x, 200, CV)
32
Parametric bootstrap
�e nonparametric bootstrap made no assumptions about thedistribution.
If distribution is known then use that information.
Fit model to dataDraw samples of random numbers from the fitted modeUse those for bootstrap
Useful for small sample sizes (assuming the model holds), di�ficultevaluations, ...
B
33
Parametric bootstrap - resample residuals
Fit model to data, keep predictions and compute a vector ofresiduals, .Create new sets of observations using a randomresidual.Refit the model using the new set of response variables, and computethe statisticDo that times
y i
ϵ i = yi − y i
y∗ = y i + ϵ j
B
34
Bootstrap con�dence intervals
Standard 95% confidence intervals
Could get that directly from the bootstrap results.
mean(results$thetastar) + c(-1.96, 1.96)*sd(results$thetastar)
[1] 0.1497948 0.3262128
Only really works if the distribution is symmetric
θ ± 1.96SE
35
Bootstrap percentile con�dence intervals
Generate the "full" distribution. Cut o�f 2.5% at each end. Use animprovement that depends on the precision of the percentiles.
bcanon(x, 2000, CV)
$confpoints alpha bca point[1,] 0.025 0.1833005[2,] 0.050 0.1957387[3,] 0.100 0.2111287[4,] 0.160 0.2241942[5,] 0.840 0.3100553[6,] 0.900 0.3267094[7,] 0.950 0.3398162[8,] 0.975 0.3699479
$z036
Bootstrap percentile con�dence intervals
Can also use the distribution
boott(x, CV)
$confpoints 0.001 0.01 0.025 0.05 0.1[1,] 0.1103237 0.1438075 0.1640236 0.1723388 0.189396 0.5 0.9 0.95 0.975 0.99[1,] 0.2633288 0.4086446 0.4722277 0.54459 0.6026483 0.999[1,] 0.6582787
$thetaNULL
$gNULL
t
37
Cross-validation
Evaluating a model on the same data results in overfitting
�e production of an analysis that corresponds too closely or exactly to aparticular set of data, and may therefore fail to fit additional data orpredict future observations reliably
38
39
Prediction error
Let be the response and the prediction based on a model andpredictors .
Two common choices for quantifying prediction error (should useproper scoring rule)
yi y i
xi
D(y, y) = (y − y)2
D(y, y) = {1 if y ≠ y
0 if y = y
40
Error rates
True error rate is (for new observation and fixed prediction method)
�e apparent error rate is
y0
E(D(y0, y0))
∑D(yi, y i)1
N
41
Cross validation
�e problem with the apparent error rate, , is that theprediction model is estimated from the same data as the evaluation. �eapparent error becomes too small
We would like to evaluate on an independent validation set to getunbiased estimate.
Cross-valiation tries to obtain validation data from the original data.
∑D(yi, y i)1N
42
Leave-one-out (LOO) Cross validation
1. Drop data , refit the model and get a prediction model2. Compute 3. Do that for each .
Compute the CV error rate
(yi, xi)
D(yi, y i)
i = 1, … , N
ErrCV = ∑D(yi, y i)1
N
43
Leave-one-out (LOO) Cross validation
library(mlbench)data("PimaIndiansDiabetes2")pima <- PimaIndiansDiabetes2[complete.cases(PimaIndiansDiabetes2),]res2 <- glm(diabetes ~ age + mass + insulin + pregnant, family=binomial, data=pima)library(boot)cv <- cv.glm(pima, res2) ## Default LOO CVcv$delta ## 1st is raw CV error est, 2nd is bias-corrected
[1] 0.1797992 0.1797928
44
-fold Cross-validation
cv <- cv.glm(pima, res2, K=10) ## 10-fold CVcv$delta ## 1st is raw CV error est
[1] 0.1796179 0.1793682
Smaller results in fewer builds of prediction modelSmaller results in groups of data that potentially vary more -greater variation between prediction model.
K
K
K
45
Penalized regressionPenalized regression
4646
Penalized regression models
Recall the generalized linear model
or equivalently
where the function is in principle any function that maps thepopulation mean into the linear predictor. is called the link function
g(E[Yi]) = β0 + β1X1i + ⋯ + βP XPi = Xβ
E[Yi] = g−1(β0 + β1X1i + ⋯ + βP XPi) = g−1(Xβ)
g
g
47
Estimator in linear regression
�e ordinary least squares estimator is
Written as least squares, so minimizes
What do we do then is large?
β = (X tX)−1X ty
β
(y − Xβ)t(y − Xβ) = ∥y − Xβ∥22
P
48
Penalized regression - the lasso
Assume linear mean e�fect:
�e Lasso estimates by minimizing the penalized LS function
so
Y = Xβ + ϵ
β
Zn(β) = (Y − Xβ)t(Y − Xβ) + λN ∥β∥
= ∥(Y − Xβ)∥22 + λN ∥β∥1
1
N
β = arg minβ∈RP Zn(β)
49
Properties of the lasso
Even for or regularized regression methods can:
select a sparse modellead to accurate prediction
Limitations of lasso type regularization:
not consistent in variable selectionnon-standard limiting distributionno oracle propertymultiple testing problem
P > N P ≫ N
50
Example: Biopsy Data on Breast Cancer Patients
Biopsies of breast tumours for 699 patients up to 15 July 1992; each ofnine attributes has been scored on a scale of 1 to 10, and the outcome isalso known.
Predictors: clump thickness, uniformity of cell size, uniformity of cellshape, marginal adhesion, single epithelial cell size, bare nuclei (16values are missing), bland chromatin, normal nucleoli, mitoses.
Output is "benign" or "malignant".
51
Example: The biopsy data
library("MASS") # Get datadata(biopsy)# Remove missing variablesccb <- complete.cases(biopsy)predictors <- as.matrix(scale(biopsy[ccb,2:10])) y <- as.numeric(biopsy$class[ccb])-1
library("glmnet")res <- glmnet(predictors, y, family="binomial")
52
plot(res, lwd=plot(res, lwd=33, xvar=, xvar="lambda""lambda"))
5353
Example: The biopsy data
coef(res, s=.3)
10 x 1 sparse Matrix of class "dgCMatrix" 1(Intercept) -0.6425224V1 . V2 0.1394930V3 0.1102769V4 . V5 . V6 0.2221975V7 . V8 . V9 .
54
res2 <- cv.glmnet(predictors, y, family=res2 <- cv.glmnet(predictors, y, family="binomial""binomial"))plot(res2, lwd=plot(res2, lwd=33, cex=, cex=1.81.8))
5555
Example: The biopsy data
coef(res, s=exp(-3.3))
10 x 1 sparse Matrix of class "dgCMatrix" 1(Intercept) -0.93594535V1 0.61727404V2 0.44331361V3 0.48275683V4 0.14338234V5 0.05694629V6 0.94445772V7 0.36293816V8 0.28012376V9 .
56
Delassoing
We know the lasso estimates are shrunk towards zero.
What if we want optimized prediction?
Ad hoc approach: delassoing
1. Use lasso for predictor selection2. Refit model with only selected predictors to obtain unbiased
estimates
57
predictors <- scale(predictors)res <- glm(y ~ predictors, family="binomial")tidy(res)
# A tibble: 10 x 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) -1.09 0.323 -3.38 0.000714 2 predictorsV1 1.51 0.401 3.77 0.000165 3 predictorsV2 -0.0192 0.641 -0.0300 0.976 4 predictorsV3 0.964 0.689 1.40 0.162 5 predictorsV4 0.947 0.354 2.68 0.00740 6 predictorsV5 0.215 0.348 0.617 0.537 7 predictorsV6 1.40 0.342 4.08 0.0000447 8 predictorsV7 1.10 0.420 2.61 0.00907 9 predictorsV8 0.650 0.345 1.89 0.0591 10 predictorsV9 0.927 0.570 1.63 0.104
58
Other penalties
�e Ridge regression estimates by minimizing the penalized LSfunction
this corresponds toadding a small value to the diagonal elements of thecorrelation matrix of the parameters. Introduces bias but reducesvariation. No selection but all predictors scaled down.
β
Zn(β) = ∥(Y − Xβ)∥22 + λN ∥β∥2
1
59
res <- glmnet(predictors, y, alpha = res <- glmnet(predictors, y, alpha = 00))plot(res, lwd=plot(res, lwd=33))
6060
Elastic net
Combine the sparsity from lasso and the �lexibility of ridge regression.
Zn(β) = ∥(Y − Xβ)∥22 + αλN ∥β∥1 + (1 − α)λN ∥β∥2
1
61
res <- glmnet(predictors, y, alpha = res <- glmnet(predictors, y, alpha = .3.3))plot(res, lwd=plot(res, lwd=33))
6262
Regularization and statistical inference
library("selectiveInference")res <- glmnet(predictors, y, family="binomial")lambda <- exp(-4.5)beta <- coef(res, x=predictors, y=y, s=lambda/683, exact=T)res <- fixedLassoInf(predictors, y, beta, lambda, family="binomial", alpha=0.05)
Warning in fixedLogitLassoInf(x, y, beta, lambda, alpha= alpha, type = type, : Solution beta does not satisfythe KKT conditions (to within specified tolerances)
63
res
Call:fixedLassoInf(x = predictors, y = y, beta = beta, lambda = lambda, family = "binomial", alpha = 0.05)
Testing results at lambda = 0.011, with alpha = 0.050
Var Coef Z-score P-value LowConfPt UpConfPt 1 1.509 3.770 0.596 -9.749 2.005 2 -0.019 -0.030 0.989 0.857 Inf 3 0.964 1.399 0.962 -Inf 0.570 4 0.947 2.680 0.808 -17.165 1.237 5 0.215 0.617 0.863 -10.079 0.635 6 1.396 4.084 0.000 0.743 3.114 7 1.095 2.611 0.749 -14.329 1.537 8 0.650 1.889 0.805 -12.461 1.008 9 0.927 1.629 0.705 -10.952 1.724 LowTailArea UpTailArea 64