-
Lecture 6: Variable Selection - LASSO
Wenbin Lu
Department of Statistics, North Carolina State University
Fall 2019
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 1 / 45
-
Outline
Modern Penalization Methods
Lq penalty, ridge
LASSO
computation, solution path, LARS algorithms
parameter tuning, cross validation
theoretical properties
ridge regression
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 2 / 45
-
Oracle Properties
Notations
data (X_i, Y_i), i = 1, · · · , n; n: sample size
d: the number of predictors, X_i ∈ R^d
the full index set S = {1, 2, · · · , d}
the selected index set given by a procedure is Â, and its size is |Â|
the linear coefficient vector β = (β_1, · · · , β_d)^T
the true linear coefficients β_0 = (β_10, · · · , β_d0)^T
the true model A_0 = {j : j = 1, · · · , d, β_j0 ≠ 0}
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 3 / 45
-
Oracle Properties
Motivations
Shrinkage methods perform variable selection as a continuous process:
more stable, not suffering from high variability.
one unified selection and estimation framework; easy to make inferences.
advances in theory: oracle properties, valid asymptotic inference.
In general, we standardize the inputs first before applying the methods.
X’s are standardized to have mean 0 and variance 1
y is centered to have mean 0
We often fit a model without an intercept.
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 4 / 45
-
Oracle Properties
Ideal Variable Selection
The best results we can expect from a selection procedure:
able to obtain the correct model structure
keep all the important variables in the model
filter all the noisy variables out of the model
has some optimal inferential properties
consistent, optimal convergence rate
asymptotic normality, most efficient
True model: y_i = β_00 + ∑_{j=1}^d x_ij β_j0 + ε_i
important index set: A_0 = {j : β_j0 ≠ 0, j = 1, · · · , d}
unimportant index set: A_0^c = {j : β_j0 = 0, j = 1, · · · , d}
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 5 / 45
-
Oracle Properties
Oracle Properties
An oracle performs as if the true model were known:
1 selection consistency
β̂_j ≠ 0 for j ∈ A_0; β̂_j = 0 for j ∈ A_0^c;
2 estimation consistency:
√n (β̂_{A_0} − β_{A_0}) →_d N(0, Σ_I),
where β_{A_0} = {β_j : j ∈ A_0} and Σ_I is the covariance matrix obtained when the true model is known.
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 6 / 45
-
Shrinkage Methods
Penalized Regularization Framework
One revolutionary piece of work is the development of the regularization framework:
min_β L(β; y, X) + λJ(β)
L(β; y, X) is the loss function
For OLS, L = ∑_{i=1}^n [y_i − β_0 − ∑_{j=1}^d β_j x_ij]^2
For likelihood-based methods, L is the negative log-likelihood
For Cox’s PH models, L is the negative log partial likelihood
In supervised learning, L is the hinge loss (SVM) or the exponential loss (AdaBoost)
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 7 / 45
-
Shrinkage Methods
Lq Shrinkage Penalty
J(β) is the penalty function.
J_q(|β|) = ‖β‖_q^q = ∑_{j=1}^d |β_j|^q, q ≥ 0.
J_0(|β|) = ∑_{j=1}^d I(β_j ≠ 0) (L0 penalty; Donoho and Johnstone 1988)
J_1(|β|) = ∑_{j=1}^d |β_j| (LASSO; Tibshirani 1996)
J_2(|β|) = ∑_{j=1}^d β_j^2 (ridge; Hoerl and Kennard 1970)
J_∞(|β|) = max_j |β_j| (supnorm penalty; Zhang et al. 2008)
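As a small illustration (not from the slides; the function name is hypothetical), an R helper that evaluates J_q(|β|) for a given q:
## Evaluate the Lq penalty of a coefficient vector
lq.penalty <- function(beta, q) {
  if (q == 0) return(sum(beta != 0))           # L0: number of nonzero coefficients
  if (is.infinite(q)) return(max(abs(beta)))   # supnorm penalty
  sum(abs(beta)^q)                             # general Lq penalty, q > 0
}
lq.penalty(c(0, 1.5, -2), q = 1)               # LASSO penalty: 3.5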
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 8 / 45
-
Shrinkage Methods
[Figure 3.13 from Hastie, Tibshirani & Friedman (2001), The Elements of Statistical Learning, Chapter 3: contours of constant value of ∑_j |β_j|^q for q = 4, 2, 1, 0.5, 0.1.]
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 9 / 45
-
Shrinkage Methods LASSO
The LASSO
Applying the absolute-value penalty on the coefficients:
min_β ∑_{i=1}^n (y_i − ∑_{j=1}^d x_ij β_j)^2 + λ ∑_{j=1}^d |β_j|
λ ≥ 0 is a tuning parameter
λ controls the amount of shrinkage; the larger λ, the greater the amount of shrinkage
What happens if λ → 0? (no penalty)
What happens if λ → ∞?
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 10 / 45
-
Shrinkage Methods LASSO
Equivalent formulation for LASSO
min_β ∑_{i=1}^n (y_i − ∑_{j=1}^d x_ij β_j)^2, subject to ∑_{j=1}^d |β_j| ≤ t.
Explicitly applies the magnitude constraint on the parameters
Making t sufficiently small will cause some coefficients to be exactly zero.
The lasso is a kind of continuous subset selection
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 11 / 45
-
Shrinkage Methods LASSO
[Figure 3.12 from Hastie, Tibshirani & Friedman (2001), Chapter 3: estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β_1| + |β_2| ≤ t and β_1^2 + β_2^2 ≤ t^2, respectively, while the red ellipses are the contours of the least squares error function.]
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 12 / 45
-
Shrinkage Methods LASSO
LASSO Solution: soft thresholding
For the orthogonal design case with X^T X = I, the LASSO problem separates: each β_j is obtained componentwise by solving
min_{β_j} (β_j − β̂_j^ls)^2 + λ|β_j|
The solution to the above problem is
β̂_j^lasso = sign(β̂_j^ls)(|β̂_j^ls| − λ/2)_+
         = β̂_j^ls − λ/2   if β̂_j^ls > λ/2
         = 0               if |β̂_j^ls| ≤ λ/2
         = β̂_j^ls + λ/2   if β̂_j^ls < −λ/2
shrinks big coefficients towards zero by the constant λ/2.
truncates small coefficients to zero exactly.
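A minimal R sketch (not from the slides; the function name is hypothetical) of this soft-thresholding operator, applied componentwise to the least squares coefficients:
## Soft-thresholding: sign(b)(|b| - lambda/2)_+ applied to each coefficient
soft.threshold <- function(b.ls, lambda) {
  sign(b.ls) * pmax(abs(b.ls) - lambda / 2, 0)
}
soft.threshold(c(3, -0.4, 1.2), lambda = 2)   # returns 2.0, 0.0, 0.2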
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 13 / 45
-
Shrinkage Methods LASSO
Comparison: Different Methods Under Orthogonal Design
When X is orthonormal, i.e., X^T X = I:
Best subset (of size M): β̂_j^ls if rank(|β̂_j^ls|) ≤ M
keeps the M largest coefficients (in absolute value)
Ridge regression: β̂_j^ls / (1 + λ)
does a proportional shrinkage
Lasso: sign(β̂_j^ls)(|β̂_j^ls| − λ/2)_+
translates each coefficient towards zero by a constant amount, then truncates it at zero
“soft thresholding”, used often in wavelet-based smoothing
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 14 / 45
-
Shrinkage Methods Parameter Tuning of LASSO
How to Choose t or λ
The tuning parameter t or λ should be adaptively chosen to minimize the MSE or PE. Common tuning methods include
the tuning error on a validation set, cross validation, leave-one-out cross validation (LOOCV), AIC, or BIC.
If t is chosen larger than t_0 = ∑_{j=1}^d |β̂_j^ls|, then β̂_j^lasso = β̂_j^ls.
If t = t_0/2, the least squares coefficients are shrunk by about 50% on average.
In practice, we sometimes standardize t by s = t/t_0 in [0, 1], and choose the best value of s from [0, 1].
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 15 / 45
-
Shrinkage Methods Parameter Tuning of LASSO
K-fold Cross Validation
The simplest and most popular way to estimate the tuning error.
1 Randomly split the data into K roughly equal parts
2 For each k = 1, ..., K
leave the kth portion out, and fit the model using the other K − 1 parts. Denote the solution by β̂^(−k)
calculate the prediction errors of β̂^(−k) on the left-out kth portion
3 Average the prediction errors
Define the index function (allocating fold memberships)
κ : {1, ..., n} → {1, ..., K}.
The K-fold cross-validation estimate of prediction error (PE) is
CV = (1/n) ∑_{i=1}^n (y_i − x_i^T β̂^(−κ(i)))^2
Typical choices: K = 5 or 10
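A minimal R sketch of these steps (not from the slides): it assumes a predictor matrix x and response y are available, and uses ordinary least squares as the fitting procedure purely to illustrate the CV mechanics.
set.seed(1)
K <- 5
n <- length(y)
kappa <- sample(rep(1:K, length.out = n))      # random fold memberships kappa(i)
pe <- numeric(n)
for (k in 1:K) {
  fit <- lm(y[kappa != k] ~ x[kappa != k, ])   # fit on the other K - 1 parts
  beta.k <- coef(fit)
  xk <- cbind(1, x[kappa == k, ])              # add intercept column for prediction
  pe[kappa == k] <- (y[kappa == k] - xk %*% beta.k)^2   # squared prediction errors
}
mean(pe)                                       # K-fold CV estimate of PE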
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 16 / 45
-
Shrinkage Methods Parameter Tuning of LASSO
Choose λ
Choose a sequence of λ values.
For each λ, fit the LASSO and denote the solution by β̂_λ^lasso.
Compute the CV(λ) curve as
CV(λ) = (1/n) ∑_{i=1}^n (y_i − x_i^T β̂_λ^(−κ(i)))^2,
which provides an estimate of the test error curve.
Find the best parameter λ* which minimizes CV(λ).
Fit the final LASSO model with λ*. The final solution is denoted as β̂_{λ*}^lasso.
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 17 / 45
-
Shrinkage Methods Parameter Tuning of LASSO
One-standard Error for CV
One-standard error rule:
choose the most parsimonious model with error no more than one standard error above the best error.
used often in model selection (for a smaller model size).
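A small R sketch of this rule (not from the slides; the vector names cv.mean, cv.se, and lambda.grid are hypothetical placeholders for the CV error means, their standard errors, and the λ grid):
## One-standard-error rule: the most parsimonious (largest-lambda) model whose
## CV error is within one standard error of the minimum
best <- which.min(cv.mean)
threshold <- cv.mean[best] + cv.se[best]
lambda.1se <- max(lambda.grid[cv.mean <= threshold])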
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 19 / 45
-
Shrinkage Methods Computation of LASSO
Computation of LASSO
Quadratic Programming
Shooting Algorithms (Fu 1998; Zhang and Lu, 2007)
LARS solution path (Efron et al. 2004)
the most efficient algorithm for LASSO
designed for standard linear models
R package “lars”
Coordinate Descent Algorithm (CDA; Friedman et al. 2010)
designed for GLMs
suitable for ultra-high dimensional problems
R package “glmnet”
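A minimal usage sketch (not from the slides) of fitting the lasso with glmnet and tuning λ by cross-validation; it assumes a predictor matrix x and response y:
library(glmnet)
fit <- glmnet(x, y, alpha = 1)                 # alpha = 1 gives the lasso penalty
cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
coef(cvfit, s = "lambda.min")                  # coefficients at the CV-minimizing lambda
coef(cvfit, s = "lambda.1se")                  # one-standard-error choice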
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 21 / 45
-
Shrinkage Methods Computation of LASSO
R package: lars
Provides the entire sequence of coefficients and fits, starting from zero, to the least squares fit.
library(lars)
lars(x, y, type = c("lasso", "lar", "forward.stagewise",
"stepwise"), trace = FALSE, normalize = TRUE,
intercept = TRUE, Gram, eps = .Machine$double.eps,
max.steps, use.Gram = TRUE)
A “lars” object is returned, for which print, plot, predict, coef and summary methods exist.
Details in Efron et al. (2004).
With the “lasso” option, it computes the complete lasso solution simultaneously for ALL values of the shrinkage parameter (λ, t, or s) at the same computational cost as a least squares fit.
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 22 / 45
-
Shrinkage Methods Computation of LASSO
Details About LARS function
x : matrix of predictors
y : response
type: One of “lasso”, “lar”, “forward.stagewise” or “stepwise”, all variants of Lasso. Default is “lasso”.
trace: if TRUE, lars prints out its progress
normalize: if TRUE, each variable is standardized to have unit L2 norm, otherwise it is left alone. Default is TRUE.
intercept: if TRUE, an intercept is included (not penalized),otherwise no intercept is included. Default is TRUE.
Gram: XTX matrix; useful for repeated runs (bootstrap) where alarge XTX stays the same.
eps: An effective zero
max.steps: the maximum number of steps taken. Default is 8·min(d, n − intercept), where n is the sample size.
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 23 / 45
-
Shrinkage Methods Computation of LASSO
[Figure 3.9 from Hastie, Tibshirani & Friedman (2001), Chapter 3: profiles of lasso coefficients for the prostate cancer example (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45), as the tuning parameter t is varied. Coefficients are plotted versus the shrinkage factor s = t / ∑_j |β̂_j|. A vertical line is drawn at s = 0.5, the value chosen by cross-validation. The lasso profiles hit zero, while those for ridge do not.]
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 24 / 45
-
Shrinkage Methods Computation of LASSO
Prediction for LARS
predict(object, newx, s, type = c("fit", "coefficients"),
mode = c("step", "fraction", "norm", "lambda"), ...)
coef(object, ...)
object: A fitted lars object
newx: If type=“fit”, then newx should be the x values at which the fit is required. If type=“coefficients”, then newx can be omitted.
s: a value, or vector of values, indexing the path. Its meaning depends on the mode. By default (mode=“step”), s takes on values between 0 and p.
type: If type=“fit”, predict returns the fitted values. If type=“coefficients”, predict returns the coefficients.
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 25 / 45
-
Shrinkage Methods Computation of LASSO
“Mode” for Prediction in LARS
If mode=“step”, then s indexes the lars step number, and the coefficients corresponding to step s are returned.
If mode=“fraction”, then s should be a number between 0 and 1, and it refers to the ratio of the L1 norm of the coefficient vector, relative to the norm at the full LS solution.
If mode=“norm”, then s refers to the L1 norm of the coefficient vector.
If mode=“lambda”, then s is the lasso regularization parameter λ.
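A short usage sketch (not from the slides), assuming object is a fitted lars object, as in the diabetes example later in the lecture, and new.x is a matrix of new predictor values:
## Coefficients and fitted values halfway along the path (s = 0.5 in fraction mode)
predict(object, s = 0.5, type = "coefficients", mode = "fraction")
predict(object, newx = new.x, s = 0.5, type = "fit", mode = "fraction")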
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 26 / 45
-
Shrinkage Methods Computation of LASSO
Example: Diabetes Data
library(lars)
data(diabetes)
attach(diabetes)
## fit LASSO
object <- lars(x, y, type = "lasso")   # fit the entire lasso solution path
plot(object)                           # coefficient profiles along the path
-
Shrinkage Methods Computation of LASSO
R Code to Compute K-CV Error Curve for Lars
cv.lars(x, y, K = 10, index, type = c("lasso",
"lar", "forward.stagewise", "stepwise"),
mode=c("fraction", "step"), ...)
x : Input to lars
y : Input to lars
K : Number of folds
index: Abscissa values at which the CV curve should be computed. If mode=“fraction”, this is the fraction of the saturated |β|. The default is index=seq(from = 0, to = 1, length = 100). If mode=“step”, this is the number of steps in the lars procedure.
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 28 / 45
-
Shrinkage Methods Computation of LASSO
R Code: CV.LARS
attach(diabetes)
lasso.s <- cv.lars(x, y, K = 10, type = "lasso", mode = "fraction")   # 10-fold CV error curve
-
Shrinkage Methods Consistency Properties of LASSO
Consistency - Definition
Let the true parameters be β_0 and denote their estimates by β̂_n. The subscript ‘n’ indicates the dependence on the sample size n.
Estimation consistency
β̂_n − β_0 →_p 0, as n → ∞.
Model selection consistency
P({j : β̂_j ≠ 0} = {j : β_0j ≠ 0}) → 1, as n → ∞.
Sign consistency
P(β̂_n =_s β_0) → 1, as n → ∞,
where β̂_n =_s β_0 ⇔ sign(β̂_n) = sign(β_0).
Sign consistency is stronger than model selection consistency.
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 30 / 45
-
Shrinkage Methods Consistency Properties of LASSO
Consistency of LASSO Estimator
Knight and Fu (2000) have shown that
Estimation Consistency: The LASSO solution is estimation consistent for fixed p. In other words,
β̂^lasso(λ_n) →_p β_0, provided λ_n = o(n).
It is root-n consistent and asymptotically normal.
Model Selection Property: For λ_n ∝ n^{1/2}, as n → ∞, there is a non-vanishing positive probability for the lasso to select the true model.
The L1 penalization approach is called basis pursuit in signal processing (Chen, Donoho, and Saunders, 2001).
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 31 / 45
-
Shrinkage Methods Consistency Properties of LASSO
Model Selection Consistency of LASSO Estimator
Zhao and Yu (2006). On Model Selection Consistency of Lasso. JMLR, 7,2541–2563.
If an irrelevant predictor is highly correlated with the predictors in the true model, Lasso may not be able to distinguish it from the true predictors with any amount of data and any amount of regularization.
Under the Irrepresentable Condition (IC), the LASSO is model selection consistent in both fixed and large p settings.
The IC is represented as
|[X_n(1)^T X_n(1)]^{-1} X_n(1)^T X_n(2)| < 1 − η,
i.e., the regression coefficients of the irrelevant covariates X_n(2) on the relevant covariates X_n(1) are constrained. It is almost necessary and sufficient for the Lasso to select the true model.
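A small R sketch (not from the paper or the slides; the function name is hypothetical) that checks the displayed condition numerically, interpreting |·| elementwise, for a design split into relevant columns X1 and irrelevant columns X2:
## Check |(X1'X1)^{-1} X1'X2| < 1 - eta elementwise
ic.check <- function(X1, X2, eta = 0.05) {
  B <- solve(crossprod(X1), crossprod(X1, X2))   # regression coefficients of X2 on X1
  max(abs(B)) < 1 - eta
}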
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 32 / 45
-
Shrinkage Methods Consistency Properties of LASSO
Special Cases for LASSO to be Selection Consistent
Assume the true model size is |A_0| = q. The underlying model must satisfy a nontrivial condition for the lasso variable selection to be consistent (Zou, 2006).
The lasso is always consistent in model selection under the following special cases:
when p = 2;
when the design is orthogonal;
when the covariates have bounded constant correlation, 0 < r_n < 1/(1 + cq) for some c;
when the design has power-decay correlation with corr(X_i, X_j) = ρ_n^{|i−j|}, where |ρ_n| ≤ c < 1.
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 33 / 45
-
Ridge Penalty
Ridge Regression (q = 2)
Applying the squared penalty on the coefficients
min_β ∑_{i=1}^n (y_i − ∑_{j=1}^d x_ij β_j)^2 + λ ∑_{j=1}^d β_j^2
λ ≥ 0 is a tuning parameter
The solution is
β̂^ridge = (X^T X + λI_d)^{-1} X^T y.
Note the similarity to the ordinary least squares solution, but with the addition of a “ridge” down the diagonal.
As λ → 0, β̂^ridge → β̂^ols. As λ → ∞, β̂^ridge → 0.
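A minimal R sketch (not from the slides; the function name is hypothetical) of the closed-form ridge solution, assuming a centered and standardized design matrix x and centered response y:
## Closed-form ridge solution for a given lambda
ridge.solve <- function(x, y, lambda) {
  d <- ncol(x)
  solve(crossprod(x) + lambda * diag(d), crossprod(x, y))
}
## lambda = 0 recovers OLS (when x'x is invertible); large lambda shrinks all coefficients toward 0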
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 34 / 45
-
Ridge Penalty
Ridge Regression Solution
In the special case of an orthonormal design matrix,
β̂_j^ridge = β̂_j^ols / (1 + λ).
This illustrates the essential “shrinkage” feature of ridge regression
Applying the ridge regression penalty has the effect of shrinking theestimates toward zero
The penalty introduces bias but reduces the variance of the estimate
Ridge estimator does not threshold, since the shrinkage is smooth(proportional to the original coefficient).
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 35 / 45
-
Ridge Penalty
Invertibility
If the design matrix X is not of full rank, then
X^T X is not invertible;
for the OLS estimator, there is no unique solution β̂^ols.
This problem does not occur with ridge regression, however.
Theorem: For any design matrix X and any λ > 0, the quantity X^T X + λI_d is always invertible; thus, there is always a unique solution β̂^ridge.
For the ridge estimator, its effective degrees of freedom (df) are given by
df(β̂^ridge) = Tr{X(X^T X + λI_d)^{-1} X^T}.
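A small R sketch of this trace formula (not from the slides; the function name is hypothetical):
## Effective degrees of freedom of the ridge fit: trace of the ridge hat matrix
ridge.df <- function(x, lambda) {
  W <- solve(crossprod(x) + lambda * diag(ncol(x)))
  sum(diag(x %*% W %*% t(x)))
}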
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 36 / 45
-
Ridge Penalty
Bias and Variance
The variance of the ridge regression estimate is
Var(β̂^ridge) = σ^2 W X^T X W,
where W = (X^T X + λI_d)^{-1}.
The bias of the ridge regression estimate is
Bias(β̂^ridge) = −λWβ.
It can be shown that
the total variance ∑_{j=1}^d Var(β̂_j^ridge) is monotone decreasing in λ,
while the total squared bias ∑_{j=1}^d Bias^2(β̂_j^ridge) is monotone increasing in λ.
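A short R sketch evaluating these two quantities at a given λ using the displayed formulas (not from the slides; for illustration, the noise variance sigma2 and the true coefficient vector beta are assumed known):
## Total variance and total squared bias of the ridge estimator at a given lambda
ridge.bias.var <- function(x, beta, sigma2, lambda) {
  W <- solve(crossprod(x) + lambda * diag(ncol(x)))
  total.var   <- sigma2 * sum(diag(W %*% crossprod(x) %*% W))   # sum of Var(beta_j^ridge)
  total.bias2 <- sum((lambda * W %*% beta)^2)                   # sum of Bias^2(beta_j^ridge)
  c(variance = total.var, bias2 = total.bias2)
}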
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 37 / 45
-
Ridge Penalty
Mean Squared Error (MSE) of Ridge
Existence Theorem: There always exists a λ > 0 such that the MSE of β̂^ridge is less than the MSE
of β̂^ols.
This is a rather surprising result with somewhat radical implications: even if the model we fit is exactly correct and follows the exact distribution we specify, we can always obtain a better (in terms of MSE) estimator by shrinking towards zero.
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 38 / 45
-
Ridge Penalty
Bayesian Interpretation of Ridge Estimation
Penalized regression can be interpreted in a Bayesian context:
Suppose y = Xβ + ε, where ε ∼ N(0, σ^2 I_n).
Given the prior β ∼ N(0, τ^2 I_d), the posterior mean of β given the data is
(X^T X + (σ^2/τ^2) I_d)^{-1} X^T y,
i.e., the ridge estimator with λ = σ^2/τ^2.
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 39 / 45
-
Ridge Penalty
R package: lm.ridge
library(MASS)
lm.ridge(formula, data, subset, na.action, lambda)
formula: a formula expression as for regression models, of the form response ∼ predictors
data: an optional data frame in which to interpret the variables occurring in formula.
subset: an expression saying which subset of the rows of the data should be used in the fit. All observations are included by default.
na.action: a function to filter missing data.
lambda: A scalar or vector of ridge constants.
If an intercept is present, its coefficient is not penalized.
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 40 / 45
-
Ridge Penalty
Output for lm.ridge
Returns a list with components
coef: matrix of coefficients, one row for each value of lambda. Note that these are not on the original scale and are for use by the coef method.
scales: scalings used on the X matrix.
Inter: was intercept included?
lambda: vector of lambda values
ym: mean of y
xm: column means of x matrix
GCV: vector of GCV values
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 41 / 45
-
Ridge Penalty
Example for Ridge Regression
library(MASS)
longley
names(longley)[1] <- "y"                       # rename the response column
ridge.fit <- lm.ridge(y ~ ., data = longley, lambda = seq(0, 0.1, by = 0.001))
plot(ridge.fit)                                # coefficient paths over the lambda grid
-
Ridge Penalty
[Figure 3.7 from Hastie, Tibshirani & Friedman (2001), Chapter 3: profiles of ridge coefficients for the prostate cancer example (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45), as the tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 4.16, the value chosen by cross-validation.]
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 43 / 45
-
Ridge Penalty Ridge vs LASSO
Lasso vs Ridge
Ridge regression:
is a continuous shrinkage method; it achieves better prediction performance than OLS through a bias-variance trade-off.
cannot produce a parsimonious model, for it always keeps all thepredictors in the model.
The lasso
does both continuous shrinkage and automatic variable selectionsimultaneously.
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 44 / 45
-
Ridge Penalty Ridge vs LASSO
Empirical Comparison
Tibshirani (1996) and Fu (1998) compared the prediction performance of the lasso, ridge and bridge regression (Frank and Friedman, 1993) and found
None of them uniformly dominates the other two.
However, as variable selection becomes increasingly important in modern data analysis, the lasso is much more appealing owing to its sparse representation.
Wenbin Lu (NCSU) Data Mining and Machine Learning Fall 2019 45 / 45