Lecture 4. Linear Models for Regression
Outline
Linear Regression
Least Square Solution
Subset Least Squares: subset selection / forward / backward
Penalized Least Squares: Ridge Regression, LASSO, Elastic Net (LASSO + Ridge)
Linear Methods for Regression
Input (FEATURES) vector, p-dimensional: X = (X1, X2, …, Xp)
Real-valued OUTPUT: Y
Joint distribution of (Y, X)
Regression function: E(Y | X) = f(X)
Training data: (x1, y1), (x2, y2), …, (xN, yN), for estimation of the input-output relation f.
Linear Model
f(x): Regression function or a good approximation
LINEAR in the unknown parameters (weights, coefficients):
f(X) = β0 + Σ_{j=1}^p βj Xj, with parameters β0, β1, …, βp
Features
Quantitative inputs: any arbitrary but known function of measured attributes
Transformations of quantitative attributes: g(x), e.g., log, square, square-root etc.
Basis expansions: e.g., a polynomial approximation of f as a function of X1 (Taylor Series expansion with unknown coefficients)
X2 = X1², X3 = X1³, …, Xk = X1^k
Features (Cont.)
Qualitative (categorical) input G. Dummy codes: for an attribute with k categories, one may use k indicator variables Xj, j = 1, 2, …, k, indicating the category (level) taken. Together, this collection of inputs represents the effect of G through
Σ_{j=1}^k βj Xj
This is a set of level-dependent constants, since only one of the Xj equals one and the others are zero.
Features (cont.)
Interactions: 2nd- or higher-order interactions of some features, e.g., X3 = X1·X2, X4 = X1·X2·X3
Feature vector for the i-th case in the training set (example): xi = (xi1, xi2, …, xip)^T
Generalized Linear Models: Basis Expansion
Wide variety of flexible models
Model for f is a linear expansion of basis functions
Dictionary: Prescribed basis functions
f_θ(x) = Σ_{k=1}^K θk hk(x)
Other Basis Functions
Polynomial basis of degree s (smooth functions C^s)
Fourier series (band-limited functions, a compact subspace of C^∞)
Splines: piecewise polynomials of degree K between the knots, joined with continuity of order K−1 at the knots (Sobolev spaces)
Wavelets (Besov spaces)
Radial basis functions: symmetric p-dimensional kernels located at particular centroids, f(|x − y|)
Gaussian kernel at each centroid
And more …
-- Curse of Dimensionality: p could be equal to or much larger than n.
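As an illustration (not part of the original slides), the sketch below fits a polynomial basis and a B-spline basis with lm() on simulated data; the data, degrees of freedom, and variable names are hypothetical choices.
## Basis expansions with lm() on simulated data (illustrative sketch)
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)           # some unknown smooth f plus noise
fit_poly <- lm(y ~ poly(x, degree = 5))      # polynomial basis of degree 5
library(splines)
fit_spline <- lm(y ~ bs(x, df = 8))          # cubic B-spline basis (piecewise polynomials)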
Method of Least Squares
Find the coefficients β = (β0, β1, …, βp)^T that minimize the Residual Sum of Squares
RSS(β) = Σ_{i=1}^N (yi − f(xi))² = Σ_{i=1}^N (yi − β0 − Σ_{j=1}^p xij βj)²
RSS is the empirical risk over the training set. It does not by itself assure predictive performance over all inputs of interest.
Min RSS Criterion
Statistically reasonable provided the examples in the training set are a large number of independent random draws from the population of inputs for which prediction is desired.
Given the inputs (x1, x2, …, xN), the outputs (y1, y2, …, yN) are conditionally independent.
In principle, predictive performance over the set of future input vectors should be examined.
Gaussian noise: the least squares method is equivalent to maximum likelihood.
Minimize RSS(β) over β in R^(p+1); RSS is a quadratic function of β.
Optimal solution: take the derivatives with respect to the elements of β and set them equal to zero.
Optimal Solution
The Hessian (2nd derivative) of the criterion function is X^T X.
The optimal solution satisfies the normal equations
X^T (Y − Xβ) = 0, or (X^T X) β = X^T Y
For a unique solution, the matrix X^T X must have full rank.
β̂ = (X^T X)^(-1) X^T Y
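A minimal R sketch of the normal-equations solution on simulated data (data and dimensions are hypothetical); it should agree with lm().
set.seed(2)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))          # design matrix with an intercept column
beta_true <- c(1, 2, -1, 0.5)
y <- drop(X %*% beta_true) + rnorm(n)
beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves (X'X) beta = X'y
y_fitted <- X %*% beta_hat                         # HY: projection of y onto the column space of X
coef(lm(y ~ X - 1))                                # same estimates from lm()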
Projection
When the matrix X^T X is full rank, the estimated response for the training set is:
H: Projection (Hat) matrix
HY: Orthogonal Projection of Y on the space spanned by the columns of X
Note: the projection is linear in Y
Ŷ = X (X^T X)^(-1) X^T Y = HY, where H = X (X^T X)^(-1) X^T
Geometrical Insight
Simple Univariate Regression
One variable, no intercept: y = xβ + ε
LS estimate: β̂ = ⟨x, y⟩ / ⟨x, x⟩ = (Σ_{i=1}^N xi yi) / (Σ_{i=1}^N xi²)
Inner product: ⟨x, y⟩ = Σ_{i=1}^N xi yi = x^T y
⟨x, y⟩, normalized by the lengths of x and y, = cosine of the angle between the vectors x and y, a measure of similarity between y and x
Residuals r = y − x β̂: the projection of y on the normal space (orthogonal complement of x)
Definition: "Regress b on a"
Simple regression of response b on input a, with no intercept
Estimate: γ̂ = ⟨a, b⟩ / ⟨a, a⟩
Residual b − γ̂ a: "b adjusted for a", "b orthogonalized with respect to a"
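A minimal sketch of the one-variable, no-intercept estimate and of "regress b on a", on simulated data (all names hypothetical).
set.seed(3)
x <- rnorm(50); y <- 2 * x + rnorm(50)
beta_hat <- sum(x * y) / sum(x * x)   # <x, y> / <x, x>
r <- y - beta_hat * x                 # residual, orthogonal to x
sum(r * x)                            # approximately 0
## "regress b on a": gamma_hat <- sum(a * b) / sum(a * a); residual is b - gamma_hat * a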
Multiple Regression
Multiple regression: p > 1
LS estimates differ from the simple univariate regression estimates, unless the columns of the input matrix X are orthogonal:
If ⟨xi, xj⟩ = 0 for all i ≠ j, then β̂j = ⟨xj, y⟩ / ⟨xj, xj⟩. These estimates are uncorrelated, and Var(β̂j) = σ² / ⟨xj, xj⟩.
Orthogonal inputs occur sometimes in balanced, designed experiments (experimental design).
Observational studies will almost never have orthogonal inputs.
Must "orthogonalize" them in order to have a similar interpretation: use the Gram-Schmidt procedure to obtain an orthogonal basis for multiple regression.
Multiple Regression Estimates: Sequence of Simple Regressions
Regression by Successive Orthogonalization:
Initialize z0 = x0 = 1.
For j = 1, 2, …, p: regress xj on z0, z1, …, z_{j−1} to produce coefficients γ̂_{lj} = ⟨zl, xj⟩ / ⟨zl, zl⟩, l = 0, …, j−1, and the residual vector zj = xj − Σ_{k=0}^{j−1} γ̂_{kj} zk.
Regress y on the residual vector zp for the estimate β̂p = ⟨zp, y⟩ / ⟨zp, zp⟩.
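A sketch of the successive-orthogonalization recipe in R on simulated data (dimensions and names are hypothetical); the coefficient from the last residual matches the usual multiple-regression coefficient.
set.seed(4)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(1, 2, 3)) + rnorm(n)
Xa <- cbind(1, X)                                   # x_0 = 1, x_1, ..., x_p
Z <- matrix(0, n, p + 1)
Z[, 1] <- 1                                         # z_0 = x_0 = 1
for (j in 2:(p + 1)) {
  zj <- Xa[, j]
  for (l in 1:(j - 1))                              # remove projections on earlier z's
    zj <- zj - sum(Z[, l] * Xa[, j]) / sum(Z[, l]^2) * Z[, l]
  Z[, j] <- zj
}
sum(Z[, p + 1] * y) / sum(Z[, p + 1]^2)             # beta_hat_p = <z_p, y> / <z_p, z_p>
coef(lm(y ~ X))[p + 1]                              # matches the usual LS coefficient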
Instead of using x1 and x2, take x1 and z as features
Multiple Regression = Gram-Schmidt Orthogonalization
Procedure: the vector zp is the residual of the multiple regression of xp on all the other inputs x0, x1, …, x_{p−1}.
Successive z's in the above algorithm are orthogonal and form an orthogonal basis for the column space of X.
The least squares projection onto this subspace is the usual fitted vector ŷ.
By re-arranging the order of these variables, any input xj can be labeled as the p-th variable.
If xp is highly correlated with the other variables, the residual vector zp is quite small, and the coefficient β̂p has high variance.
Statistical Properties of LS
Model: Y = Xβ + ε
Uncorrelated noise: mean zero, variance σ²
Then Var(β̂) = σ² (X^T X)^(-1)
Noise estimation: σ̂² = (1 / (N − p − 1)) Σ_{i=1}^N (yi − ŷi)²
Model d.f. = p + 1 (dimension of the model space)
To draw inferences on the parameter estimates, we need assumptions on the noise:
If we assume Gaussian noise, ε ~ N(0, σ² I), then
β̂ ~ N(β, σ² (X^T X)^(-1)) and (N − p − 1) σ̂² ~ σ² χ²_{N−p−1}
Gauss-Markov Theorem
(The Gauss-Markov Theorem) If we have any linear estimator c^T y that is unbiased for a^T β, that is, E(c^T y) = a^T β, then
Var(a^T β̂) ≤ Var(c^T y)
It says that, for inputs in the row space of X, the LS estimate has minimum variance among all linear unbiased estimates.
Bias-Variance Tradeoff
Mean squared error of an estimator = variance + bias²
The least squares estimator achieves the minimal variance among all linear unbiased estimators.
There are biased estimators that further reduce the variance: Stein's estimator, shrinkage/thresholding (LASSO, etc.)
The more complicated a model is, the more variance but the less bias; a trade-off is needed.
Hypothesis Test
• Single parameter test: H0: βj = 0, t-statistic
zj = β̂j / (σ̂ √vj) ~ t_{N−p−1}
where vj is the j-th diagonal element of V = (X^T X)^(-1)
Confidence interval: β̂j ± z_{1−α} vj^{1/2} σ̂, e.g. z_{1−0.025} = 1.96
• Group of parameters: H0: β_Ω = 0, where Ω is a group of p1 − p0 coefficients; F-statistic for nested models
F = [(RSS0 − RSS1) / (p1 − p0)] / [RSS1 / (N − p1 − 1)]
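An illustrative R sketch (simulated data, hypothetical variables): summary() gives the t-statistics, confint() the confidence intervals, and anova() the F-test for nested models.
set.seed(5)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)            # x3 has no true effect
fit_full <- lm(y ~ x1 + x2 + x3)
summary(fit_full)                          # t-statistics for each H0: beta_j = 0
confint(fit_full, level = 0.95)            # beta_hat_j +/- quantile * se
fit_small <- lm(y ~ x1)                    # nested model
anova(fit_small, fit_full)                 # F-test comparing the nested models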
Example
R command: lm(Y ~ x1 + x2 + … +xp)
Rank Deficiency
X rank deficient: the normal equations have infinitely many solutions β̃.
The hat matrix H and the projection Ŷ = X β̃ = HY are unique.
For an input in the row space of X, the LS estimate of the response is unique.
For an input not in the row space of X, the estimate may change with the solution β̃ used. How to generalize to inputs outside the training set?
Penalized methods (!)
Reasons for Alternatives to LS Estimates
Prediction accuracy: LS estimates have low bias but high variance when inputs are highly correlated
Larger ESPE
Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. A small bias in the estimates may yield a large decrease in variance.
The bias/variance tradeoff may provide better predictive ability.
Better interpretation: with a large number of input variables, we would like to determine a smaller subset that exhibits the strongest effects.
Many tools to achieve these objectives
Subset selection
Penalized regression: constrained optimization
Best Subset Selection Method
Algorithm: leaps & bounds. Find the best subset, corresponding to the smallest RSS, for each size k ∈ {0, 1, 2, …, p}.
For each fixed size k, can also find a specified number of subsets close to the best. For each fixed subset, obtain the LS estimates. Feasible for p ≈ 40.
Choice of the optimal k is based on model selection criteria to be discussed later.
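A minimal sketch using the leaps package (assumed installed; data simulated and hypothetical) to find the best subset of each size.
library(leaps)
set.seed(6)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("x", 1:p)
y <- 2 * X[, 1] - X[, 3] + rnorm(n)
fit <- regsubsets(x = X, y = y, nvmax = p)   # leaps-and-bounds best subsets of sizes 1..p
summary(fit)$which                           # which variables enter the best subset of each size
summary(fit)$rss                             # RSS of the best subset of each size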
Other Subset Selection Procedures
For larger p: classical forward selection (step-up),
backward elimination (step-down),
hybrid forward-backward (step-wise) methods. Given a model, these methods only provide local controls for variable selection or deletion:
Which current variable is least effective (candidate for deletion)
Which variable not in the model is most effective (candidate for inclusion)
Do not attempt to find the best subset of a given size
Not too popular in current practice
Forward Stagewise Selection
(Incremental) forward stagewise: standardize the input variables. A sketch of the update steps follows below.
Note:
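A minimal sketch of the incremental forward stagewise updates on simulated, standardized data (the step size, number of steps, and data are hypothetical choices).
set.seed(7)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))        # standardized inputs
y <- drop(X %*% c(3, 0, -2, 0, 0)) + rnorm(n)
y <- y - mean(y)
beta <- rep(0, p); r <- y; eps <- 0.01
for (step in 1:2000) {
  cors <- drop(crossprod(X, r))               # inner product of each input with the current residual
  j <- which.max(abs(cors))                   # most correlated predictor
  beta[j] <- beta[j] + eps * sign(cors[j])    # tiny update in that direction
  r <- r - eps * sign(cors[j]) * X[, j]       # update the residual
}
round(beta, 2)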
Penalized Regression
• Instead of directly minimizing the residual sum of squares, penalized regression usually takes the form
min RSS(f) + λ J(f)
where J(f) is the penalization term, which usually penalizes the smoothness or complexity of the function f.
λ is chosen by cross-validation.
Model Assessment and Selection
If we are in a data-rich situation, split the data into three parts: training, validation, and testing.
Train Validation Test
See chapter 7.1 for details
Cross Validation
When the sample size is not sufficiently large, cross-validation is a way to estimate the out-of-sample prediction error (or classification rate).
Randomly split the available data into Training and Test sets to get error1.
Split many times to get error2, …, errorm, then average over all the errors to get an estimate.
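A minimal K-fold cross-validation sketch for lm() on simulated data (K, the data, and variable names are hypothetical).
set.seed(8)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("x", 1:p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)
dat <- data.frame(y, X)
K <- 10
fold <- sample(rep(1:K, length.out = n))      # random fold assignment
cv_err <- sapply(1:K, function(k) {
  fit <- lm(y ~ ., data = dat[fold != k, ])   # train on K-1 folds
  pred <- predict(fit, newdata = dat[fold == k, ])
  mean((dat$y[fold == k] - pred)^2)           # squared error on the held-out fold
})
mean(cv_err)                                  # CV estimate of the out-of-sample error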
Ridge Regression (Tikhonov Regularization)
Ridge regression shrinks coefficients by imposing a penalty on their size
Minimize a penalized RSS:
β̂^ridge = argmin_β { RSS(β) + λ Σ_{j=1}^p βj² }, λ ≥ 0
Here λ is the complexity parameter that controls the amount of shrinkage
The larger its value, the greater the amount of shrinkage
Coefficients are shrunk towards zero
Choice of the penalty parameter is based on cross-validation
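A minimal sketch of ridge regression via the closed form, with standardized inputs and a centered response (simulated data; λ is an arbitrary illustrative value). glmnet with alpha = 0, if installed, gives the ridge path with a cross-validated λ.
set.seed(9)
n <- 100; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))        # standardize inputs before ridge
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)
y <- y - mean(y)                              # center y; intercept handled by the mean of the response
lambda <- 5
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))   # (X'X + lambda I)^{-1} X'y
round(drop(beta_ridge), 3)
## library(glmnet); cv.glmnet(X, y, alpha = 0)  # ridge path with CV choice of lambda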
Prostate Cancer Example
Ridge Regression (cont)
Equivalent problem: minimize RSS subject to Σ_{j=1}^p βj² ≤ s
Lagrange multiplier: 1-1 correspondence between s and λ
With many correlated variables, LS estimates can become unstable and exhibit high variance and high correlations
A wildly large positive coefficient on one variable can be cancelled by a large negative coefficient on another
Imposing a size constraint on the coefficients prevents this phenomenon from occurring
Ridge solutions are not invariant under scaling of the inputs
Normally standardize the inputs before solving the optimization problem
Since the penalty term does not include the bias term, estimate the intercept by the mean of the response y
Ridge Regression (cont)
The ridge criterion: RSS(λ) = (y − Xβ)^T (y − Xβ) + λ β^T β
Solution: β̂^ridge = (X^T X + λ I)^(-1) X^T Y
Shrinkage: for orthogonal inputs, ridge is a scaled version of the LS estimates, β̂^ridge = γ β̂, 0 ≤ γ ≤ 1
Ridge is the mean or mode of the posterior distribution of β under a normal prior
Centered input matrix X; SVD of X: X = U D V^T
U and V are orthogonal matrices
Columns of U span the column space of X
Columns of V span the row space of X
D: a diagonal matrix of singular values d1 ≥ d2 ≥ … ≥ dp ≥ 0
Eigen decomposition of X^T X = V D² V^T
The eigenvectors vj: principal component directions of X (Karhunen-Loeve directions)
Ridge Regression and Principal Components
First PC direction v1: among all normalized linear combinations of the columns of X, the derived variable z1 = X v1 = u1 d1 has the largest sample variance; z1 is the first principal component of X.
Subsequent PCs zj have maximal variance subject to being orthogonal to the earlier ones. The last PC has minimal variance.
Effective Degrees of Freedom: for ridge, df(λ) = Σ_{j=1}^p dj² / (dj² + λ).
Ridge Regression (Summary)
Ridge regression penalizes the complexity of a linear model by the sum of squares of the coefficients.
It is equivalent to minimizing RSS subject to the constraint Σ βj² ≤ s.
The matrix (X^T X + λ I) is always invertible (for λ > 0).
The penalization parameter λ controls how simple "you" want the model to be.
Prostate Cancer Example
Ridge Regression (Summary)
Solutions are not sparse in the coefficient space:
the βj's are not zero almost all the time.
The computational complexity is O(p³) for inverting the matrix X^T X + λ I.
Prostate Cancer Example
Least Absolute Shrinkage and Selection Operator (LASSO)
Penalized RSS with an L1-norm penalty, or minimization subject to the constraint Σ_{j=1}^p |βj| ≤ t
Shrinks like ridge with its L2-norm penalty, but LASSO coefficients hit exactly zero as the penalty increases.
LASSO as Penalized Regression
• Instead of directly minimizing the residual sum of squares, the penalized regression takes the form
min RSS(f) + λ J(f)
where J(f) = ||β||_1 := Σ_{j=1}^p |βj| and f(X) = β0 + Σ_{j=1}^p Xj βj
LASSO(cont)
The computation is a quadratic programming problem.
We can obtain the solution path, which is piecewise linear.
Coefficients are non-linear in the response y (they are linear in y in ridge regression).
Regularization parameter is chosen by cross validation.
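A minimal LASSO sketch with the glmnet package (assumed installed; data simulated): the coefficient paths are piecewise linear and λ is chosen by cross-validation.
library(glmnet)
set.seed(10)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)       # only two truly active variables
fit <- glmnet(X, y, alpha = 1)                # alpha = 1: L1 (LASSO) penalty
plot(fit, xvar = "lambda")                    # piecewise-linear coefficient paths
cvfit <- cv.glmnet(X, y, alpha = 1)           # cross-validated choice of lambda
coef(cvfit, s = "lambda.min")                 # sparse: many coefficients exactly zero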
LASSO and Ridge: contours of RSS in the space of the β's
Generalize to the Lq norm as penalty
Minimize RSS subject to a constraint on the lq norm; equivalent to minimizing
RSS + λ Σ_{j=1}^p |βj|^q
Bridge regression, with ridge (q = 2) and LASSO (q = 1) as special cases; q = 1 is the smallest value giving a convex constraint region
For q = 0: best subset regression
For 0 < q < 1, the penalty is not convex!
Contours of constant values of L-q norms
Why non-convex norms?
LASSO is biased: for the scalar problem min_β (1/2)(y − β)² + λ|β|, the solution is soft thresholding,
β̂_λ = S_λ(y) = sign(y) max(|y| − λ, 0)
so E(β̂) ≠ E(y) even for y >> 0 (large signals are still shrunk by λ).
A nonconvex penalty is necessary for an unbiased estimator: for min_β (1/2)(y − β)² + λ J(β), the optimality condition gives ∂J(β) = (y − β)/λ → 0 as y → ∞, and a convex penalty (other than a constant) cannot have a derivative that vanishes for large β.
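A small sketch of the soft-thresholding operator S_λ, illustrating the bias: even large y is shifted down by λ.
soft <- function(y, lambda) sign(y) * pmax(abs(y) - lambda, 0)   # S_lambda(y)
y <- seq(-5, 5, by = 0.1)
plot(y, soft(y, lambda = 1), type = "l", xlab = "y", ylab = "beta_hat")
abline(0, 1, lty = 2)            # identity line; an unbiased estimator would follow it for large |y|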
Elastic Net as a compromise between Ridge and LASSO
(Zou and Hastie 2005)
The Group LASSO
Group norm l1-l2 (also l1-l∞)
Every group of variables is simultaneously selected or dropped
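A minimal elastic-net sketch with glmnet (assumed installed; data simulated), using alpha = 0.5 as an arbitrary mix of the L1 and L2 penalties. Group LASSO itself needs a dedicated package (e.g. grpreg or gglasso) and is not shown here.
library(glmnet)
set.seed(11)
n <- 100; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + X[, 2] + rnorm(n)
enet <- cv.glmnet(X, y, alpha = 0.5)          # elastic net: compromise between ridge and LASSO
coef(enet, s = "lambda.min")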
Methods using Derived Directions
Principal Components Regression
Partial Least Squares
Principal Components Regression
Principal Components Regression
Regress y on the first M principal components, M < p.
Motivation: the leading eigenvectors describe most of the variability in X.
(Figure: principal component directions Z1, Z2 in the (X1, X2) plane.)
Principal Components Regression
Zi and Zj are orthogonal now.
The dimension is reduced.
High correlations between the independent variables are eliminated.
Noise in the X's is removed (hopefully).
Computation: PCA + Regression
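A minimal PCR sketch: PCA of the inputs followed by least squares on the first M components (simulated data; M = 3 is an arbitrary choice). The pls package's pcr() with built-in cross-validation is a packaged alternative.
set.seed(12)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rnorm(p)) + rnorm(n)
pc <- prcomp(X, center = TRUE, scale. = TRUE)   # PCA of the inputs
M <- 3                                          # number of leading components kept (M < p)
Z <- pc$x[, 1:M]                                # derived, mutually orthogonal inputs Z_1, ..., Z_M
fit_pcr <- lm(y ~ Z)                            # regress y on the first M principal components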
Partial Least Square
Partial Least Squares (PLS)
Uses inputs as well as response y to form the directions Zm
Seeks directions that have high variance and high correlation with the response y. Popular in chemometrics. If the original inputs are orthogonal, PLS finds the LS solution after one step; subsequent steps have no effect.
Since the derived inputs use y, the estimates are non-linear functions of the response, when the inputs are not orthogonal
The coefficients for original variables tend to shrink as fewer PLS directions are used
Choice of M can be made via cross validation
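A minimal PLS sketch with the pls package (assumed installed; data simulated), with the number of directions M assessed by cross-validation.
library(pls)
set.seed(13)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("x", 1:p)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)
dat <- data.frame(y = y, X = I(X))
fit_pls <- plsr(y ~ X, data = dat, ncomp = 5, validation = "CV")   # PLS with cross-validation
summary(fit_pls)                                                   # CV error by number of components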
Partial Least Square Algorithm
PCR vs. PLS
Principal components regression chooses directions α that maximize Var(Xα).
Partial least squares chooses its m-th direction α to maximize Corr²(y, Xα) · Var(Xα).
The variance term tends to dominate, whence PLS is close to ridge.
Ridge, PCR and PLS
The solution paths of the different methods in a two-variable regression case (corr(X1, X2) = ρ, β = (4, 2)).
Comparisons of Estimated Prediction Errors (Prostate Cancer Example)
Test error (sd): 0.574 (0.156), 0.540 (0.168), 0.636 (0.172), 0.491 (0.152), 0.527 (0.122)
Least squares: test error 0.586, sd of error 0.184
LASSO and Forward Stagewise
Diabetes Data
LASSO and Forward Stagewise
Least Angle Regression (LARS)
Efron, Hastie, Johnstone, and Tibshirani (2003)
Recall: Forward Stagewise Selection
(Incremental) forward stagewise: standardize the input variables
Note:
LAR directions and Example
Relationship between those three
• Lasso and forward stagewise can be thought of as restricted versions of LARS.
• For Lasso: start with LARS; if a coefficient crosses zero, stop, drop that predictor, recompute the best direction and continue. This gives the LASSO path.
• For forward stagewise: start with LARS; select the most correlated direction at each stage and move in that direction by a small step.
• There are other related methods:
• Orthogonal Matching Pursuit
• Linearized Bregman Iteration
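A minimal sketch with the lars package (assumed installed; data simulated), which computes the LASSO, LAR, and forward-stagewise paths.
library(lars)
set.seed(14)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)
fit_lasso <- lars(X, y, type = "lasso")              # LASSO path via the LARS algorithm
fit_lar <- lars(X, y, type = "lar")                  # least angle regression
fit_stage <- lars(X, y, type = "forward.stagewise")  # incremental forward stagewise
plot(fit_lasso)                                      # piecewise-linear coefficient paths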
Homework Project I
Keyword Pricing (regression)
Homework Project II: Click Prediction (classification): two subproblems
click/impression
click/bidding
Data Directory: /data/ipinyou/
Files: bid.20130301.txt: Bidding log file, 1.2M rows, 470MB
imp.20130301.txt: Impression log, 0.8M rows, 360MB
clk.20130301.txt: Click log file, 796 rows, 330KB
data.zip: compressed files above (Password: ipinyou2013)
dsp_bidding_data_format.pdf: format file
Region&citys.txt: Region and City code
Questions: [email protected]
Homework Project II
Data input in R:
bid <- read.table("/Users/Liaohairen/DSP/bid.20130301.txt", sep='\t', comment.char='')
imp <- read.table("/Users/Liaohairen/DSP/imp.20130301.txt", sep='\t', comment.char='')
R's read.table by default uses '#' as the comment character (the comment.char = '#' parameter), but the user-agent field may contain a '#' character. To read the files correctly, turn off the interpretation of comments by setting comment.char=''.
Homework Project III
Heart Operation Effect Prediction (classification)
Note: a large amount of missing values