Lecture 4. Linear Models for Regression
Outline
Linear Regression
Least Square Solution
Subset Least Squares: subset selection / forward / backward
Penalized Least Squares: Ridge Regression, LASSO, Elastic Net (LASSO + Ridge)
Linear Methods for Regression
Input (FEATURES) vector, p-dimensional: X = (X1, X2, …, Xp)
Real-valued OUTPUT: Y
Joint distribution of (Y, X)
Regression function: E(Y | X) = f(X)
Training data: (x1, y1), (x2, y2), …, (xN, yN), for estimation of the input-output relation f.
Linear Model
f(x): Regression function or a good approximation
LINEAR in the unknown parameters (weights, coefficients):
f(X) = β0 + Σ_{j=1}^p βj Xj, with parameters β0, β1, …, βp
Features
Quantitative inputs: any arbitrary but known function of measured attributes
Transformations of quantitative attributes: g(x), e.g., log, square, square-root etc.
Basis expansions: e.g., a polynomial approximation of f as a function of X1 (Taylor Series expansion with unknown coefficients)
X2 = X1², X3 = X1³, …, Xk = X1^k
Features (Cont.)
Qualitative (categorical) input G. Dummy codes: for an attribute with k categories, one may use k indicator variables Xj, j = 1, 2, …, k, indicating the category (level) taken. Together, this collection of inputs represents the effect of G through
Σ_{j=1}^k βj Xj
This is a set of level-dependent constants, since only one of the Xj equals one and the others are zero.
Features (cont.)
Interactions: 2nd- or higher-order interactions of some features, e.g., X3 = X1·X2, X4 = X1·X2·X3
Feature vector for the i-th case in the training set (example): xi = (xi1, xi2, …, xip)^T
Generalized Linear Models: Basis Expansion
Wide variety of flexible models
Model for f is a linear expansion of basis functions
Dictionary: Prescribed basis functions
f_θ(x) = Σ_{k=1}^K θk hk(x)
Other Basis Functions
Polynomial basis of degree s (smooth functions C^s)
Fourier series (band-limited functions, a compact subspace of C^∞)
Splines: piecewise polynomials of degree K between the knots, joined with continuity of order K−1 at the knots (Sobolev spaces)
Wavelets (Besov spaces)
Radial basis functions: symmetric p-dimensional kernels located at particular centroids, f(|x − y|)
Gaussian kernel at each centroid
And more …
-- Curse of Dimensionality: p could be equal to or much larger than n.
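As an illustration (not part of the original slides), the sketch below fits a polynomial basis and a B-spline basis with lm() on simulated data; the data, degrees of freedom, and variable names are hypothetical choices.
## Basis expansions with lm() on simulated data (illustrative sketch)
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)           # some unknown smooth f plus noise
fit_poly <- lm(y ~ poly(x, degree = 5))      # polynomial basis of degree 5
library(splines)
fit_spline <- lm(y ~ bs(x, df = 8))          # cubic B-spline basis (piecewise polynomials)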
Method of Least Squares
Find the coefficients β = (β0, β1, …, βp)^T that minimize the Residual Sum of Squares
RSS(β) = Σ_{i=1}^N (yi − f(xi))² = Σ_{i=1}^N (yi − β0 − Σ_{j=1}^p xij βj)²
RSS is the empirical risk over the training set. It does not by itself assure predictive performance over all inputs of interest.
Min RSS Criterion
Statistically reasonable provided the examples in the training set are a large number of independent random draws from the population of inputs for which prediction is desired.
Given the inputs (x1, x2, …, xN), the outputs (y1, y2, …, yN) are conditionally independent.
In principle, predictive performance over the set of future input vectors should be examined.
Gaussian noise: the least squares method is equivalent to maximum likelihood.
Minimize RSS(β) over β in R^(p+1); RSS is a quadratic function of β.
Optimal solution: take the derivatives with respect to the elements of β and set them equal to zero.
Optimal Solution
The Hessian (2nd derivative) of the criterion function is X^T X.
The optimal solution satisfies the normal equations
X^T (Y − Xβ) = 0, or (X^T X) β = X^T Y
For a unique solution, the matrix X^T X must have full rank.
β̂ = (X^T X)^(-1) X^T Y
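A minimal R sketch of the normal-equations solution on simulated data (data and dimensions are hypothetical); it should agree with lm().
set.seed(2)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))          # design matrix with an intercept column
beta_true <- c(1, 2, -1, 0.5)
y <- drop(X %*% beta_true) + rnorm(n)
beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves (X'X) beta = X'y
y_fitted <- X %*% beta_hat                         # HY: projection of y onto the column space of X
coef(lm(y ~ X - 1))                                # same estimates from lm()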
Projection
When the matrix X^T X is full rank, the estimated response for the training set is:
H: Projection (Hat) matrix
HY: Orthogonal Projection of Y on the space spanned by the columns of X
Note: the projection is linear in Y
Ŷ = X (X^T X)^(-1) X^T Y = HY, where H = X (X^T X)^(-1) X^T
Geometrical Insight
Simple Univariate Regression
One variable, no intercept: y = xβ + ε
LS estimate: β̂ = ⟨x, y⟩ / ⟨x, x⟩ = (Σ_{i=1}^N xi yi) / (Σ_{i=1}^N xi²)
Inner product: ⟨x, y⟩ = Σ_{i=1}^N xi yi = x^T y
⟨x, y⟩, normalized by the lengths of x and y, = cosine of the angle between the vectors x and y, a measure of similarity between y and x
Residuals r = y − x β̂: the projection of y on the normal space (orthogonal complement of x)
Definition: "Regress b on a"
Simple regression of response b on input a, with no intercept
Estimate: γ̂ = ⟨a, b⟩ / ⟨a, a⟩
Residual b − γ̂ a: "b adjusted for a", "b orthogonalized with respect to a"
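A minimal sketch of the one-variable, no-intercept estimate and of "regress b on a", on simulated data (all names hypothetical).
set.seed(3)
x <- rnorm(50); y <- 2 * x + rnorm(50)
beta_hat <- sum(x * y) / sum(x * x)   # <x, y> / <x, x>
r <- y - beta_hat * x                 # residual, orthogonal to x
sum(r * x)                            # approximately 0
## "regress b on a": gamma_hat <- sum(a * b) / sum(a * a); residual is b - gamma_hat * a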
Multiple Regression
Multiple regression: p > 1
LS estimates differ from the simple univariate regression estimates, unless the columns of the input matrix X are orthogonal:
If ⟨xi, xj⟩ = 0 for all i ≠ j, then β̂j = ⟨xj, y⟩ / ⟨xj, xj⟩. These estimates are uncorrelated, and Var(β̂j) = σ² / ⟨xj, xj⟩.
Orthogonal inputs occur sometimes in balanced, designed experiments (experimental design).
Observational studies will almost never have orthogonal inputs.
Must "orthogonalize" them in order to have a similar interpretation: use the Gram-Schmidt procedure to obtain an orthogonal basis for multiple regression.
Multiple Regression Estimates: Sequence of Simple Regressions
Regression by Successive Orthogonalization:
Initialize z0 = x0 = 1.
For j = 1, 2, …, p: regress xj on z0, z1, …, z_{j−1} to produce coefficients γ̂_{lj} = ⟨zl, xj⟩ / ⟨zl, zl⟩, l = 0, …, j−1, and the residual vector zj = xj − Σ_{k=0}^{j−1} γ̂_{kj} zk.
Regress y on the residual vector zp for the estimate β̂p = ⟨zp, y⟩ / ⟨zp, zp⟩.
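A sketch of the successive-orthogonalization recipe in R on simulated data (dimensions and names are hypothetical); the coefficient from the last residual matches the usual multiple-regression coefficient.
set.seed(4)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(1, 2, 3)) + rnorm(n)
Xa <- cbind(1, X)                                   # x_0 = 1, x_1, ..., x_p
Z <- matrix(0, n, p + 1)
Z[, 1] <- 1                                         # z_0 = x_0 = 1
for (j in 2:(p + 1)) {
  zj <- Xa[, j]
  for (l in 1:(j - 1))                              # remove projections on earlier z's
    zj <- zj - sum(Z[, l] * Xa[, j]) / sum(Z[, l]^2) * Z[, l]
  Z[, j] <- zj
}
sum(Z[, p + 1] * y) / sum(Z[, p + 1]^2)             # beta_hat_p = <z_p, y> / <z_p, z_p>
coef(lm(y ~ X))[p + 1]                              # matches the usual LS coefficient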
Instead of using x1 and x2, take x1 and z as features
Multiple Regression = Gram-Schmidt Orthogonalization
Procedure: the vector zp is the residual of the multiple regression of xp on all the other inputs x0, x1, …, x_{p−1}.
Successive z's in the above algorithm are orthogonal and form an orthogonal basis for the column space of X.
The least squares projection onto this subspace is the usual fitted vector ŷ.
By re-arranging the order of these variables, any input xj can be labeled as the p-th variable.
If xp is highly correlated with the other variables, the residual vector zp is quite small, and the coefficient β̂p has high variance.
Statistical Properties of LS
Model: Y = Xβ + ε
Uncorrelated noise: mean zero, variance σ²
Then Var(β̂) = σ² (X^T X)^(-1)
Noise estimation: σ̂² = (1 / (N − p − 1)) Σ_{i=1}^N (yi − ŷi)²
Model d.f. = p + 1 (dimension of the model space)
To draw inferences on the parameter estimates, we need assumptions on the noise:
If we assume Gaussian noise, ε ~ N(0, σ² I), then
β̂ ~ N(β, σ² (X^T X)^(-1)) and (N − p − 1) σ̂² ~ σ² χ²_{N−p−1}
Gauss-Markov Theorem
(The Gauss-Markov Theorem) If we have any linear estimator c^T y that is unbiased for a^T β, that is, E(c^T y) = a^T β, then
Var(a^T β̂) ≤ Var(c^T y)
It says that, for inputs in the row space of X, the LS estimate has minimum variance among all linear unbiased estimates.
Bias-Variance Tradeoff
Mean squared error of an estimator = variance + bias²
The least squares estimator achieves the minimal variance among all linear unbiased estimators.
There are biased estimators that further reduce the variance: Stein's estimator, shrinkage/thresholding (LASSO, etc.)
The more complicated a model is, the more variance but the less bias; a trade-off is needed.
Hypothesis Test
• Single parameter test: H0: βj = 0, t-statistic
zj = β̂j / (σ̂ √vj) ~ t_{N−p−1}
where vj is the j-th diagonal element of V = (X^T X)^(-1)
Confidence interval: β̂j ± z_{1−α} vj^{1/2} σ̂, e.g. z_{1−0.025} = 1.96
• Group of parameters: H0: β_Ω = 0, where Ω is a group of p1 − p0 coefficients; F-statistic for nested models
F = [(RSS0 − RSS1) / (p1 − p0)] / [RSS1 / (N − p1 − 1)]
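An illustrative R sketch (simulated data, hypothetical variables): summary() gives the t-statistics, confint() the confidence intervals, and anova() the F-test for nested models.
set.seed(5)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)            # x3 has no true effect
fit_full <- lm(y ~ x1 + x2 + x3)
summary(fit_full)                          # t-statistics for each H0: beta_j = 0
confint(fit_full, level = 0.95)            # beta_hat_j +/- quantile * se
fit_small <- lm(y ~ x1)                    # nested model
anova(fit_small, fit_full)                 # F-test comparing the nested models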
Example
R command: lm(Y ~ x1 + x2 + … +xp)
Rank Deficiency
X rank deficient: the normal equations have infinitely many solutions β̃.
The hat matrix H and the projection Ŷ = X β̃ = HY are unique.
For an input in the row space of X, the LS estimate of the response is unique.
For an input not in the row space of X, the estimate may change with the solution β̃ used. How to generalize to inputs outside the training set?
Penalized methods (!)
Reasons for Alternatives to LS Estimates
Prediction accuracy: LS estimates have low bias but high variance when inputs are highly correlated
Larger ESPE
Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. A small bias in the estimates may yield a large decrease in variance.
The bias/variance tradeoff may provide better predictive ability.
Better interpretation: with a large number of input variables, we would like to determine a smaller subset that exhibits the strongest effects.
Many tools to achieve these objectives
Subset selection
Penalized regression: constrained optimization
Best Subset Selection Method
Algorithm: leaps & bounds. Find the best subset, corresponding to the smallest RSS, for each size k ∈ {0, 1, 2, …, p}.
For each fixed size k, can also find a specified number of subsets close to the best. For each fixed subset, obtain the LS estimates. Feasible for p ≈ 40.
Choice of the optimal k is based on model selection criteria to be discussed later.
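A minimal sketch using the leaps package (assumed installed; data simulated and hypothetical) to find the best subset of each size.
library(leaps)
set.seed(6)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("x", 1:p)
y <- 2 * X[, 1] - X[, 3] + rnorm(n)
fit <- regsubsets(x = X, y = y, nvmax = p)   # leaps-and-bounds best subsets of sizes 1..p
summary(fit)$which                           # which variables enter the best subset of each size
summary(fit)$rss                             # RSS of the best subset of each size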
Other Subset Selection Procedures
For larger p: classical forward selection (step-up),
backward elimination (step-down),
hybrid forward-backward (step-wise) methods. Given a model, these methods only provide local controls for variable selection or deletion:
Which current variable is least effective (candidate for deletion)
Which variable not in the model is most effective (candidate for inclusion)
Do not attempt to find the best subset of a given size
Not too popular in current practice
Forward Stagewise Selection
(Incremental) forward stagewise: standardize the input variables. A sketch of the update steps follows below.
Note:
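A minimal sketch of the incremental forward stagewise updates on simulated, standardized data (the step size, number of steps, and data are hypothetical choices).
set.seed(7)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))        # standardized inputs
y <- drop(X %*% c(3, 0, -2, 0, 0)) + rnorm(n)
y <- y - mean(y)
beta <- rep(0, p); r <- y; eps <- 0.01
for (step in 1:2000) {
  cors <- drop(crossprod(X, r))               # inner product of each input with the current residual
  j <- which.max(abs(cors))                   # most correlated predictor
  beta[j] <- beta[j] + eps * sign(cors[j])    # tiny update in that direction
  r <- r - eps * sign(cors[j]) * X[, j]       # update the residual
}
round(beta, 2)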
Penalized Regression
• Instead of directly minimizing the residual sum of squares, penalized regression usually takes the form
min RSS(f) + λ J(f)
where J(f) is the penalization term, which usually penalizes the smoothness or complexity of the function f.
λ is chosen by cross-validation.
Model Assessment and Selection
If we are in a data-rich situation, split the data into three parts: training, validation, and testing.
Train Validation Test
See chapter 7.1 for details
Cross Validation
When the sample size is not sufficiently large, cross-validation is a way to estimate the out-of-sample prediction error (or classification rate).
Randomly split the available data into Training and Test sets to get error1.
Split many times to get error2, …, errorm, then average over all the errors to get an estimate.
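A minimal K-fold cross-validation sketch for lm() on simulated data (K, the data, and variable names are hypothetical).
set.seed(8)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("x", 1:p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)
dat <- data.frame(y, X)
K <- 10
fold <- sample(rep(1:K, length.out = n))      # random fold assignment
cv_err <- sapply(1:K, function(k) {
  fit <- lm(y ~ ., data = dat[fold != k, ])   # train on K-1 folds
  pred <- predict(fit, newdata = dat[fold == k, ])
  mean((dat$y[fold == k] - pred)^2)           # squared error on the held-out fold
})
mean(cv_err)                                  # CV estimate of the out-of-sample error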
Ridge Regression (Tikhonov Regularization)
Ridge regression shrinks coefficients by imposing a penalty on their size
Minimize a penalized RSS:
β̂^ridge = argmin_β { RSS(β) + λ Σ_{j=1}^p βj² }, λ ≥ 0
Here λ is the complexity parameter that controls the amount of shrinkage
The larger its value, the greater the amount of shrinkage
Coefficients are shrunk towards zero
Choice of the penalty parameter is based on cross-validation
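A minimal sketch of ridge regression via the closed form, with standardized inputs and a centered response (simulated data; λ is an arbitrary illustrative value). glmnet with alpha = 0, if installed, gives the ridge path with a cross-validated λ.
set.seed(9)
n <- 100; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))        # standardize inputs before ridge
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)
y <- y - mean(y)                              # center y; intercept handled by the mean of the response
lambda <- 5
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))   # (X'X + lambda I)^{-1} X'y
round(drop(beta_ridge), 3)
## library(glmnet); cv.glmnet(X, y, alpha = 0)  # ridge path with CV choice of lambda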
Prostate Cancer Example
Ridge Regression (cont)
Equivalent problem: minimize RSS subject to Σ_{j=1}^p βj² ≤ s
Lagrange multiplier: 1-1 correspondence between s and λ
With many correlated variables, LS estimates can become unstable and exhibit high variance and high correlations
A wildly large positive coefficient on one variable can be cancelled by a large negative coefficient on another
Imposing a size constraint on the coefficients prevents this phenomenon from occurring
Ridge solutions are not invariant under scaling of the inputs
Normally standardize the inputs before solving the optimization problem
Since the penalty term does not include the bias term, estimate the intercept by the mean of the response y
Ridge Regression (cont)
The ridge criterion: RSS(λ) = (y − Xβ)^T (y − Xβ) + λ β^T β
Solution: β̂^ridge = (X^T X + λ I)^(-1) X^T Y
Shrinkage: for orthogonal inputs, ridge is a scaled version of the LS estimates, β̂^ridge = γ β̂, 0 ≤ γ ≤ 1
Ridge is the mean or mode of the posterior distribution of β under a normal prior
Centered input matrix X; SVD of X: X = U D V^T
U and V are orthogonal matrices
Columns of U span the column space of X
Columns of V span the row space of X
D: a diagonal matrix of singular values d1 ≥ d2 ≥ … ≥ dp ≥ 0
Eigen decomposition of X^T X = V D² V^T
The eigenvectors vj: principal component directions of X (Karhunen-Loeve directions)
Ridge Regression and Principal Components
First PC direction v1: among all normalized linear combinations of the columns of X, the derived variable z1 = X v1 = u1 d1 has the largest sample variance; z1 is the first principal component of X.
Subsequent PCs zj have maximal variance subject to being orthogonal to the earlier ones. The last PC has minimal variance.
Effective Degrees of Freedom: for ridge, df(λ) = Σ_{j=1}^p dj² / (dj² + λ).
Ridge Regression (Summary)
Ridge regression penalizes the complexity of a linear model by the sum of squares of the coefficients.
It is equivalent to minimizing RSS subject to the constraint Σ βj² ≤ s.
The matrix (X^T X + λ I) is always invertible (for λ > 0).
The penalization parameter λ controls how simple "you" want the model to be.
Prostate Cancer Example
Ridge Regression (Summary)
Solutions are not sparse in the coefficient space:
the βj's are not zero almost all the time.
The computational complexity is O(p³) for inverting the matrix X^T X + λ I.
Prostate Cancer Example
Least Absolute Shrinkage and Selection Operator (LASSO)
Penalized RSS with an L1-norm penalty, or minimization subject to the constraint Σ_{j=1}^p |βj| ≤ t
Shrinks like ridge with its L2-norm penalty, but LASSO coefficients hit exactly zero as the penalty increases.
LASSO as Penalized Regression
• Instead of directly minimizing the residual sum of squares, the penalized regression takes the form
min RSS(f) + λ J(f)
where J(f) = ||β||_1 := Σ_{j=1}^p |βj| and f(X) = β0 + Σ_{j=1}^p Xj βj
LASSO(cont)
The computation is a quadratic programming problem.
We can obtain the solution path, which is piecewise linear.
Coefficients are non-linear in the response y (they are linear in y in ridge regression).
Regularization parameter is chosen by cross validation.
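A minimal LASSO sketch with the glmnet package (assumed installed; data simulated): the coefficient paths are piecewise linear and λ is chosen by cross-validation.
library(glmnet)
set.seed(10)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)       # only two truly active variables
fit <- glmnet(X, y, alpha = 1)                # alpha = 1: L1 (LASSO) penalty
plot(fit, xvar = "lambda")                    # piecewise-linear coefficient paths
cvfit <- cv.glmnet(X, y, alpha = 1)           # cross-validated choice of lambda
coef(cvfit, s = "lambda.min")                 # sparse: many coefficients exactly zero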
LASSO and Ridge: contours of RSS in the space of the β's
Generalize to the Lq norm as penalty
Minimize RSS subject to a constraint on the lq norm; equivalent to minimizing
RSS + λ Σ_{j=1}^p |βj|^q
Bridge regression, with ridge (q = 2) and LASSO (q = 1) as special cases; q = 1 is the smallest value giving a convex constraint region
For q = 0: best subset regression
For 0 < q < 1, the penalty is not convex!
Contours of constant values of L-q norms
Why non-convex norms?
LASSO is biased: for the scalar problem min_β (1/2)(y − β)² + λ|β|, the solution is soft thresholding,
β̂_λ = S_λ(y) = sign(y) max(|y| − λ, 0)
so E(β̂) ≠ E(y) even for y >> 0 (large signals are still shrunk by λ).
A nonconvex penalty is necessary for an unbiased estimator: for min_β (1/2)(y − β)² + λ J(β), the optimality condition gives ∂J(β) = (y − β)/λ → 0 as y → ∞, and a convex penalty (other than a constant) cannot have a derivative that vanishes for large β.
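A small sketch of the soft-thresholding operator S_λ, illustrating the bias: even large y is shifted down by λ.
soft <- function(y, lambda) sign(y) * pmax(abs(y) - lambda, 0)   # S_lambda(y)
y <- seq(-5, 5, by = 0.1)
plot(y, soft(y, lambda = 1), type = "l", xlab = "y", ylab = "beta_hat")
abline(0, 1, lty = 2)            # identity line; an unbiased estimator would follow it for large |y|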
Elastic Net as a compromise between Ridge and LASSO
(Zou and Hastie 2005)
The Group LASSO
Group norm l1-l2 (also l1-l∞)
Every group of variables is simultaneously selected or dropped
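A minimal elastic-net sketch with glmnet (assumed installed; data simulated), using alpha = 0.5 as an arbitrary mix of the L1 and L2 penalties. Group LASSO itself needs a dedicated package (e.g. grpreg or gglasso) and is not shown here.
library(glmnet)
set.seed(11)
n <- 100; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + X[, 2] + rnorm(n)
enet <- cv.glmnet(X, y, alpha = 0.5)          # elastic net: compromise between ridge and LASSO
coef(enet, s = "lambda.min")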
Methods using Derived Directions
Principal Components Regression
Partial Least Squares
Principal Components Regression
Principal Components Regression
Regress y on the first M principal components, M < p.
Motivation: the leading eigenvectors describe most of the variability in X.
(Figure: principal component directions Z1, Z2 in the (X1, X2) plane.)
Principal Components Regression
Zi and Zj are orthogonal now.
The dimension is reduced.
High correlations between the independent variables are eliminated.
Noise in the X's is removed (hopefully).
Computation: PCA + Regression
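A minimal PCR sketch: PCA of the inputs followed by least squares on the first M components (simulated data; M = 3 is an arbitrary choice). The pls package's pcr() with built-in cross-validation is a packaged alternative.
set.seed(12)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rnorm(p)) + rnorm(n)
pc <- prcomp(X, center = TRUE, scale. = TRUE)   # PCA of the inputs
M <- 3                                          # number of leading components kept (M < p)
Z <- pc$x[, 1:M]                                # derived, mutually orthogonal inputs Z_1, ..., Z_M
fit_pcr <- lm(y ~ Z)                            # regress y on the first M principal components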
Partial Least Square
Partial Least Squares (PLS)
Uses inputs as well as response y to form the directions Zm
Seeks directions that have high variance and high correlation with the response y. Popular in chemometrics. If the original inputs are orthogonal, PLS finds the LS solution after one step; subsequent steps have no effect.
Since the derived inputs use y, the estimates are non-linear functions of the response, when the inputs are not orthogonal
The coefficients for original variables tend to shrink as fewer PLS directions are used
Choice of M can be made via cross validation
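A minimal PLS sketch with the pls package (assumed installed; data simulated), with the number of directions M assessed by cross-validation.
library(pls)
set.seed(13)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("x", 1:p)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)
dat <- data.frame(y = y, X = I(X))
fit_pls <- plsr(y ~ X, data = dat, ncomp = 5, validation = "CV")   # PLS with cross-validation
summary(fit_pls)                                                   # CV error by number of components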
Partial Least Square Algorithm
PCR vs. PLS
Principal components regression chooses directions α that maximize Var(Xα).
Partial least squares chooses its m-th direction α to maximize Corr²(y, Xα) · Var(Xα).
The variance term tends to dominate, whence PLS is close to ridge.
Ridge, PCR and PLS
The solution paths of the different methods in a two-variable regression case (corr(X1, X2) = ρ, β = (4, 2)).
Comparisons of Estimated Prediction Errors (Prostate Cancer Example)
Test error (sd): 0.574 (0.156), 0.540 (0.168), 0.636 (0.172), 0.491 (0.152), 0.527 (0.122)
Least squares: test error 0.586, sd of error 0.184
LASSO and Forward Stagewise
Diabetes Data
LASSO and Forward Stagewise
Least Angle Regression (LARS)
Efron, Hastie, Johnstone, and Tibshirani (2003)
Recall: Forward Stagewise Selection
(Incremental) forward stagewise: standardize the input variables
Note:
LAR directions and Example
Relationship between those three
• Lasso and forward stagewise can be thought of as restricted versions of LARS.
• For Lasso: start with LARS; if a coefficient crosses zero, stop, drop that predictor, recompute the best direction and continue. This gives the LASSO path.
• For forward stagewise: start with LARS; select the most correlated direction at each stage and move in that direction by a small step.
• There are other related methods:
• Orthogonal Matching Pursuit
• Linearized Bregman Iteration
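A minimal sketch with the lars package (assumed installed; data simulated), which computes the LASSO, LAR, and forward-stagewise paths.
library(lars)
set.seed(14)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)
fit_lasso <- lars(X, y, type = "lasso")              # LASSO path via the LARS algorithm
fit_lar <- lars(X, y, type = "lar")                  # least angle regression
fit_stage <- lars(X, y, type = "forward.stagewise")  # incremental forward stagewise
plot(fit_lasso)                                      # piecewise-linear coefficient paths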
Homework Project I
Keyword Pricing (regression)
Homework Project II: Click Prediction (classification): two subproblems
click/impression
click/bidding
Data Directory: /data/ipinyou/
Files: bid.20130301.txt: Bidding log file, 1.2M rows, 470MB
imp.20130301.txt: Impression log, 0.8M rows, 360MB
clk.20130301.txt: Click log file, 796 rows, 330KB
data.zip: compressed files above (Password: ipinyou2013)
dsp_bidding_data_format.pdf: format file
Region&citys.txt: Region and City code
Questions: [email protected]
Homework Project II
Data input in R:
bid <- read.table("/Users/Liaohairen/DSP/bid.20130301.txt", sep='\t', comment.char='')
imp <- read.table("/Users/Liaohairen/DSP/imp.20130301.txt", sep='\t', comment.char='')
R's read.table by default uses '#' as the comment character (the comment.char = '#' parameter), but the user-agent field may contain a '#' character. To read the files correctly, turn off the interpretation of comments by setting comment.char=''.
Homework Project III
Heart Operation Effect Prediction (classification)
Note: a large amount of missing values