3. Linear Methods for Regression
Contents
Least Squares Regression
QR decomposition for Multiple Regression
Subset Selection
Coefficient Shrinkage
1. Introduction
Outline
• The simple linear regression model
• Multiple linear regression
• Model selection and shrinkage—the state of the art
Regression
[Scatter plot: Y versus X]
How can we model the generative process for this data?
Linear Assumption
A linear model assumes the regression function E(Y | X) is reasonably approximated by a linear function, i.e.

f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j, \qquad X = (X_1, \ldots, X_p)

• The regression function f(x) = E(Y | X = x) is the minimizer of the expected squared prediction error
• Making this assumption yields high bias but low variance
Least Squares Regression
Estimate the parameters based on a set of training data: (x1, y1)…(xN, yN)
Minimize the residual sum of squares:

\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2

This is a reasonable criterion when…
• the training samples are random, independent draws
• OR the y_i are conditionally independent given the x_i
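The RSS criterion can be sketched numerically; a minimal numpy illustration (the synthetic data, seed, and `rss` helper are my own assumptions, not from the slides):

```python
import numpy as np

# Hypothetical synthetic data: N = 50 samples, p = 2 features
rng = np.random.default_rng(0)
N, p = 50, 2
x = rng.normal(size=(N, p))
beta_true = np.array([1.5, -2.0])
y = 0.35 + x @ beta_true + rng.normal(scale=0.5, size=N)

def rss(beta0, beta, x, y):
    """Residual sum of squares for candidate coefficients (beta0, beta)."""
    residuals = y - beta0 - x @ beta
    return float(residuals @ residuals)

# The RSS at the true coefficients is far smaller than at beta = 0
rss_true = rss(0.35, beta_true, x, y)
rss_zero = rss(0.0, np.zeros(p), x, y)
```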
Matrix Notation
X is the N × (p+1) matrix of input vectors
y is the N-vector of outputs (labels)
β is the (p+1)-vector of parameters

X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}, \qquad
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}
Perfectly Linear Data
When the data is exactly linear, there exists a β such that

y = X\beta

(the linear regression model in matrix form)

Usually the data is not an exact fit, so…
Finding the Best Fit?
[Scatter plot: data fit from Y = 1.5X + 0.35 + N(0, 1.2)]
Minimize the RSS
We can rewrite the RSS in matrix form:

\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)

A least squares fit minimizes the RSS: solve for the parameters at which the first derivative of the RSS is zero.
Solving Least Squares
Derivative of a quadratic product:

\frac{d}{dx}\,(Ax + b)^T C (Dx + e) = A^T C (Dx + e) + D^T C^T (Ax + b)

Then,

\frac{\partial \mathrm{RSS}}{\partial \beta} = \frac{\partial}{\partial \beta}\,(y - X\beta)^T (y - X\beta) = -2 X^T (y - X\beta)

Setting the first derivative to zero:

X^T (y - X\beta) = 0 \;\Rightarrow\; X^T X \beta = X^T y
Least Squares Solution
• Least squares coefficients: \hat\beta = (X^T X)^{-1} X^T y

• Least squares predictions: \hat y = X\hat\beta = X (X^T X)^{-1} X^T y

• Estimated variance: \hat\sigma^2 = \frac{1}{N - p - 1} \sum_{i=1}^{N} (y_i - \hat y_i)^2
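These quantities translate directly to numpy; a sketch on assumed synthetic data, cross-checked against `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # prepend column of 1s
beta_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Normal-equations solution: beta_hat solves (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Predictions and estimated variance, sigma2_hat = RSS / (N - p - 1)
y_hat = X @ beta_hat
sigma2_hat = np.sum((y - y_hat) ** 2) / (N - p - 1)

# np.linalg.lstsq reaches the same coefficients via a more stable route
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
```

Using `np.linalg.solve` avoids forming an explicit inverse; the QR route on a later slide is more stable still.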
The N-dimensional Geometry of Least Squares Regression
Statistics of Least Squares
We can draw inferences about the parameters, β, by assuming the true model is linear with additive Gaussian noise, i.e.

Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)

Then,

\hat\beta \sim N\big(\beta,\; (X^T X)^{-1} \sigma^2\big)

(N - p - 1)\,\hat\sigma^2 \sim \sigma^2\,\chi^2_{N - p - 1}
Significance of One Parameter
Can we eliminate one parameter, X_j (j ≠ 0)?

Look at the standardized coefficient:

z_j = \frac{\hat\beta_j}{\hat\sigma \sqrt{v_j}} \sim t_{N - p - 1}

where v_j is the jth diagonal element of (X^T X)^{-1}.
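A numeric sketch of the standardized coefficient (the synthetic data and the choice of which feature carries signal are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
# Assumed setup: feature 1 matters, feature 2 is pure noise
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
y = X[:, 1] * 2.0 + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
p = X.shape[1] - 1
sigma_hat = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (N - p - 1))
v = np.diag(XtX_inv)                      # v_j = jth diagonal of (X^T X)^{-1}
z = beta_hat / (sigma_hat * np.sqrt(v))   # standardized coefficients
```

The signal coefficient gets a large |z|, while the noise coefficient's z behaves like a t draw near zero.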
Significance of Many Parameters
We may want to test many features at once. Compare model M₁ with p₁+1 parameters to a model M₀ with p₀+1 of M₁'s parameters (p₀ < p₁).

Use the F statistic:

F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)} \sim F_{p_1 - p_0,\; N - p_1 - 1}
Confidence Interval for Beta
We can find a confidence interval for β_j.

Confidence interval for a single parameter (1 − 2α confidence interval for β_j):

\big( \hat\beta_j - z^{(1-\alpha)}\sqrt{v_j}\,\hat\sigma,\;\; \hat\beta_j + z^{(1-\alpha)}\sqrt{v_j}\,\hat\sigma \big)

Confidence set for the entire parameter vector (bounds on β):

C_\beta = \big\{ \beta : (\hat\beta - \beta)^T X^T X (\hat\beta - \beta) \le \hat\sigma^2\,\chi^2_{p+1}{}^{(1-\alpha)} \big\}
2.1 Prostate Cancer (Example)

Data
• lcavol: log cancer volume
• lweight: log prostate weight
• age: age
• lbph: log of benign prostatic hyperplasia amount
• svi: seminal vesicle invasion
• lcp: log of capsular penetration
• gleason: Gleason score
• pgg45: percent of Gleason scores 4 or 5
Technique for Multiple Regression
Computing \hat\beta = (X^T X)^{-1} X^T y directly has poor numerical properties.

QR decomposition of X: decompose X = QR, where
• Q is N × (p+1) with orthonormal columns (Q^T Q = I_{p+1})
• R is a (p+1) × (p+1) upper triangular matrix

Then

\hat\beta = (R^T Q^T Q R)^{-1} R^T Q^T y = (R^T R)^{-1} R^T Q^T y = R^{-1} Q^T y

\hat y = X\hat\beta = Q Q^T y

The columns of X are built up from those of Q:

x_1 = r_{11} q_1
x_2 = r_{12} q_1 + r_{22} q_2
x_3 = r_{13} q_1 + r_{23} q_2 + r_{33} q_3
…
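A minimal numpy sketch of the QR route (random data assumed for illustration), checking that it matches the normal equations:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 80, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = rng.normal(size=N)

# QR route: X = QR, so X^T X beta = X^T y reduces to R beta = Q^T y
Q, R = np.linalg.qr(X)          # Q: N x (p+1), orthonormal columns; R upper triangular
beta_qr = np.linalg.solve(R, Q.T @ y)

# Same answer as the (numerically worse) normal equations
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
```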
Gram-Schmidt Procedure
1) Initialize z_0 = x_0 = 1
2) For j = 1 to p:
   For k = 0 to j−1, regress x_j on the z_k's (univariate least squares estimates):

   \hat\gamma_{kj} = \frac{\langle z_k, x_j \rangle}{\langle z_k, z_k \rangle}

   Then compute the next residual:

   z_j = x_j - \sum_{k=0}^{j-1} \hat\gamma_{kj} z_k

3) Let Z = [z_0\ z_1\ \cdots\ z_p] and let Γ be upper triangular with entries \hat\gamma_{kj}. Then

X = Z\Gamma = Z D^{-1} D \Gamma = QR

where D is diagonal with D_{jj} = \lVert z_j \rVert.
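The steps above can be sketched directly. This is classical Gram-Schmidt, fine for illustration on well-conditioned data, though modified Gram-Schmidt or Householder reflections are preferred numerically:

```python
import numpy as np

def gram_schmidt_qr(X):
    """QR decomposition of X via the classical Gram-Schmidt procedure."""
    N, k = X.shape
    Z = np.zeros((N, k))
    Gamma = np.eye(k)                 # upper triangular, gamma_jj = 1
    for j in range(k):
        z = X[:, j].copy()
        for i in range(j):
            # univariate least squares coefficient of x_j on z_i
            gamma = Z[:, i] @ X[:, j] / (Z[:, i] @ Z[:, i])
            Gamma[i, j] = gamma
            z -= gamma * Z[:, i]      # next residual
        Z[:, j] = z
    D = np.diag(np.linalg.norm(Z, axis=0))
    Q = Z @ np.linalg.inv(D)          # normalize columns: X = Z Gamma = Q R
    R = D @ Gamma
    return Q, R

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 3))])
Q, R = gram_schmidt_qr(X)
```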
Subset Selection
We want to eliminate unnecessary features
Best subset regression
• Choose the subset of size k with the lowest RSS
• The Leaps and Bounds procedure works with p up to about 40

Forward stepwise selection
• Repeatedly add the feature with the largest F-ratio:

F = \frac{\mathrm{RSS}_0 - \mathrm{RSS}_1}{\mathrm{RSS}_1/(N - p_1 - 1)} \sim F_{1,\; N - p_1 - 1}

Backward stepwise selection
• Remove the features with small F-ratios

These are greedy techniques – not guaranteed to find the best model.
Coefficient Shrinkage
Use additional penalties to reduce coefficients
Ridge regression
• Minimize least squares subject to \sum_{j=1}^{p} \beta_j^2 \le s

The lasso
• Minimize least squares subject to \sum_{j=1}^{p} |\beta_j| \le s

Principal components regression
• Regress on M < p principal components of X

Partial least squares
• Regress on M < p directions of X weighted by y
4.2 Prostate Cancer Data Example (Continued)
Error Comparison
Shrinkage Methods (Ridge Regression)
Minimize RSS(β) + λ β^T β
• Use centered data, so β_0 is not penalized:

\hat\beta_0 = \bar y = \frac{1}{N}\sum_{i=1}^{N} y_i, \qquad x_{ij} \leftarrow x_{ij} - \bar x_j

• The inputs x_i are now vectors of length p, no longer including the initial 1

The ridge estimates are:

\mathrm{RSS}(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda\,\beta^T \beta

\frac{\partial \mathrm{RSS}}{\partial \beta} = -2 X^T (y - X\beta) + 2\lambda\beta = 0

\hat\beta^{\mathrm{ridge}} = (X^T X + \lambda I_p)^{-1} X^T y
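The closed-form ridge estimate on centered data, as a numpy sketch (synthetic data assumed); note how the penalty shrinks the coefficient norm:

```python
import numpy as np

rng = np.random.default_rng(6)
N, p = 60, 5
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, 0.0, -1.0, 2.0, 0.0]) + rng.normal(size=N)

# Center, so the intercept beta_0 = mean(y) is not penalized
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge(Xc, yc, lam):
    """Closed-form ridge estimate (X^T X + lam I)^{-1} X^T y on centered data."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

beta_ols = ridge(Xc, yc, 0.0)    # lam = 0 recovers ordinary least squares
beta_r = ridge(Xc, yc, 10.0)     # lam > 0 shrinks the coefficients
```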
Shrinkage Methods (Ridge Regression)
The Lasso
Use centered data, as before
The L1 penalty makes the solutions nonlinear in the y_i
• Quadratic programming is used to compute them

\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s
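The constrained form above has an equivalent penalized form, min ½‖y − Xβ‖² + λ‖β‖₁, which is what most solvers target. A coordinate-descent sketch using soft-thresholding — a common alternative to the quadratic-programming formulation; the data and λ here are illustrative assumptions:

```python
import numpy as np

def soft_threshold(a, t):
    """Soft-thresholding operator, the scalar solution of the L1 step."""
    return np.sign(a) * max(abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso on centered data via coordinate descent
    (penalized form: min 0.5*||y - X beta||^2 + lam * ||beta||_1)."""
    N, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding j
            beta[j] = soft_threshold(X[:, j] @ r, lam) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 8))
X -= X.mean(axis=0)
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=100)
y -= y.mean()

beta_lasso = lasso_cd(X, y, lam=50.0)
```

With a large enough λ the irrelevant coefficients are driven exactly to zero, which is the feature-selection behavior of the lasso.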
Shrinkage Methods (Lasso Regression)
Principal Components Regression
Singular Value Decomposition (SVD) of X:

X = U D V^T

• U is N × p, V is p × p; both have orthonormal columns
• D is a p × p diagonal matrix of singular values

Use linear combinations z_j = X v_j of X as new features:
• v_j is the principal component (column of V) corresponding to the jth largest singular value
• the v_j are the directions of maximal sample variance
• use only M < p features; [z_1 … z_M] replaces X

\hat y^{\mathrm{pcr}} = \bar y\,\mathbf{1} + \sum_{m=1}^{M} \hat\theta_m z_m, \qquad \hat\theta_m = \frac{\langle z_m, y \rangle}{\langle z_m, z_m \rangle}
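A numpy sketch of principal components regression via the SVD (synthetic data assumed; `M` is an arbitrary illustrative cutoff):

```python
import numpy as np

rng = np.random.default_rng(8)
N, p = 120, 6
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                  # centered inputs
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=N)

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V^T
M = 3                                              # keep the top M directions
Z = X @ Vt[:M].T                                   # principal components z_m = X v_m

# Univariate regressions on each (mutually orthogonal) component
theta = np.array([Z[:, m] @ y / (Z[:, m] @ Z[:, m]) for m in range(M)])
y_pcr = y.mean() + Z @ theta
```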
Partial Least Squares
Construct linear combinations of inputs incorporating y
Finds directions with maximum variance and correlation with the output
The variance aspect tends to dominate, so partial least squares operates much like principal components regression
4.4 Methods Using Derived Input Directions (PLS)
• Partial Least Squares
Discussion: a comparison of the selection and shrinkage methods
4.5 Discussion: a comparison of the selection and shrinkage methods
A Unifying View
We can view all the linear regression techniques under a common framework:

\hat\beta = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \right\}

λ controls the amount of shrinkage (bias); q indicates a prior distribution on β
• λ = 0: least squares
• λ > 0, q = 0: subset selection (counts the number of nonzero parameters)
• λ > 0, q = 1: the lasso
• λ > 0, q = 2: ridge regression
Discussion: a comparison of the selection and shrinkage methods
• Family of Shrinkage Regression