Lecture 11: Regression Methods I (Linear Regression)

Hao Helen Zhang

Fall, 2017



Outline

1. Regression: Supervised Learning with Continuous Responses

2. Linear Models and Multiple Linear Regression
   - Ordinary Least Squares
   - Statistical inferences
   - Computational algorithms


Regression Models

If the response Y takes real values, we refer to this type of supervised learning problem as a regression problem.

- linear regression models (parametric models)

- nonparametric regression (splines, kernel estimators, local polynomial regression)

- semiparametric regression

Broad coverage:

penalized regression, regression trees, support vector regression, quantile regression


Linear Regression Models

A standard linear regression model assumes

$y_i = x_i^T\beta + \varepsilon_i$, where the $\varepsilon_i$ are i.i.d. with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$,

$y_i$ is the response for the $i$th observation, $x_i \in \mathbb{R}^d$ is the covariate vector,

$\beta \in \mathbb{R}^d$ is the $d$-dimensional parameter vector.

Common model assumptions:

independence of errors

constant error variance (homoscedasticity)

ε independent of X.

Normality is not needed.
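As a concrete illustration, here is a minimal simulation sketch of this model in Python/numpy (the sample size, coefficients, and the uniform error distribution are illustrative choices, not part of the slides; uniform errors emphasize that Normality is not required):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3                      # illustrative sample size and dimension

x = rng.normal(size=(n, d))        # covariates x_i in R^d
beta = np.array([2.0, -1.0, 0.5])  # illustrative true parameter vector

# i.i.d. errors with mean 0 and constant variance sigma^2; a uniform
# distribution is used here since Normality is not assumed by the model
sigma = 0.8
eps = rng.uniform(-np.sqrt(3) * sigma, np.sqrt(3) * sigma, size=n)

y = x @ beta + eps                 # y_i = x_i^T beta + eps_i
```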


About Linear Models

Linear models have been a mainstay of statistics for the past 30 years and remain one of the most important tools.

The covariates may come from different sources:

- quantitative inputs; dummy coding of qualitative inputs
- transformed inputs: $\log(X)$, $X^2$, $\sqrt{X}$, ...
- basis expansions: $X_1, X_1^2, X_1^3, \ldots$ (polynomial representation)
- interactions between variables: $X_1 X_2$, ...


Review on Matrix Theory (I)

Let $A$ be an $m \times m$ matrix, and let $I_m$ be the identity matrix of size $m$.

- The determinant of $A$ is $\det(A) = |A|$. The trace of $A$ is $\mathrm{tr}(A)$, the sum of the diagonal elements.

- The roots $\lambda_1, \cdots, \lambda_m$ of the $m$th-degree polynomial equation in $\lambda$,

  $|\lambda I_m - A| = 0$,

  are called the eigenvalues of $A$. The collection $\{\lambda_1, \cdots, \lambda_m\}$ is called the spectrum of $A$.

- Any nonzero $m \times 1$ vector $x_i \neq 0$ such that

  $A x_i = \lambda_i x_i$

  is an eigenvector of $A$ corresponding to the eigenvalue $\lambda_i$.

- If $B$ is another $m \times m$ matrix, then

  $|AB| = |A||B|$, $\quad \mathrm{tr}(AB) = \mathrm{tr}(BA)$.


Review on Matrix Theory (II)

The following are equivalent:

- $|A| \neq 0$
- $\mathrm{rank}(A) = m$
- $A^{-1}$ exists.


Orthogonal Matrix

An $m \times m$ matrix $A$ is symmetric if

$A' = A$.

An $m \times m$ matrix $P$ is called an orthogonal matrix if

$PP' = P'P = I_m$, or $P^{-1} = P'$.

If $P$ is an orthogonal matrix, then

- $|PP'| = |P||P'| = |P|^2 = |I| = 1$, so $|P| = \pm 1$.

- For any $m \times m$ matrix $A$, we have $\mathrm{tr}(PAP') = \mathrm{tr}(AP'P) = \mathrm{tr}(A)$.

- $PAP'$ and $A$ have the same eigenvalues, since

  $|\lambda I_m - PAP'| = |\lambda PP' - PAP'| = |P|^2\,|\lambda I_m - A| = |\lambda I_m - A|$.


Spectral Decomposition of Symmetric Matrix

For any $m \times m$ symmetric matrix $A$, there exists an orthogonal matrix $P$ such that

$P'AP = \Lambda = \mathrm{diag}\{\lambda_1, \cdots, \lambda_m\}$,

where the $\lambda_i$'s are the eigenvalues of $A$. The corresponding eigenvectors of $A$ are the column vectors of $P$.

- Denote the $m \times 1$ unit vectors by $e_i$, $i = 1, \cdots, m$, where $e_i$ has 1 in the $i$th position and zeros elsewhere.

- The $i$th column of $P$ is $p_i = P e_i$, $i = 1, \cdots, m$. Note $PP' = \sum_{i=1}^m p_i p_i' = I_m$.

- The spectral decomposition of $A$ is

  $A = P\Lambda P' = \sum_{i=1}^m \lambda_i p_i p_i'$.

- $\mathrm{tr}(A) = \mathrm{tr}(\Lambda) = \sum_{i=1}^m \lambda_i$ and $|A| = |\Lambda| = \prod_{i=1}^m \lambda_i$.
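As a quick numerical check (a sketch in Python/numpy; the symmetric matrix A below is only an illustrative example), the spectral decomposition and the trace/determinant identities can be verified as follows:

```python
import numpy as np

# Illustrative symmetric matrix
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 2.0],
              [0.0, 2.0, 5.0]])

# eigh returns the eigenvalues and an orthogonal matrix P whose columns
# are the corresponding eigenvectors
lam, P = np.linalg.eigh(A)

assert np.allclose(P @ P.T, np.eye(3))              # P is orthogonal
assert np.allclose(A, P @ np.diag(lam) @ P.T)       # A = P Lambda P'
assert np.isclose(np.trace(A), lam.sum())           # tr(A) = sum of eigenvalues
assert np.isclose(np.linalg.det(A), lam.prod())     # |A| = product of eigenvalues
```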


Idempotent Matrices

An $m \times m$ matrix $A$ is idempotent if

$A^2 = AA = A$.

The eigenvalues of an idempotent matrix are either zero or one:

$\lambda x = Ax = A(Ax) = A(\lambda x) = \lambda^2 x \implies \lambda = \lambda^2$.

A symmetric idempotent matrix $A$ is also referred to as a projection matrix. For any $x \in \mathbb{R}^m$,

- the vector $y = Ax$ is the orthogonal projection of $x$ onto the subspace of $\mathbb{R}^m$ generated by the column vectors of $A$;

- the vector $z = (I - A)x$ is the orthogonal projection of $x$ onto the complementary subspace, such that

  $x = y + z = Ax + (I - A)x$.


Projection Matrices

If $A$ is symmetric and idempotent, then

- if $\mathrm{rank}(A) = r$, then $A$ has $r$ eigenvalues equal to 1 and $m - r$ zero eigenvalues;

- $\mathrm{tr}(A) = \mathrm{rank}(A)$;

- $I_m - A$ is also symmetric and idempotent, of rank $m - r$.


Matrix Notations for Linear Regression

- The response vector $y = (y_1, \cdots, y_n)^T$.

- The design matrix $X$. Assume the first column of $X$ is $\mathbf{1}$. The dimension of $X$ is $n \times (1 + d)$.

- The regression coefficient vector $\beta = (\beta_0, \beta_1, \cdots, \beta_d)^T$.

- The error vector $\varepsilon = (\varepsilon_1, \cdots, \varepsilon_n)^T$.

The linear model is written as

$y = X\beta + \varepsilon$,

with the estimated coefficients $\hat\beta$ and the predicted response $\hat{y} = X\hat\beta$.


Ordinary Least Squares (OLS)

The most popular method for fitting the linear model is ordinary least squares (OLS):

$\min_{\beta} \; \mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$.

- Normal equations: $X^T(y - X\beta) = 0$.

- $\hat\beta = (X^TX)^{-1}X^Ty$ and $\hat{y} = X(X^TX)^{-1}X^Ty$.

- The residual vector is $r = y - \hat{y} = (I - P_X)y$.

- The residual sum of squares is $\mathrm{RSS} = r^T r$.
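Below is a minimal Python/numpy sketch of the OLS computation (the data are simulated purely for illustration; forming the normal equations directly is fine here, though the computation section later discusses more stable alternatives):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3

# Simulated design with an intercept column of ones (illustrative data)
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Solve the normal equations X'X beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat          # fitted values
r = y - y_hat                 # residual vector
RSS = r @ r                   # residual sum of squares
```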


Projection Matrix

Call the following square matrix the projection or hat matrix:

$P_X = X(X^TX)^{-1}X^T$.

Properties:

- symmetric and non-negative definite;

- idempotent: $P_X^2 = P_X$; the eigenvalues are 0's and 1's;

- $X^T P_X = X^T$, $\quad X^T(I - P_X) = 0$.

We have

$r = (I - P_X)y$, $\quad \mathrm{RSS} = y^T(I - P_X)y$.

Note that

$X^T r = X^T(I - P_X)y = 0$.

The residual vector is orthogonal to the column space spanned by $X$, $\mathrm{col}(X)$.
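These properties can be checked numerically; the following sketch builds $P_X$ explicitly, which is reasonable only for a small illustrative example (it is an $n \times n$ matrix and is rarely formed in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # illustrative design
y = rng.normal(size=n)

# Hat (projection) matrix P_X = X (X'X)^{-1} X'
P_X = X @ np.linalg.solve(X.T @ X, X.T)
r = (np.eye(n) - P_X) @ y                      # residual vector

assert np.allclose(P_X, P_X.T)                 # symmetric
assert np.allclose(P_X @ P_X, P_X)             # idempotent
ev = np.linalg.eigvalsh(P_X)
assert np.allclose(ev * (1 - ev), 0)           # eigenvalues are 0's and 1's
assert np.allclose(X.T @ r, 0)                 # residuals orthogonal to col(X)
assert np.isclose(r @ r, y @ (np.eye(n) - P_X) @ y)   # RSS = y'(I - P_X) y
```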


[Figure 3.2 of Hastie, Tibshirani & Friedman (2001), Chapter 3: The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2. The projection ŷ represents the vector of the least squares predictions.]


Sampling Properties of $\hat\beta$

$\mathrm{Var}(\hat\beta) = \sigma^2 (X^TX)^{-1}$.

The variance $\sigma^2$ can be estimated as

$\hat\sigma^2 = \mathrm{RSS}/(n - d - 1)$.

This is an unbiased estimator, i.e., $E(\hat\sigma^2) = \sigma^2$.
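A small Monte Carlo sketch (with simulated data; the design, coefficients, and noise level are illustrative) suggests the unbiasedness numerically: averaging $\hat\sigma^2$ over repeated samples gives a value close to the true $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 50, 3, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # fixed design
beta = np.array([1.0, 2.0, -1.0, 0.5])

sigma2_hats = []
for _ in range(2000):                     # repeated samples from the same model
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    r = y - X @ beta_hat
    sigma2_hats.append(r @ r / (n - d - 1))     # sigma2_hat = RSS / (n - d - 1)

print(np.mean(sigma2_hats))               # approximately sigma**2 = 0.49
```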


Inferences for Gaussian Errors

Under the Normal assumption on the error ε, we have

- $\hat\beta \sim N(\beta, \sigma^2(X^TX)^{-1})$

- $(n - d - 1)\hat\sigma^2 \sim \sigma^2 \chi^2_{n-d-1}$

- $\hat\beta$ is independent of $\hat\sigma^2$

To test $H_0: \beta_j = 0$, we use

- if $\sigma^2$ is known, $z_j = \dfrac{\hat\beta_j}{\sigma\sqrt{v_j}}$, which has a standard Normal distribution under $H_0$;

- if $\sigma^2$ is unknown, $t_j = \dfrac{\hat\beta_j}{\hat\sigma\sqrt{v_j}}$, which has a $t_{n-d-1}$ distribution under $H_0$;

where $v_j$ is the $j$th diagonal element of $(X^TX)^{-1}$.
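A hedged sketch of these coefficient t-tests in Python (simulated data with illustrative coefficients; scipy.stats supplies the $t_{n-d-1}$ tail probabilities):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, d = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
beta = np.array([0.5, 1.5, 0.0])          # last coefficient is truly zero
y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
r = y - X @ beta_hat
sigma2_hat = r @ r / (n - d - 1)

v = np.diag(XtX_inv)                      # v_j = jth diagonal element of (X'X)^{-1}
t_stats = beta_hat / np.sqrt(sigma2_hat * v)
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - d - 1)
print(t_stats, p_values)
```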


Confidence Interval for Individual Coefficients

Under the Normal assumption, the $100(1-\alpha)\%$ C.I. of $\beta_j$ is

$\hat\beta_j \pm t_{n-d-1;\,\alpha/2}\, \hat\sigma\sqrt{v_j}$,

where $t_{k;\nu}$ is the $1 - \nu$ percentile of the $t_k$ distribution.

In practice, we use the approximate $100(1-\alpha)\%$ C.I. of $\beta_j$:

$\hat\beta_j \pm z_{\alpha/2}\, \hat\sigma\sqrt{v_j}$,

where $z_{\alpha/2}$ is the $1 - \alpha/2$ percentile of the standard Normal distribution.

Even if the Gaussian assumption does not hold, this interval is approximately correct, with coverage probability approaching $1 - \alpha$ as $n \to \infty$.


Review on Multivariate Normal Distributions

Distributions of quadratic forms (non-central $\chi^2$):

- If $X \sim N_p(\mu, I_p)$, then

  $W = X^TX = \sum_{i=1}^p X_i^2 \sim \chi^2_p(\lambda)$, with $\lambda = \frac{1}{2}\mu^T\mu$.

  Special case: if $X \sim N_p(0, I_p)$, then $W = X^TX \sim \chi^2_p$.

- If $X \sim N_p(\mu, V)$ where $V$ is nonsingular, then

  $W = X^T V^{-1} X \sim \chi^2_p(\lambda)$, with $\lambda = \frac{1}{2}\mu^T V^{-1}\mu$.

- If $X \sim N_p(\mu, V)$ with $V$ nonsingular, and $A$ is symmetric with $AV$ idempotent of rank $s$, then

  $W = X^T A X \sim \chi^2_s(\lambda)$, with $\lambda = \frac{1}{2}\mu^T A \mu$.


Cochran’s Theorem

Let $y \sim N_n(\mu, \sigma^2 I_n)$ and let $A_j$, $j = 1, \cdots, J$, be symmetric idempotent matrices with rank $s_j$. Furthermore, assume that $\sum_{j=1}^J A_j = I_n$ and $\sum_{j=1}^J s_j = n$. Then

(i) $W_j = \frac{1}{\sigma^2} y^T A_j y \sim \chi^2_{s_j}(\lambda_j)$, where $\lambda_j = \frac{1}{2\sigma^2}\mu^T A_j \mu$;

(ii) the $W_j$'s are mutually independent of each other.

Essentially, we decompose $y^T y$ into the (scaled) sum of quadratic forms:

$\sum_{i=1}^n y_i^2 = y^T I_n y = \sum_{j=1}^J y^T A_j y$.


Application of Cochran’s Theorem to Linear Models

Example: Assume $y \sim N_n(X\beta, \sigma^2 I_n)$. Define $A = I - P_X$ and

- the residual sum of squares: $\mathrm{RSS} = y^T A y = \|r\|^2$;

- the sum of squares regression: $\mathrm{SSR} = y^T P_X y = \|\hat{y}\|^2$.

By Cochran's Theorem, we have

(i) $\mathrm{RSS}/\sigma^2 \sim \chi^2_{n-d-1}$ and $\mathrm{SSR}/\sigma^2 \sim \chi^2_{d+1}(\lambda)$, where $\lambda = (X\beta)^T(X\beta)/(2\sigma^2)$;

(ii) $\mathrm{RSS}$ is independent of $\mathrm{SSR}$. (Note $r \perp \hat{y}$.)


F Distribution

- If $U_1 \sim \chi^2_p$, $U_2 \sim \chi^2_q$, and $U_1 \perp U_2$, then

  $F = \dfrac{U_1/p}{U_2/q} \sim F_{p,q}$.

- If $U_1 \sim \chi^2_p(\lambda)$, $U_2 \sim \chi^2_q$, and $U_1 \perp U_2$, then

  $F = \dfrac{U_1/p}{U_2/q} \sim F_{p,q}(\lambda)$ (noncentral $F$).

Example: Assume $y \sim N_n(X\beta, \sigma^2 I_n)$. Let $A = I - P_X$, and

$\mathrm{RSS} = y^T A y = \|r\|^2$, $\quad \mathrm{SSR} = y^T P_X y = \|\hat{y}\|^2$.

Then

$F = \dfrac{\mathrm{SSR}/(d+1)}{\mathrm{RSS}/(n-d-1)} \sim F_{d+1,\,n-d-1}(\lambda)$, $\quad \lambda = \|X\beta\|^2/(2\sigma^2)$.


Making Inferences about Multiple Parameters

Assume $X = [X_0, X_1]$, where $X_0$ consists of the first $k$ columns. Correspondingly, $\beta = [\beta_0', \beta_1']'$. To test $H_0: \beta_0 = 0$, we use

$F = \dfrac{(\mathrm{RSS}_1 - \mathrm{RSS})/k}{\mathrm{RSS}/(n-d-1)}$,

where

- $\mathrm{RSS}_1 = y^T(I - P_{X_1})y$ (reduced model);

- $\mathrm{RSS} = y^T(I - P_X)y$ (full model);

- $\mathrm{RSS} \sim \sigma^2\chi^2_{n-d-1}$;

- $\mathrm{RSS}_1 - \mathrm{RSS} = y^T(P_X - P_{X_1})y$.


Testing Multiple Parameters

Applying Cochran's Theorem to $\mathrm{RSS}_1$, $\mathrm{RSS}$, and $\mathrm{RSS}_1 - \mathrm{RSS}$:

- they are independent;

- they respectively follow (noncentral) $\chi^2$ distributions, with noncentralities $(X\beta)^T(I - P_{X_1})(X\beta)/(2\sigma^2)$, $0$, and $(X\beta)^T(P_X - P_{X_1})(X\beta)/(2\sigma^2)$.

Then we have

$F \sim F_{k,\,n-d-1}(\lambda)$, with $\lambda = (X\beta)^T(P_X - P_{X_1})(X\beta)/(2\sigma^2)$.

Under $H_0$, we have $X\beta = X_1\beta_1$, so $F \sim F_{k,\,n-d-1}$.


Nested Model Selection

To test for the significance of groups of coefficients simultaneously, we use the $F$-statistic

$F = \dfrac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(d_1 - d_0)}{\mathrm{RSS}_1/(n - d_1 - 1)}$,

where

- $\mathrm{RSS}_1$ is the RSS for the bigger model with $d_1 + 1$ parameters;

- $\mathrm{RSS}_0$ is the RSS for the nested smaller model with $d_0 + 1$ parameters, i.e., with $d_1 - d_0$ parameters constrained to zero.

The $F$-statistic measures the change in RSS per additional parameter in the bigger model, normalized by an estimate of $\sigma^2$.

Under the assumption that the smaller model is correct, $F \sim F_{d_1 - d_0,\, n - d_1 - 1}$.
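A sketch of this nested-model F-test on simulated data (the designs and coefficients are illustrative; scipy.stats gives the $F_{d_1-d_0,\,n-d_1-1}$ tail probability):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, d0, d1 = 100, 2, 4
X0 = np.column_stack([np.ones(n), rng.normal(size=(n, d0))])   # smaller model
X1 = np.column_stack([X0, rng.normal(size=(n, d1 - d0))])      # bigger (nested) model
beta = np.array([1.0, 2.0, -1.0, 0.0, 0.0])                    # extra coefficients truly zero
y = X1 @ beta + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares of the OLS fit of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r

RSS0, RSS1 = rss(X0, y), rss(X1, y)
F = ((RSS0 - RSS1) / (d1 - d0)) / (RSS1 / (n - d1 - 1))
p_value = stats.f.sf(F, d1 - d0, n - d1 - 1)
print(F, p_value)
```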


Confidence Set

The approximate confidence set of $\beta$ is

$C_\beta = \{\beta \mid (\hat\beta - \beta)^T (X^TX)(\hat\beta - \beta) \le \hat\sigma^2 \chi^2_{d+1;\,1-\alpha}\}$,

where $\chi^2_{k;\,1-\alpha}$ is the $1-\alpha$ percentile of the $\chi^2_k$ distribution.

The confidence interval for the true function $f(x) = x^T\beta$ is

$\{x^T\beta \mid \beta \in C_\beta\}$.


Gauss-Markov Theorem

Assume $s^T\beta$ is linearly estimable, i.e., there exists a linear estimator $b + c^Ty$ such that $E(b + c^Ty) = s^T\beta$.

- A function $s^T\beta$ is linearly estimable iff $s = X^Ta$ for some $a$.

Theorem: If $s^T\beta$ is linearly estimable, then $s^T\hat\beta$ is the best linear unbiased estimator (BLUE) of $s^T\beta$:

- For any $c^Ty$ satisfying $E(c^Ty) = s^T\beta$, we have $\mathrm{Var}(s^T\hat\beta) \le \mathrm{Var}(c^Ty)$.

- $s^T\hat\beta$ is the best among all the unbiased estimators. (It is a function of the complete and sufficient statistic $(y^Ty, X^Ty)$.)

Question: Is it possible to find a slightly biased linear estimator but with smaller variance? (Trade a little bias for a large reduction in variance.)


Linear Regression with Orthogonal Design

If $X$ is univariate, the least squares estimate is

$\hat\beta = \dfrac{\sum_i x_i y_i}{\sum_i x_i^2} = \dfrac{\langle x, y\rangle}{\langle x, x\rangle}$.

If $X = [x_1, \ldots, x_d]$ has orthogonal columns, i.e.,

$\langle x_j, x_k\rangle = 0, \ \forall j \neq k$,

or equivalently $X^TX = \mathrm{diag}(\|x_1\|^2, \ldots, \|x_d\|^2)$, the OLS estimates are given as

$\hat\beta_j = \dfrac{\langle x_j, y\rangle}{\langle x_j, x_j\rangle}$ for $j = 1, \ldots, d$.

Each input has no effect on the estimation of the other parameters: multiple linear regression reduces to a collection of univariate regressions.
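A short sketch (with an artificially orthogonalized design, for illustration only) confirming that the joint OLS fit coincides with the column-by-column univariate fits when the columns are orthogonal:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 60, 3

# Build a design with exactly orthogonal columns via a QR factorization
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
X = Q                                        # orthonormal, hence orthogonal, columns
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Joint OLS fit
beta_joint = np.linalg.solve(X.T @ X, X.T @ y)

# One-at-a-time univariate fits: <x_j, y> / <x_j, x_j>
beta_univ = np.array([(X[:, j] @ y) / (X[:, j] @ X[:, j]) for j in range(d)])

assert np.allclose(beta_joint, beta_univ)
```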


How to orthogonalize X?

Consider the simple linear regression $y = \beta_0\mathbf{1} + \beta_1 x + \varepsilon$. We regress $x$ onto $\mathbf{1}$ and obtain the residual

$z = x - \bar{x}\mathbf{1}$.

Orthogonalization Process:

- The residual $z$ is orthogonal to the regressor $\mathbf{1}$.

- The column space of $X$ is $\mathrm{span}\{\mathbf{1}, x\}$. Note: $\hat{y} \in \mathrm{span}\{\mathbf{1}, x\} = \mathrm{span}\{\mathbf{1}, z\}$, because

  $\beta_0\mathbf{1} + \beta_1 x = \beta_0\mathbf{1} + \beta_1[\bar{x}\mathbf{1} + (x - \bar{x}\mathbf{1})] = \beta_0\mathbf{1} + \beta_1[\bar{x}\mathbf{1} + z] = (\beta_0 + \beta_1\bar{x})\mathbf{1} + \beta_1 z = \eta_0\mathbf{1} + \beta_1 z$.

- $\{\mathbf{1}, z\}$ forms an orthogonal basis for the column space of $X$.


How to orthogonalize X? (continued)

Estimation Process:

- First, we regress $y$ onto $z$ for the OLS estimate of the slope $\beta_1$:

  $\hat\beta_1 = \dfrac{\langle y, z\rangle}{\langle z, z\rangle} = \dfrac{\langle y, x - \bar{x}\mathbf{1}\rangle}{\langle x - \bar{x}\mathbf{1},\, x - \bar{x}\mathbf{1}\rangle}$.

- Second, we regress $y$ onto $\mathbf{1}$ and get the coefficient $\hat\eta_0 = \bar{y}$.

The OLS fit is given as

$\hat{y} = \hat\eta_0\mathbf{1} + \hat\beta_1 z = \hat\eta_0\mathbf{1} + \hat\beta_1(x - \bar{x}\mathbf{1}) = (\hat\eta_0 - \hat\beta_1\bar{x})\mathbf{1} + \hat\beta_1 x$.

Therefore, the OLS intercept is obtained as

$\hat\beta_0 = \hat\eta_0 - \hat\beta_1\bar{x} = \bar{y} - \hat\beta_1\bar{x}$.
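A quick numerical check of this two-step recipe against the direct OLS fit (simulated data with illustrative coefficients):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.4, size=n)

# Orthogonalization: residual of x regressed on the intercept column of ones
z = x - x.mean()
beta1 = (y @ z) / (z @ z)                 # slope from regressing y on z
beta0 = y.mean() - beta1 * x.mean()       # intercept: eta0_hat - beta1_hat * xbar

# Direct OLS on [1, x] for comparison
X = np.column_stack([np.ones(n), x])
assert np.allclose([beta0, beta1], np.linalg.solve(X.T @ X, X.T @ y))
```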


[Figure 3.4 of Hastie, Tibshirani & Friedman (2001), Chapter 3: Least squares regression by orthogonalization of the inputs. The vector x2 is regressed on the vector x1, leaving the residual vector z. The regression of y on z gives the multiple regression coefficient of x2. Adding together the projections of y on each of x1 and z gives the least squares fit ŷ.]


How to orthogonalize X? (d = 2)

Consider $y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$.

Orthogonalization process:

1. We regress $x_2$ onto $x_1$ and compute the residual

   $z_1 = x_2 - \gamma_{12} x_1$. (Note $z_1 \perp x_1$.)

2. We regress $x_3$ onto $(x_1, z_1)$ and compute the residual

   $z_2 = x_3 - \gamma_{13} x_1 - \gamma_{23} z_1$. (Note $z_2 \perp \{x_1, z_1\}$.)

Note: $\mathrm{span}\{x_1, x_2, x_3\} = \mathrm{span}\{x_1, z_1, z_2\}$, because

$\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 = \beta_1 x_1 + \beta_2(\gamma_{12} x_1 + z_1) + \beta_3(\gamma_{13} x_1 + \gamma_{23} z_1 + z_2) = (\beta_1 + \beta_2\gamma_{12} + \beta_3\gamma_{13}) x_1 + (\beta_2 + \beta_3\gamma_{23}) z_1 + \beta_3 z_2 = \eta_1 x_1 + \eta_2 z_1 + \beta_3 z_2$.


Estimation Process

We project $y$ onto the orthogonal basis $\{x_1, z_1, z_2\}$ one by one, and then recover the coefficients corresponding to the original columns of $X$.

- First, we regress $y$ onto $z_2$ for the OLS estimate of the slope $\beta_3$:

  $\hat\beta_3 = \dfrac{\langle y, z_2\rangle}{\langle z_2, z_2\rangle}$.

- Second, we regress $y$ onto $z_1$, leading to the coefficient $\hat\eta_2$, and

  $\hat\beta_2 = \hat\eta_2 - \hat\beta_3\gamma_{23}$.

- Third, we regress $y$ onto $x_1$, leading to the coefficient $\hat\eta_1$, and

  $\hat\beta_1 = \hat\eta_1 - \hat\beta_3\gamma_{13} - \hat\beta_2\gamma_{12}$.


Gram-Schmidt Procedure (Successive Orthogonalization)

1. Initialize $z_0 = x_0 = \mathbf{1}$.

2. For $j = 1, \ldots, d$: regress $x_j$ on $z_0, z_1, \ldots, z_{j-1}$ to produce the coefficients

   $\gamma_{kj} = \dfrac{\langle z_k, x_j\rangle}{\langle z_k, z_k\rangle}$ for $k = 0, \ldots, j-1$,

   and the residual vector $z_j = x_j - \sum_{k=0}^{j-1}\gamma_{kj} z_k$. ($\{z_0, z_1, \ldots, z_{j-1}\}$ are orthogonal.)

3. Regress $y$ on the residual $z_d$ to get

   $\hat\beta_d = \dfrac{\langle y, z_d\rangle}{\langle z_d, z_d\rangle}$.

4. Compute $\hat\beta_j$, $j = d-1, \cdots, 0$, in that order successively.

- $\{z_0, z_1, \ldots, z_d\}$ forms an orthogonal basis for $\mathrm{col}(X)$.

- The multiple regression coefficient $\hat\beta_j$ represents the additional contribution of $x_j$ to $y$, after $x_j$ has been adjusted for $x_0, x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_d$.
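A sketch of the procedure in Python/numpy (the back-substitution in step 4 is implemented here by solving the triangular system $\Gamma\hat\beta = \hat\eta$, where $\hat\eta_j = \langle z_j, y\rangle/\langle z_j, z_j\rangle$; the data are simulated for illustration):

```python
import numpy as np

def successive_orthogonalization(X, y):
    """Gram-Schmidt regression sketch; X is assumed to contain the column of ones."""
    n, p = X.shape                          # p = d + 1 columns including the intercept
    Z = np.zeros((n, p))
    Gamma = np.eye(p)                       # upper triangular, 1's on the diagonal
    for j in range(p):
        Z[:, j] = X[:, j]
        for k in range(j):
            Gamma[k, j] = (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k])
            Z[:, j] -= Gamma[k, j] * Z[:, k]
    # Coefficients of y on the orthogonal z_j's, then beta = Gamma^{-1} eta
    eta = np.array([(Z[:, j] @ y) / (Z[:, j] @ Z[:, j]) for j in range(p)])
    beta = np.linalg.solve(Gamma, eta)
    return beta, Z, Gamma

# Quick check against the normal-equation solution (simulated data)
rng = np.random.default_rng(6)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 3))])
y = X @ np.array([1.0, 0.5, -2.0, 1.5]) + rng.normal(size=40)
beta_gs, Z, Gamma = successive_orthogonalization(X, y)
assert np.allclose(beta_gs, np.linalg.solve(X.T @ X, X.T @ y))
```

Solving the triangular system from the bottom up reproduces step 4: $\hat\beta_d = \hat\eta_d$ first, then $\hat\beta_{d-1}$, and so on.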


Collinearity Issue

The $d$th coefficient is

$\hat\beta_d = \dfrac{\langle z_d, y\rangle}{\langle z_d, z_d\rangle}$.

If $x_d$ is highly correlated with some of the other $x_j$'s, then

- the residual vector $z_d$ is close to zero;

- the coefficient $\hat\beta_d$ will be very unstable.

The variance of the estimate is

$\mathrm{Var}(\hat\beta_d) = \dfrac{\sigma^2}{\|z_d\|^2}$.

The precision for estimating $\beta_d$ depends on the length of $z_d$, i.e., on how much of $x_d$ is unexplained by the other $x_k$'s.


Two Computational Algorithms For Multiple Regression

Consider the normal equations

$X^TX\beta = X^Ty$.

We would like to avoid computing $(X^TX)^{-1}$ directly.

1. QR decomposition of $X$: $X = QR$, where $Q$ has orthonormal columns and $R$ is upper triangular. Essentially, a process of orthogonal matrix triangularization.

2. Cholesky decomposition of $X^TX$: $X^TX = RR^T$, where $R$ is lower triangular.


Matrix Formulation of Orthogonalization

In Step 2 of the Gram-Schmidt procedure, for $j = 1, \ldots, d$,

$z_j = x_j - \sum_{k=0}^{j-1}\gamma_{kj} z_k \implies x_j = \sum_{k=0}^{j-1}\gamma_{kj} z_k + z_j$.

In matrix form, with $X = [x_1, \ldots, x_d]$ and $Z = [z_1, \ldots, z_d]$,

$X = Z\Gamma$.

- The columns of $Z$ are orthogonal to each other.

- The matrix $\Gamma$ is upper triangular, with 1's on the diagonal.

Standardizing $Z$ using $D = \mathrm{diag}\{\|z_1\|, \ldots, \|z_d\|\}$,

$X = Z\Gamma = ZD^{-1}D\Gamma \equiv QR$, with $Q = ZD^{-1}$, $R = D\Gamma$.


QR Decomposition

- The columns of $Q$ form an orthonormal basis for the column space of $X$.

- $Q$ is an $n \times d$ matrix with orthonormal columns, satisfying $Q^TQ = I$.

- $R$ is a $d \times d$ upper triangular matrix of full rank.

- $X^TX = (QR)^T(QR) = R^TQ^TQR = R^TR$.

The least squares solutions are

$\hat\beta = (X^TX)^{-1}X^Ty = R^{-1}R^{-T}R^TQ^Ty = R^{-1}Q^Ty$,

$\hat{y} = X\hat\beta = (QR)(R^{-1}Q^Ty) = QQ^Ty$.


QR Algorithm for Normal Equations

Regard $\hat\beta$ as the solution of the linear system

$R\beta = Q^Ty$.

1. Conduct the QR decomposition of $X = QR$ (Gram-Schmidt orthogonalization).

2. Compute $Q^Ty$.

3. Solve the triangular system $R\beta = Q^Ty$.

The computational complexity is of order $nd^2$.
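A minimal sketch of these three steps with numpy/scipy (simulated data; np.linalg.qr returns the thin factorization $X = QR$ used here):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(7)
n, d = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(size=n)

# Step 1: thin QR decomposition of X
Q, R = np.linalg.qr(X)         # Q: n x (d+1), R: (d+1) x (d+1) upper triangular

# Step 2: compute Q'y
Qty = Q.T @ y

# Step 3: solve the triangular system R beta = Q'y by back substitution
beta_hat = solve_triangular(R, Qty)

assert np.allclose(beta_hat, np.linalg.solve(X.T @ X, X.T @ y))
```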


Cholesky Decomposition Algorithm

For any positive definite square matrix $A$, we have

$A = RR^T$,

where $R$ is a lower triangular matrix of full rank.

1. Compute $X^TX$ and $X^Ty$.

2. Factor $X^TX = RR^T$; then $\hat\beta = (R^T)^{-1}R^{-1}X^Ty$.

3. Solve the triangular system $Rw = X^Ty$ for $w$.

4. Solve the triangular system $R^T\beta = w$ for $\hat\beta$.

The computational complexity is of order $d^3 + nd^2/2$ (can be faster than QR for small $d$, but can be less stable).

For a new point $x_0$, the prediction variance is $\mathrm{Var}(\hat{y}_0) = \mathrm{Var}(x_0^T\hat\beta) = \sigma^2\, x_0^T(R^T)^{-1}R^{-1}x_0$.
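A matching sketch of the Cholesky route (simulated data; np.linalg.cholesky returns the lower triangular factor $R$ with $X^TX = RR^T$, and the two triangular solves mirror steps 3 and 4):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(8)
n, d = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(size=n)

# Step 1: form X'X and X'y
XtX, Xty = X.T @ X, X.T @ y

# Step 2: Cholesky factor X'X = R R^T, with R lower triangular
R = np.linalg.cholesky(XtX)

# Step 3: solve R w = X'y for w (forward substitution)
w = solve_triangular(R, Xty, lower=True)

# Step 4: solve R^T beta = w for beta (back substitution)
beta_hat = solve_triangular(R.T, w, lower=False)

assert np.allclose(beta_hat, np.linalg.solve(XtX, Xty))
```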
