Lecture 11: Regression Methods I (Linear Regression)

Hao Helen Zhang

Fall, 2017



Outline

1. Regression: Supervised Learning with Continuous Responses

2. Linear Models and Multiple Linear Regression
   - Ordinary Least Squares
   - Statistical inferences
   - Computational algorithms


Regression Models

If the response Y takes real values, we refer to this type of supervised learning problem as a regression problem.

- linear regression models (parametric models)

- nonparametric regression (splines, kernel estimators, local polynomial regression)

- semiparametric regression

Broad coverage:

penalized regression, regression trees, support vector regression, quantile regression


Linear Regression Models

A standard linear regression model assumes

$y_i = x_i^T\beta + \varepsilon_i$, where the $\varepsilon_i$ are i.i.d. with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$,

$y_i$ is the response for the $i$th observation, $x_i \in \mathbb{R}^d$ is the covariate vector,

$\beta \in \mathbb{R}^d$ is the $d$-dimensional parameter vector.

Common model assumptions:

independence of errors

constant error variance (homoscedasticity)

ε independent of X.

Normality is not needed.
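As a concrete illustration, here is a minimal simulation sketch of this model in Python/numpy (the sample size, coefficients, and the uniform error distribution are illustrative choices, not part of the slides; uniform errors emphasize that Normality is not required):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3                      # illustrative sample size and dimension

x = rng.normal(size=(n, d))        # covariates x_i in R^d
beta = np.array([2.0, -1.0, 0.5])  # illustrative true parameter vector

# i.i.d. errors with mean 0 and constant variance sigma^2; a uniform
# distribution is used here since Normality is not assumed by the model
sigma = 0.8
eps = rng.uniform(-np.sqrt(3) * sigma, np.sqrt(3) * sigma, size=n)

y = x @ beta + eps                 # y_i = x_i^T beta + eps_i
```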


About Linear Models

Linear models have been a mainstay of statistics for the past 30 years and remain one of the most important tools.

The covariates may come from different sources:

- quantitative inputs; dummy coding of qualitative inputs
- transformed inputs: $\log(X)$, $X^2$, $\sqrt{X}$, ...
- basis expansions: $X_1, X_1^2, X_1^3, \ldots$ (polynomial representation)
- interactions between variables: $X_1 X_2$, ...


Review on Matrix Theory (I)

Let $A$ be an $m \times m$ matrix, and let $I_m$ be the identity matrix of size $m$.

- The determinant of $A$ is $\det(A) = |A|$. The trace of $A$ is $\mathrm{tr}(A)$, the sum of the diagonal elements.

- The roots $\lambda_1, \cdots, \lambda_m$ of the $m$th-degree polynomial equation in $\lambda$,

  $|\lambda I_m - A| = 0$,

  are called the eigenvalues of $A$. The collection $\{\lambda_1, \cdots, \lambda_m\}$ is called the spectrum of $A$.

- Any nonzero $m \times 1$ vector $x_i \neq 0$ such that

  $A x_i = \lambda_i x_i$

  is an eigenvector of $A$ corresponding to the eigenvalue $\lambda_i$.

- If $B$ is another $m \times m$ matrix, then

  $|AB| = |A||B|$, $\quad \mathrm{tr}(AB) = \mathrm{tr}(BA)$.


Review on Matrix Theory (II)

The following are equivalent:

- $|A| \neq 0$
- $\mathrm{rank}(A) = m$
- $A^{-1}$ exists.


Orthogonal Matrix

An $m \times m$ matrix $A$ is symmetric if

$A' = A$.

An $m \times m$ matrix $P$ is called an orthogonal matrix if

$PP' = P'P = I_m$, or $P^{-1} = P'$.

If $P$ is an orthogonal matrix, then

- $|PP'| = |P||P'| = |P|^2 = |I| = 1$, so $|P| = \pm 1$.

- For any $m \times m$ matrix $A$, we have $\mathrm{tr}(PAP') = \mathrm{tr}(AP'P) = \mathrm{tr}(A)$.

- $PAP'$ and $A$ have the same eigenvalues, since

  $|\lambda I_m - PAP'| = |\lambda PP' - PAP'| = |P|^2\,|\lambda I_m - A| = |\lambda I_m - A|$.


Spectral Decomposition of Symmetric Matrix

For any $m \times m$ symmetric matrix $A$, there exists an orthogonal matrix $P$ such that

$P'AP = \Lambda = \mathrm{diag}\{\lambda_1, \cdots, \lambda_m\}$,

where the $\lambda_i$'s are the eigenvalues of $A$. The corresponding eigenvectors of $A$ are the column vectors of $P$.

- Denote the $m \times 1$ unit vectors by $e_i$, $i = 1, \cdots, m$, where $e_i$ has 1 in the $i$th position and zeros elsewhere.

- The $i$th column of $P$ is $p_i = P e_i$, $i = 1, \cdots, m$. Note $PP' = \sum_{i=1}^m p_i p_i' = I_m$.

- The spectral decomposition of $A$ is

  $A = P\Lambda P' = \sum_{i=1}^m \lambda_i p_i p_i'$.

- $\mathrm{tr}(A) = \mathrm{tr}(\Lambda) = \sum_{i=1}^m \lambda_i$ and $|A| = |\Lambda| = \prod_{i=1}^m \lambda_i$.
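As a quick numerical check (a sketch in Python/numpy; the symmetric matrix A below is only an illustrative example), the spectral decomposition and the trace/determinant identities can be verified as follows:

```python
import numpy as np

# Illustrative symmetric matrix
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 2.0],
              [0.0, 2.0, 5.0]])

# eigh returns the eigenvalues and an orthogonal matrix P whose columns
# are the corresponding eigenvectors
lam, P = np.linalg.eigh(A)

assert np.allclose(P @ P.T, np.eye(3))              # P is orthogonal
assert np.allclose(A, P @ np.diag(lam) @ P.T)       # A = P Lambda P'
assert np.isclose(np.trace(A), lam.sum())           # tr(A) = sum of eigenvalues
assert np.isclose(np.linalg.det(A), lam.prod())     # |A| = product of eigenvalues
```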


Idempotent Matrices

An $m \times m$ matrix $A$ is idempotent if

$A^2 = AA = A$.

The eigenvalues of an idempotent matrix are either zero or one:

$\lambda x = Ax = A(Ax) = A(\lambda x) = \lambda^2 x \implies \lambda = \lambda^2$.

A symmetric idempotent matrix $A$ is also referred to as a projection matrix. For any $x \in \mathbb{R}^m$,

- the vector $y = Ax$ is the orthogonal projection of $x$ onto the subspace of $\mathbb{R}^m$ generated by the column vectors of $A$;

- the vector $z = (I - A)x$ is the orthogonal projection of $x$ onto the complementary subspace, such that

  $x = y + z = Ax + (I - A)x$.


Projection Matrices

If $A$ is symmetric and idempotent, then

- if $\mathrm{rank}(A) = r$, then $A$ has $r$ eigenvalues equal to 1 and $m - r$ zero eigenvalues;

- $\mathrm{tr}(A) = \mathrm{rank}(A)$;

- $I_m - A$ is also symmetric and idempotent, of rank $m - r$.


Matrix Notations for Linear Regression

- The response vector $y = (y_1, \cdots, y_n)^T$.

- The design matrix $X$. Assume the first column of $X$ is $\mathbf{1}$. The dimension of $X$ is $n \times (1 + d)$.

- The regression coefficient vector $\beta = (\beta_0, \beta_1, \cdots, \beta_d)^T$.

- The error vector $\varepsilon = (\varepsilon_1, \cdots, \varepsilon_n)^T$.

The linear model is written as

$y = X\beta + \varepsilon$,

with the estimated coefficients $\hat\beta$ and the predicted response $\hat{y} = X\hat\beta$.


Ordinary Least Squares (OLS)

The most popular method for fitting the linear model is ordinary least squares (OLS):

$\min_{\beta} \; \mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$.

- Normal equations: $X^T(y - X\beta) = 0$.

- $\hat\beta = (X^TX)^{-1}X^Ty$ and $\hat{y} = X(X^TX)^{-1}X^Ty$.

- The residual vector is $r = y - \hat{y} = (I - P_X)y$.

- The residual sum of squares is $\mathrm{RSS} = r^T r$.
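Below is a minimal Python/numpy sketch of the OLS computation (the data are simulated purely for illustration; forming the normal equations directly is fine here, though the computation section later discusses more stable alternatives):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3

# Simulated design with an intercept column of ones (illustrative data)
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Solve the normal equations X'X beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat          # fitted values
r = y - y_hat                 # residual vector
RSS = r @ r                   # residual sum of squares
```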


Projection Matrix

Call the following square matrix the projection or hat matrix:

$P_X = X(X^TX)^{-1}X^T$.

Properties:

- symmetric and non-negative definite;

- idempotent: $P_X^2 = P_X$; the eigenvalues are 0's and 1's;

- $X^T P_X = X^T$, $\quad X^T(I - P_X) = 0$.

We have

$r = (I - P_X)y$, $\quad \mathrm{RSS} = y^T(I - P_X)y$.

Note that

$X^T r = X^T(I - P_X)y = 0$.

The residual vector is orthogonal to the column space spanned by $X$, $\mathrm{col}(X)$.
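These properties can be checked numerically; the following sketch builds $P_X$ explicitly, which is reasonable only for a small illustrative example (it is an $n \times n$ matrix and is rarely formed in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # illustrative design
y = rng.normal(size=n)

# Hat (projection) matrix P_X = X (X'X)^{-1} X'
P_X = X @ np.linalg.solve(X.T @ X, X.T)
r = (np.eye(n) - P_X) @ y                      # residual vector

assert np.allclose(P_X, P_X.T)                 # symmetric
assert np.allclose(P_X @ P_X, P_X)             # idempotent
ev = np.linalg.eigvalsh(P_X)
assert np.allclose(ev * (1 - ev), 0)           # eigenvalues are 0's and 1's
assert np.allclose(X.T @ r, 0)                 # residuals orthogonal to col(X)
assert np.isclose(r @ r, y @ (np.eye(n) - P_X) @ y)   # RSS = y'(I - P_X) y
```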


[Figure 3.2 of Hastie, Tibshirani & Friedman (2001), Chapter 3: The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2. The projection ŷ represents the vector of the least squares predictions.]


Sampling Properties of $\hat\beta$

$\mathrm{Var}(\hat\beta) = \sigma^2 (X^TX)^{-1}$.

The variance $\sigma^2$ can be estimated as

$\hat\sigma^2 = \mathrm{RSS}/(n - d - 1)$.

This is an unbiased estimator, i.e., $E(\hat\sigma^2) = \sigma^2$.
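A small Monte Carlo sketch (with simulated data; the design, coefficients, and noise level are illustrative) suggests the unbiasedness numerically: averaging $\hat\sigma^2$ over repeated samples gives a value close to the true $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 50, 3, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # fixed design
beta = np.array([1.0, 2.0, -1.0, 0.5])

sigma2_hats = []
for _ in range(2000):                     # repeated samples from the same model
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    r = y - X @ beta_hat
    sigma2_hats.append(r @ r / (n - d - 1))     # sigma2_hat = RSS / (n - d - 1)

print(np.mean(sigma2_hats))               # approximately sigma**2 = 0.49
```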


Inferences for Gaussian Errors

Under the Normal assumption on the error ε, we have

- $\hat\beta \sim N(\beta, \sigma^2(X^TX)^{-1})$

- $(n - d - 1)\hat\sigma^2 \sim \sigma^2 \chi^2_{n-d-1}$

- $\hat\beta$ is independent of $\hat\sigma^2$

To test $H_0: \beta_j = 0$, we use

- if $\sigma^2$ is known, $z_j = \dfrac{\hat\beta_j}{\sigma\sqrt{v_j}}$, which has a standard Normal distribution under $H_0$;

- if $\sigma^2$ is unknown, $t_j = \dfrac{\hat\beta_j}{\hat\sigma\sqrt{v_j}}$, which has a $t_{n-d-1}$ distribution under $H_0$;

where $v_j$ is the $j$th diagonal element of $(X^TX)^{-1}$.
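A hedged sketch of these coefficient t-tests in Python (simulated data with illustrative coefficients; scipy.stats supplies the $t_{n-d-1}$ tail probabilities):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, d = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
beta = np.array([0.5, 1.5, 0.0])          # last coefficient is truly zero
y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
r = y - X @ beta_hat
sigma2_hat = r @ r / (n - d - 1)

v = np.diag(XtX_inv)                      # v_j = jth diagonal element of (X'X)^{-1}
t_stats = beta_hat / np.sqrt(sigma2_hat * v)
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - d - 1)
print(t_stats, p_values)
```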


Confidence Interval for Individual Coefficients

Under the Normal assumption, the $100(1-\alpha)\%$ C.I. of $\beta_j$ is

$\hat\beta_j \pm t_{n-d-1;\,\alpha/2}\, \hat\sigma\sqrt{v_j}$,

where $t_{k;\nu}$ is the $1 - \nu$ percentile of the $t_k$ distribution.

In practice, we use the approximate $100(1-\alpha)\%$ C.I. of $\beta_j$:

$\hat\beta_j \pm z_{\alpha/2}\, \hat\sigma\sqrt{v_j}$,

where $z_{\alpha/2}$ is the $1 - \alpha/2$ percentile of the standard Normal distribution.

Even if the Gaussian assumption does not hold, this interval is approximately correct, with coverage probability approaching $1 - \alpha$ as $n \to \infty$.


Review on Multivariate Normal Distributions

Distributions of quadratic forms (non-central $\chi^2$):

- If $X \sim N_p(\mu, I_p)$, then

  $W = X^TX = \sum_{i=1}^p X_i^2 \sim \chi^2_p(\lambda)$, with $\lambda = \frac{1}{2}\mu^T\mu$.

  Special case: if $X \sim N_p(0, I_p)$, then $W = X^TX \sim \chi^2_p$.

- If $X \sim N_p(\mu, V)$ where $V$ is nonsingular, then

  $W = X^T V^{-1} X \sim \chi^2_p(\lambda)$, with $\lambda = \frac{1}{2}\mu^T V^{-1}\mu$.

- If $X \sim N_p(\mu, V)$ with $V$ nonsingular, and $A$ is symmetric with $AV$ idempotent of rank $s$, then

  $W = X^T A X \sim \chi^2_s(\lambda)$, with $\lambda = \frac{1}{2}\mu^T A \mu$.


Cochran’s Theorem

Let $y \sim N_n(\mu, \sigma^2 I_n)$ and let $A_j$, $j = 1, \cdots, J$, be symmetric idempotent matrices with rank $s_j$. Furthermore, assume that $\sum_{j=1}^J A_j = I_n$ and $\sum_{j=1}^J s_j = n$. Then

(i) $W_j = \frac{1}{\sigma^2} y^T A_j y \sim \chi^2_{s_j}(\lambda_j)$, where $\lambda_j = \frac{1}{2\sigma^2}\mu^T A_j \mu$;

(ii) the $W_j$'s are mutually independent of each other.

Essentially, we decompose $y^T y$ into the (scaled) sum of quadratic forms:

$\sum_{i=1}^n y_i^2 = y^T I_n y = \sum_{j=1}^J y^T A_j y$.


Application of Cochran’s Theorem to Linear Models

Example: Assume $y \sim N_n(X\beta, \sigma^2 I_n)$. Define $A = I - P_X$ and

- the residual sum of squares: $\mathrm{RSS} = y^T A y = \|r\|^2$;

- the sum of squares regression: $\mathrm{SSR} = y^T P_X y = \|\hat{y}\|^2$.

By Cochran's Theorem, we have

(i) $\mathrm{RSS}/\sigma^2 \sim \chi^2_{n-d-1}$ and $\mathrm{SSR}/\sigma^2 \sim \chi^2_{d+1}(\lambda)$, where $\lambda = (X\beta)^T(X\beta)/(2\sigma^2)$;

(ii) $\mathrm{RSS}$ is independent of $\mathrm{SSR}$. (Note $r \perp \hat{y}$.)


F Distribution

- If $U_1 \sim \chi^2_p$, $U_2 \sim \chi^2_q$, and $U_1 \perp U_2$, then

  $F = \dfrac{U_1/p}{U_2/q} \sim F_{p,q}$.

- If $U_1 \sim \chi^2_p(\lambda)$, $U_2 \sim \chi^2_q$, and $U_1 \perp U_2$, then

  $F = \dfrac{U_1/p}{U_2/q} \sim F_{p,q}(\lambda)$ (noncentral $F$).

Example: Assume $y \sim N_n(X\beta, \sigma^2 I_n)$. Let $A = I - P_X$, and

$\mathrm{RSS} = y^T A y = \|r\|^2$, $\quad \mathrm{SSR} = y^T P_X y = \|\hat{y}\|^2$.

Then

$F = \dfrac{\mathrm{SSR}/(d+1)}{\mathrm{RSS}/(n-d-1)} \sim F_{d+1,\,n-d-1}(\lambda)$, $\quad \lambda = \|X\beta\|^2/(2\sigma^2)$.


Making Inferences about Multiple Parameters

Assume $X = [X_0, X_1]$, where $X_0$ consists of the first $k$ columns. Correspondingly, $\beta = [\beta_0', \beta_1']'$. To test $H_0: \beta_0 = 0$, we use

$F = \dfrac{(\mathrm{RSS}_1 - \mathrm{RSS})/k}{\mathrm{RSS}/(n-d-1)}$,

where

- $\mathrm{RSS}_1 = y^T(I - P_{X_1})y$ (reduced model);

- $\mathrm{RSS} = y^T(I - P_X)y$ (full model);

- $\mathrm{RSS} \sim \sigma^2\chi^2_{n-d-1}$;

- $\mathrm{RSS}_1 - \mathrm{RSS} = y^T(P_X - P_{X_1})y$.


Testing Multiple Parameters

Applying Cochran's Theorem to $\mathrm{RSS}_1$, $\mathrm{RSS}$, and $\mathrm{RSS}_1 - \mathrm{RSS}$:

- they are independent;

- they respectively follow (noncentral) $\chi^2$ distributions, with noncentralities $(X\beta)^T(I - P_{X_1})(X\beta)/(2\sigma^2)$, $0$, and $(X\beta)^T(P_X - P_{X_1})(X\beta)/(2\sigma^2)$.

Then we have

$F \sim F_{k,\,n-d-1}(\lambda)$, with $\lambda = (X\beta)^T(P_X - P_{X_1})(X\beta)/(2\sigma^2)$.

Under $H_0$, we have $X\beta = X_1\beta_1$, so $F \sim F_{k,\,n-d-1}$.


Nested Model Selection

To test for the significance of groups of coefficients simultaneously, we use the $F$-statistic

$F = \dfrac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(d_1 - d_0)}{\mathrm{RSS}_1/(n - d_1 - 1)}$,

where

- $\mathrm{RSS}_1$ is the RSS for the bigger model with $d_1 + 1$ parameters;

- $\mathrm{RSS}_0$ is the RSS for the nested smaller model with $d_0 + 1$ parameters, i.e., with $d_1 - d_0$ parameters constrained to zero.

The $F$-statistic measures the change in RSS per additional parameter in the bigger model, normalized by an estimate of $\sigma^2$.

Under the assumption that the smaller model is correct, $F \sim F_{d_1 - d_0,\, n - d_1 - 1}$.
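A sketch of this nested-model F-test on simulated data (the designs and coefficients are illustrative; scipy.stats gives the $F_{d_1-d_0,\,n-d_1-1}$ tail probability):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, d0, d1 = 100, 2, 4
X0 = np.column_stack([np.ones(n), rng.normal(size=(n, d0))])   # smaller model
X1 = np.column_stack([X0, rng.normal(size=(n, d1 - d0))])      # bigger (nested) model
beta = np.array([1.0, 2.0, -1.0, 0.0, 0.0])                    # extra coefficients truly zero
y = X1 @ beta + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares of the OLS fit of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r

RSS0, RSS1 = rss(X0, y), rss(X1, y)
F = ((RSS0 - RSS1) / (d1 - d0)) / (RSS1 / (n - d1 - 1))
p_value = stats.f.sf(F, d1 - d0, n - d1 - 1)
print(F, p_value)
```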


Confidence Set

The approximate confidence set of $\beta$ is

$C_\beta = \{\beta \mid (\hat\beta - \beta)^T (X^TX)(\hat\beta - \beta) \le \hat\sigma^2 \chi^2_{d+1;\,1-\alpha}\}$,

where $\chi^2_{k;\,1-\alpha}$ is the $1-\alpha$ percentile of the $\chi^2_k$ distribution.

The confidence interval for the true function $f(x) = x^T\beta$ is

$\{x^T\beta \mid \beta \in C_\beta\}$.


Gauss-Markov Theorem

Assume $s^T\beta$ is linearly estimable, i.e., there exists a linear estimator $b + c^Ty$ such that $E(b + c^Ty) = s^T\beta$.

- A function $s^T\beta$ is linearly estimable iff $s = X^Ta$ for some $a$.

Theorem: If $s^T\beta$ is linearly estimable, then $s^T\hat\beta$ is the best linear unbiased estimator (BLUE) of $s^T\beta$:

- For any $c^Ty$ satisfying $E(c^Ty) = s^T\beta$, we have $\mathrm{Var}(s^T\hat\beta) \le \mathrm{Var}(c^Ty)$.

- $s^T\hat\beta$ is the best among all the unbiased estimators. (It is a function of the complete and sufficient statistic $(y^Ty, X^Ty)$.)

Question: Is it possible to find a slightly biased linear estimator but with smaller variance? (Trade a little bias for a large reduction in variance.)


Linear Regression with Orthogonal Design

If $X$ is univariate, the least squares estimate is

$\hat\beta = \dfrac{\sum_i x_i y_i}{\sum_i x_i^2} = \dfrac{\langle x, y\rangle}{\langle x, x\rangle}$.

If $X = [x_1, \ldots, x_d]$ has orthogonal columns, i.e.,

$\langle x_j, x_k\rangle = 0, \ \forall j \neq k$,

or equivalently $X^TX = \mathrm{diag}(\|x_1\|^2, \ldots, \|x_d\|^2)$, the OLS estimates are given as

$\hat\beta_j = \dfrac{\langle x_j, y\rangle}{\langle x_j, x_j\rangle}$ for $j = 1, \ldots, d$.

Each input has no effect on the estimation of the other parameters: multiple linear regression reduces to a collection of univariate regressions.
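A short sketch (with an artificially orthogonalized design, for illustration only) confirming that the joint OLS fit coincides with the column-by-column univariate fits when the columns are orthogonal:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 60, 3

# Build a design with exactly orthogonal columns via a QR factorization
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
X = Q                                        # orthonormal, hence orthogonal, columns
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Joint OLS fit
beta_joint = np.linalg.solve(X.T @ X, X.T @ y)

# One-at-a-time univariate fits: <x_j, y> / <x_j, x_j>
beta_univ = np.array([(X[:, j] @ y) / (X[:, j] @ X[:, j]) for j in range(d)])

assert np.allclose(beta_joint, beta_univ)
```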


How to orthogonalize X?

Consider the simple linear regression $y = \beta_0\mathbf{1} + \beta_1 x + \varepsilon$. We regress $x$ onto $\mathbf{1}$ and obtain the residual

$z = x - \bar{x}\mathbf{1}$.

Orthogonalization Process:

- The residual $z$ is orthogonal to the regressor $\mathbf{1}$.

- The column space of $X$ is $\mathrm{span}\{\mathbf{1}, x\}$. Note: $\hat{y} \in \mathrm{span}\{\mathbf{1}, x\} = \mathrm{span}\{\mathbf{1}, z\}$, because

  $\beta_0\mathbf{1} + \beta_1 x = \beta_0\mathbf{1} + \beta_1[\bar{x}\mathbf{1} + (x - \bar{x}\mathbf{1})] = \beta_0\mathbf{1} + \beta_1[\bar{x}\mathbf{1} + z] = (\beta_0 + \beta_1\bar{x})\mathbf{1} + \beta_1 z = \eta_0\mathbf{1} + \beta_1 z$.

- $\{\mathbf{1}, z\}$ forms an orthogonal basis for the column space of $X$.


How to orthogonalize X? (continued)

Estimation Process:

- First, we regress $y$ onto $z$ for the OLS estimate of the slope $\beta_1$:

  $\hat\beta_1 = \dfrac{\langle y, z\rangle}{\langle z, z\rangle} = \dfrac{\langle y, x - \bar{x}\mathbf{1}\rangle}{\langle x - \bar{x}\mathbf{1},\, x - \bar{x}\mathbf{1}\rangle}$.

- Second, we regress $y$ onto $\mathbf{1}$ and get the coefficient $\hat\eta_0 = \bar{y}$.

The OLS fit is given as

$\hat{y} = \hat\eta_0\mathbf{1} + \hat\beta_1 z = \hat\eta_0\mathbf{1} + \hat\beta_1(x - \bar{x}\mathbf{1}) = (\hat\eta_0 - \hat\beta_1\bar{x})\mathbf{1} + \hat\beta_1 x$.

Therefore, the OLS intercept is obtained as

$\hat\beta_0 = \hat\eta_0 - \hat\beta_1\bar{x} = \bar{y} - \hat\beta_1\bar{x}$.
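A quick numerical check of this two-step recipe against the direct OLS fit (simulated data with illustrative coefficients):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.4, size=n)

# Orthogonalization: residual of x regressed on the intercept column of ones
z = x - x.mean()
beta1 = (y @ z) / (z @ z)                 # slope from regressing y on z
beta0 = y.mean() - beta1 * x.mean()       # intercept: eta0_hat - beta1_hat * xbar

# Direct OLS on [1, x] for comparison
X = np.column_stack([np.ones(n), x])
assert np.allclose([beta0, beta1], np.linalg.solve(X.T @ X, X.T @ y))
```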


[Figure 3.4 of Hastie, Tibshirani & Friedman (2001), Chapter 3: Least squares regression by orthogonalization of the inputs. The vector x2 is regressed on the vector x1, leaving the residual vector z. The regression of y on z gives the multiple regression coefficient of x2. Adding together the projections of y on each of x1 and z gives the least squares fit ŷ.]


How to orthogonalize X? (d = 2)

Consider $y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$.

Orthogonalization process:

1. We regress $x_2$ onto $x_1$ and compute the residual

   $z_1 = x_2 - \gamma_{12} x_1$. (Note $z_1 \perp x_1$.)

2. We regress $x_3$ onto $(x_1, z_1)$ and compute the residual

   $z_2 = x_3 - \gamma_{13} x_1 - \gamma_{23} z_1$. (Note $z_2 \perp \{x_1, z_1\}$.)

Note: $\mathrm{span}\{x_1, x_2, x_3\} = \mathrm{span}\{x_1, z_1, z_2\}$, because

$\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 = \beta_1 x_1 + \beta_2(\gamma_{12} x_1 + z_1) + \beta_3(\gamma_{13} x_1 + \gamma_{23} z_1 + z_2) = (\beta_1 + \beta_2\gamma_{12} + \beta_3\gamma_{13}) x_1 + (\beta_2 + \beta_3\gamma_{23}) z_1 + \beta_3 z_2 = \eta_1 x_1 + \eta_2 z_1 + \beta_3 z_2$.


Estimation Process

We project $y$ onto the orthogonal basis $\{x_1, z_1, z_2\}$ one by one, and then recover the coefficients corresponding to the original columns of $X$.

- First, we regress $y$ onto $z_2$ for the OLS estimate of the slope $\beta_3$:

  $\hat\beta_3 = \dfrac{\langle y, z_2\rangle}{\langle z_2, z_2\rangle}$.

- Second, we regress $y$ onto $z_1$, leading to the coefficient $\hat\eta_2$, and

  $\hat\beta_2 = \hat\eta_2 - \hat\beta_3\gamma_{23}$.

- Third, we regress $y$ onto $x_1$, leading to the coefficient $\hat\eta_1$, and

  $\hat\beta_1 = \hat\eta_1 - \hat\beta_3\gamma_{13} - \hat\beta_2\gamma_{12}$.


Gram-Schmidt Procedure (Successive Orthogonalization)

1. Initialize $z_0 = x_0 = \mathbf{1}$.

2. For $j = 1, \ldots, d$: regress $x_j$ on $z_0, z_1, \ldots, z_{j-1}$ to produce the coefficients

   $\gamma_{kj} = \dfrac{\langle z_k, x_j\rangle}{\langle z_k, z_k\rangle}$ for $k = 0, \ldots, j-1$,

   and the residual vector $z_j = x_j - \sum_{k=0}^{j-1}\gamma_{kj} z_k$. ($\{z_0, z_1, \ldots, z_{j-1}\}$ are orthogonal.)

3. Regress $y$ on the residual $z_d$ to get

   $\hat\beta_d = \dfrac{\langle y, z_d\rangle}{\langle z_d, z_d\rangle}$.

4. Compute $\hat\beta_j$, $j = d-1, \cdots, 0$, in that order successively.

- $\{z_0, z_1, \ldots, z_d\}$ forms an orthogonal basis for $\mathrm{col}(X)$.

- The multiple regression coefficient $\hat\beta_j$ represents the additional contribution of $x_j$ to $y$, after $x_j$ has been adjusted for $x_0, x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_d$.
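A sketch of the procedure in Python/numpy (the back-substitution in step 4 is implemented here by solving the triangular system $\Gamma\hat\beta = \hat\eta$, where $\hat\eta_j = \langle z_j, y\rangle/\langle z_j, z_j\rangle$; the data are simulated for illustration):

```python
import numpy as np

def successive_orthogonalization(X, y):
    """Gram-Schmidt regression sketch; X is assumed to contain the column of ones."""
    n, p = X.shape                          # p = d + 1 columns including the intercept
    Z = np.zeros((n, p))
    Gamma = np.eye(p)                       # upper triangular, 1's on the diagonal
    for j in range(p):
        Z[:, j] = X[:, j]
        for k in range(j):
            Gamma[k, j] = (Z[:, k] @ X[:, j]) / (Z[:, k] @ Z[:, k])
            Z[:, j] -= Gamma[k, j] * Z[:, k]
    # Coefficients of y on the orthogonal z_j's, then beta = Gamma^{-1} eta
    eta = np.array([(Z[:, j] @ y) / (Z[:, j] @ Z[:, j]) for j in range(p)])
    beta = np.linalg.solve(Gamma, eta)
    return beta, Z, Gamma

# Quick check against the normal-equation solution (simulated data)
rng = np.random.default_rng(6)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 3))])
y = X @ np.array([1.0, 0.5, -2.0, 1.5]) + rng.normal(size=40)
beta_gs, Z, Gamma = successive_orthogonalization(X, y)
assert np.allclose(beta_gs, np.linalg.solve(X.T @ X, X.T @ y))
```

Solving the triangular system from the bottom up reproduces step 4: $\hat\beta_d = \hat\eta_d$ first, then $\hat\beta_{d-1}$, and so on.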


Collinearity Issue

The $d$th coefficient is

$\hat\beta_d = \dfrac{\langle z_d, y\rangle}{\langle z_d, z_d\rangle}$.

If $x_d$ is highly correlated with some of the other $x_j$'s, then

- the residual vector $z_d$ is close to zero;

- the coefficient $\hat\beta_d$ will be very unstable.

The variance of the estimate is

$\mathrm{Var}(\hat\beta_d) = \dfrac{\sigma^2}{\|z_d\|^2}$.

The precision for estimating $\beta_d$ depends on the length of $z_d$, i.e., on how much of $x_d$ is unexplained by the other $x_k$'s.


Two Computational Algorithms For Multiple Regression

Consider the normal equations

$X^TX\beta = X^Ty$.

We would like to avoid computing $(X^TX)^{-1}$ directly.

1. QR decomposition of $X$: $X = QR$, where $Q$ has orthonormal columns and $R$ is upper triangular. Essentially, a process of orthogonal matrix triangularization.

2. Cholesky decomposition of $X^TX$: $X^TX = RR^T$, where $R$ is lower triangular.


Matrix Formulation of Orthogonalization

In Step 2 of the Gram-Schmidt procedure, for $j = 1, \ldots, d$,

$z_j = x_j - \sum_{k=0}^{j-1}\gamma_{kj} z_k \implies x_j = \sum_{k=0}^{j-1}\gamma_{kj} z_k + z_j$.

In matrix form, with $X = [x_1, \ldots, x_d]$ and $Z = [z_1, \ldots, z_d]$,

$X = Z\Gamma$.

- The columns of $Z$ are orthogonal to each other.

- The matrix $\Gamma$ is upper triangular, with 1's on the diagonal.

Standardizing $Z$ using $D = \mathrm{diag}\{\|z_1\|, \ldots, \|z_d\|\}$,

$X = Z\Gamma = ZD^{-1}D\Gamma \equiv QR$, with $Q = ZD^{-1}$, $R = D\Gamma$.


QR Decomposition

- The columns of $Q$ form an orthonormal basis for the column space of $X$.

- $Q$ is an $n \times d$ matrix with orthonormal columns, satisfying $Q^TQ = I$.

- $R$ is a $d \times d$ upper triangular matrix of full rank.

- $X^TX = (QR)^T(QR) = R^TQ^TQR = R^TR$.

The least squares solutions are

$\hat\beta = (X^TX)^{-1}X^Ty = R^{-1}R^{-T}R^TQ^Ty = R^{-1}Q^Ty$,

$\hat{y} = X\hat\beta = (QR)(R^{-1}Q^Ty) = QQ^Ty$.


QR Algorithm for Normal Equations

Regard $\hat\beta$ as the solution of the linear system

$R\beta = Q^Ty$.

1. Conduct the QR decomposition of $X = QR$ (Gram-Schmidt orthogonalization).

2. Compute $Q^Ty$.

3. Solve the triangular system $R\beta = Q^Ty$.

The computational complexity is of order $nd^2$.
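A minimal sketch of these three steps with numpy/scipy (simulated data; np.linalg.qr returns the thin factorization $X = QR$ used here):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(7)
n, d = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(size=n)

# Step 1: thin QR decomposition of X
Q, R = np.linalg.qr(X)         # Q: n x (d+1), R: (d+1) x (d+1) upper triangular

# Step 2: compute Q'y
Qty = Q.T @ y

# Step 3: solve the triangular system R beta = Q'y by back substitution
beta_hat = solve_triangular(R, Qty)

assert np.allclose(beta_hat, np.linalg.solve(X.T @ X, X.T @ y))
```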


Cholesky Decomposition Algorithm

For any positive definite square matrix $A$, we have

$A = RR^T$,

where $R$ is a lower triangular matrix of full rank.

1. Compute $X^TX$ and $X^Ty$.

2. Factor $X^TX = RR^T$; then $\hat\beta = (R^T)^{-1}R^{-1}X^Ty$.

3. Solve the triangular system $Rw = X^Ty$ for $w$.

4. Solve the triangular system $R^T\beta = w$ for $\hat\beta$.

The computational complexity is of order $d^3 + nd^2/2$ (can be faster than QR for small $d$, but can be less stable).

For a new point $x_0$, the prediction variance is $\mathrm{Var}(\hat{y}_0) = \mathrm{Var}(x_0^T\hat\beta) = \sigma^2\, x_0^T(R^T)^{-1}R^{-1}x_0$.
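A matching sketch of the Cholesky route (simulated data; np.linalg.cholesky returns the lower triangular factor $R$ with $X^TX = RR^T$, and the two triangular solves mirror steps 3 and 4):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(8)
n, d = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(size=n)

# Step 1: form X'X and X'y
XtX, Xty = X.T @ X, X.T @ y

# Step 2: Cholesky factor X'X = R R^T, with R lower triangular
R = np.linalg.cholesky(XtX)

# Step 3: solve R w = X'y for w (forward substitution)
w = solve_triangular(R, Xty, lower=True)

# Step 4: solve R^T beta = w for beta (back substitution)
beta_hat = solve_triangular(R.T, w, lower=False)

assert np.allclose(beta_hat, np.linalg.solve(XtX, Xty))
```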
