
Page 1: Linear Regression using R

Linear Regression using R

Sungsu Lim, Applied Algorithm Lab.

KAIST


Page 2: Linear Regression using R

Regression

• Regression analysis answers questions about the dependencies of a response variable on one or more predictors,
• including prediction of future values of a response, discovering which predictors are important, and estimating the impact of changing a predictor or a treatment on the value of the response.
• In linear regression, relationships are modeled using linear functions whose unknown parameters are estimated from the data. (Usually, the model describes the conditional mean of Y given the value of X.)


Page 3: Linear Regression using R

Correlation Coefficient

• The correlation coefficient between two random variables X and Y is defined as

$$\rho_{X,Y} = \frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y}.$$

• If we have a series of n measurements of X and Y, then the sample correlation coefficient is defined as

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}.$$

• It has a value between -1 and 1, and it indicates the degree of linear dependence between the variables. It detects only linear dependencies between two variables.
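A minimal sketch of that last point (simulated data, not from the slides): a variable can depend perfectly on another and still have near-zero correlation when the dependence is nonlinear.

> set.seed(1)
> x <- rnorm(1000) # symmetric around 0
> y <- x^2 # exact, but purely nonlinear, dependence
> cor(x, y) # close to 0
> cor(x, 2*x + rnorm(1000)) # strong linear dependence: close to 1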

Page 4: Linear Regression using R

Example

> install.packages("alr3") # Installing a package
> library(alr3) # Loading a package
> data(fuel2001) # Loading a specific data set
> fueldata<-fuel2001[,1:5]
> fueldata[,1]<-fuel2001$Tax
> fueldata[,2]<-1000*fuel2001$Drivers/fuel2001$Pop
> fueldata[,3]<-fuel2001$Income/1000
> fueldata[,4]<-log(fuel2001$Miles,2)
> fueldata[,5]<-1000*fuel2001$FuelC/fuel2001$Pop
> colnames(fueldata)<-c("Tax","Dlic","Income","logMiles","Fuel")


Page 5: Linear Regression using R

Example

> cor(fueldata)
                 Tax        Dlic      Income    logMiles       Fuel
Tax       1.00000000 -0.08584424 -0.01068494 -0.04373696 -0.2594471
Dlic     -0.08584424  1.00000000 -0.17596063  0.03059068  0.4685063
Income   -0.01068494 -0.17596063  1.00000000 -0.29585136 -0.4644050
logMiles -0.04373696  0.03059068 -0.29585136  1.00000000  0.4220323
Fuel     -0.25944711  0.46850627 -0.46440498  0.42203233  1.0000000
> round(cor(fueldata),2)
           Tax  Dlic Income logMiles  Fuel
Tax       1.00 -0.09  -0.01    -0.04 -0.26
Dlic     -0.09  1.00  -0.18     0.03  0.47
Income   -0.01 -0.18   1.00    -0.30 -0.46
logMiles -0.04  0.03  -0.30     1.00  0.42
Fuel     -0.26  0.47  -0.46     0.42  1.00
> cor(fueldata$Dlic,fueldata$Fuel)
[1] 0.4685063

Page 6: Linear Regression using R

Example

> pairs(fuel2001) # Scatterplot matrix of all pairs of variables


Page 7: Linear Regression using R

Simple Linear Regression


• We make n paired observations on two variables: $(x_1, y_1), \dots, (x_n, y_n)$.
• The objective is to test for a linear relationship between them,

$$y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{``predictable''}} + \underbrace{\varepsilon_i}_{\text{``random error''}}.$$

• How to quantify a good fit? The least squares approach: choose $\beta = (\beta_0, \beta_1)'$ to minimize

$$L(\beta) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.$$

Page 8: Linear Regression using R

Simple Linear Regression


• $L(\beta)$ is the sum of squared errors (SSE).
• It is minimized by solving $L'(\beta) = 0$, and we have

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \qquad \text{and} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$

• If we assume $\varepsilon_i \sim N(0, \sigma^2)$ i.i.d. (identically & independently distributed), then the least squares solution also yields the MLE (maximum likelihood estimates).
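As a quick check (a sketch on simulated data, not part of the original slides), the closed-form estimates agree with what lm() returns:

> set.seed(1)
> x <- runif(50)
> y <- 2 + 3*x + rnorm(50, sd=0.5) # true intercept 2, slope 3
> b1 <- sum((x-mean(x))*(y-mean(y))) / sum((x-mean(x))^2)
> b0 <- mean(y) - b1*mean(x)
> c(b0, b1)
> coef(lm(y ~ x)) # identical estimates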

Page 9: Linear Regression using R

Simple Linear Regression


• Assumptions of the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$:

1. Errors are normally distributed, $\varepsilon_i \sim N(0, \sigma^2)$ (normality of errors).
2. Error variances are equal (homoscedasticity of errors).
3. Errors are independent (independence of errors).
4. Y has a linear dependence on X.

Page 10: Linear Regression using R

Example

> library(alr3)
> data(forbes)
> forbes
    Temp Pressure  Lpres
1  194.5    20.79 131.79
…
17 212.2    30.06 147.80
> g<-lm(Lpres ~ Temp, data=forbes)
> g

Call:
lm(formula = Lpres ~ Temp, data = forbes)

Coefficients:
(Intercept)         Temp
   -42.1378       0.8955


> plot(forbes$Temp,forbes$Lpres)
> abline(g$coeff,col="red") # Add the fitted line to the scatterplot

Page 11: Linear Regression using R

Example

> summary(g)

Call:
lm(formula = Lpres ~ Temp, data = forbes)

Residuals:
     Min       1Q   Median       3Q      Max
-0.32220 -0.14473 -0.06664  0.02184  1.35978

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -42.13778    3.34020  -12.62 2.18e-09 ***
Temp          0.89549    0.01645   54.43  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.379 on 15 degrees of freedom
Multiple R-squared: 0.995,	Adjusted R-squared: 0.9946
F-statistic: 2963 on 1 and 15 DF,  p-value: < 2.2e-16

Page 12: Linear Regression using R

Multiple Linear Regression


• Assumptions of the linear model

$$E(Y \mid X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p, \qquad \operatorname{Var}(Y \mid X) = \sigma^2.$$

1. Errors are normally distributed, $\varepsilon_i \sim N(0, \sigma^2)$.
2. Error variances are equal.
3. Errors are independent.
4. Y has a linear dependence on X.

Page 13: Linear Regression using R

Multiple Linear Regression


• Using the matrix representation, the model

$$y_j = \beta_0 + \beta_1 x_{j1} + \cdots + \beta_p x_{jp} + e_j, \quad j = 1, \dots, n, \qquad e_j \sim \operatorname{NID}(0, \sigma^2),$$

is written as $Y = X\beta + e$ with $e \sim N(0, \sigma^2 I_n)$:

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}
+
\begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}.$$

Page 14: Linear Regression using R

Multiple Linear Regression


• The residual sum of squares is

$$L(\beta) = \sum_{i=1}^{n} (y_i - x_i'\beta)^2 = (Y - X\beta)'(Y - X\beta) = e'e.$$

• We can compute the $\hat{\beta}$ that minimizes $L(\beta)$ by using the matrix representation. Expanding,

$$L(\beta) = (Y - X\beta)'(Y - X\beta) = Y'Y - \beta'X'Y - Y'X\beta + \beta'X'X\beta = Y'Y - 2\beta'X'Y + \beta'X'X\beta,$$

and setting the gradient to zero,

$$\left.\frac{\partial L(\beta)}{\partial \beta}\right|_{\hat{\beta}} = -2X'Y + 2X'X\hat{\beta} = 0 \qquad \text{(matrix version of the normal equations)},$$

gives the OLS (ordinary least squares) estimate

$$\hat{\beta} = (X'X)^{-1} X'Y.$$
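As a sketch (reusing the fueldata frame built earlier), solving the normal equations directly reproduces lm()'s coefficients:

> X <- cbind(1, as.matrix(fueldata[,c("Tax","Dlic","Income","logMiles")]))
> y <- fueldata$Fuel
> beta.hat <- solve(t(X) %*% X, t(X) %*% y) # solves X'X b = X'y
> cbind(beta.hat, coef(lm(Fuel ~ ., data=fueldata))) # two identical columns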

Page 15: Linear Regression using R

Multiple Linear Regression


• To minimize SSE = e'e, we must have $X'\hat{e} = 0$: the residual vector is orthogonal to every column of X (the normal equations again).
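A one-line check, continuing the sketch above (it reuses X, y, and beta.hat):

> e.hat <- y - X %*% beta.hat # residual vector
> t(X) %*% e.hat # all entries zero up to rounding error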

Page 16: Linear Regression using R

Multiple Linear Regression


• Fact: $\hat{\sigma}^2 = MSE = \dfrac{SSE}{n-(p+1)}$ is an unbiased estimator of $\sigma^2$.
• If e is normally distributed,

$$\frac{\hat{\sigma}^2\,(n-(p+1))}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2(n-(p+1)).$$

• Define SSreg = SYY − SSE (SYY = the total sum of squares of Y). As with simple regression, the coefficient of determination is

$$R^2 = \frac{SS_{reg}}{SYY} = 1 - \frac{SSE}{SYY}.$$

It is also called the (squared) multiple correlation coefficient because it equals the maximum squared correlation between Y and any linear combination of the terms in the mean function.

Page 17: Linear Regression using R

Example

> summary(lm(Fuel~.,data=fueldata)) # How can we analyze these results?

Call:
lm(formula = Fuel ~ ., data = fueldata)

Residuals:
     Min       1Q   Median       3Q      Max
-163.145  -33.039    5.895   31.989  183.499

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  154.1928   194.9062   0.791 0.432938
Tax           -4.2280     2.0301  -2.083 0.042873 *
Dlic           0.4719     0.1285   3.672 0.000626 ***
Income        -6.1353     2.1936  -2.797 0.007508 **
logMiles      18.5453     6.4722   2.865 0.006259 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 64.89 on 46 degrees of freedom
Multiple R-squared: 0.5105,	Adjusted R-squared: 0.4679
F-statistic: 11.99 on 4 and 46 DF,  p-value: 9.33e-07

The adjusted R-squared reported above is

$$R^2_{adj} = 1 - \frac{SSE/(n-(p+1))}{SYY/(n-1)} = 1 - \frac{n-1}{n-(p+1)}\,(1 - R^2).$$
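A sketch reproducing both figures by hand from their definitions:

> m <- lm(Fuel ~ ., data=fueldata)
> SSE <- sum(resid(m)^2)
> SYY <- sum((fueldata$Fuel - mean(fueldata$Fuel))^2)
> 1 - SSE/SYY # Multiple R-squared: 0.5105
> n <- nrow(fueldata); p <- 4
> 1 - (SSE/(n-p-1)) / (SYY/(n-1)) # Adjusted R-squared: 0.4679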

Page 18: Linear Regression using R

t-test


• We want to test $H_0: \beta_i = b_i$ against $H_1: \beta_i \neq b_i$.
• Assume $e \sim N(0, \sigma^2 I)$; then $\hat{\beta}_i \sim N(\beta_i,\, \sigma^2 c_{ii})$, where $c_{ii} = [(X'X)^{-1}]_{ii}$.
• Since $\hat{\beta}_i \sim N(\beta_i,\, \sigma^2 c_{ii})$ and

$$\frac{RSS}{\sigma^2} = \frac{(n-(p+1))\,MSE}{\sigma^2} \sim \chi^2(n-(p+1)),$$

we have

$$\frac{\hat{\beta}_i - \beta_i}{\sqrt{c_{ii}\,MSE}} \sim t(n-(p+1)).$$

Page 19: Linear Regression using R

t-test


• Hypothesis concerning one of the terms: $H_0: \beta_i = b_i$ vs. $H_1: \beta_i \neq b_i$.
• t-test statistic:

$$t = \frac{\hat{\beta}_i - b_i}{\sqrt{c_{ii}\,MSE}}.$$

• If $H_0$ is true, $t \sim t(n-(p+1))$, so we reject $H_0$ at level $\alpha$ if $|t| > t_{\alpha/2}(n-(p+1))$.
• The $(1-\alpha)$ confidence interval for $\beta_i$ is

$$\hat{\beta}_i \pm t_{\alpha/2}(n-(p+1))\,\sqrt{c_{ii}\,MSE}.$$
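A sketch reproducing the Tax row of the summary output on the next page, with the matching 95% interval from confint():

> m <- lm(Fuel ~ ., data=fueldata)
> se <- sqrt(diag(vcov(m)))["Tax"] # sqrt(c_ii * MSE)
> coef(m)["Tax"] / se # t value: -2.083
> coef(m)["Tax"] + c(-1,1) * qt(0.975, df.residual(m)) * se
> confint(m)["Tax",] # the same interval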

Page 20: Linear Regression using R

Example: t-test

> summary(lm(Fuel~.,data=fueldata))

Call:
lm(formula = Fuel ~ ., data = fueldata)

Residuals:
     Min       1Q   Median       3Q      Max
-163.145  -33.039    5.895   31.989  183.499

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  154.1928   194.9062   0.791 0.432938
Tax           -4.2280     2.0301  -2.083 0.042873 *
Dlic           0.4719     0.1285   3.672 0.000626 ***
Income        -6.1353     2.1936  -2.797 0.007508 **
logMiles      18.5453     6.4722   2.865 0.006259 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 64.89 on 46 degrees of freedom
Multiple R-squared: 0.5105,	Adjusted R-squared: 0.4679
F-statistic: 11.99 on 4 and 46 DF,  p-value: 9.33e-07

Page 21: Linear Regression using R

F-test


• We refer to the full model with all the predictors as the complete model. The model containing only some of these predictors is called the reduced model (nested within the complete model).
• Testing whether the complete model is identical to the reduced model is equivalent to testing whether the extra parameters in the complete model equal 0 (i.e., none of the extra variables increases the explained variability in Y).

Page 22: Linear Regression using R

F-test


• We may assume:

full model: $E(Y \mid X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + \cdots + \beta_p x_p$
reduced model: $E(Y \mid X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r$

• Hypothesis test for the reduced model: $H_0: \beta_{r+1} = \cdots = \beta_p = 0$ vs. $H_1:$ not $H_0$.
• When $H_0$ is true, $SSE_R \geq SSE_F$ and

$$\frac{SSE_R}{\sigma^2} \sim \chi^2(n-(r+1)), \qquad \frac{SSE_F}{\sigma^2} \sim \chi^2(n-(p+1)), \qquad \frac{SSE_R - SSE_F}{\sigma^2} \sim \chi^2(p-r).$$

Page 23: Linear Regression using R

F-test


• Hypothesis test for the reduced model: $H_0: \beta_{r+1} = \cdots = \beta_p = 0$ vs. $H_1:$ not $H_0$.
• F test statistic:

$$F = \frac{(SSE_R - SSE_F)/(p-r)}{SSE_F/(n-(p+1))}.$$

• If $H_0$ is true, $F \sim F(p-r,\, n-(p+1))$, so we reject $H_0$ at level $\alpha$ if $F > F_\alpha(p-r,\, n-(p+1))$.
• From this test, we conclude whether the hypotheses are plausible, and hence which model is adequate.
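A sketch computing the partial F statistic for dropping Tax by hand; it matches the anova() output shown on the ANOVA example page below:

> gR <- lm(Fuel ~ . - Tax, data=fueldata) # reduced model
> gF <- lm(Fuel ~ ., data=fueldata) # full model
> SSE.R <- sum(resid(gR)^2); SSE.F <- sum(resid(gF)^2)
> F.stat <- ((SSE.R - SSE.F)/1) / (SSE.F/df.residual(gF)) # p - r = 1 here
> F.stat # 4.3373
> 1 - pf(F.stat, 1, df.residual(gF)) # p-value 0.04287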

Page 24: Linear Regression using R

Example: F-test

> summary(lm(Fuel~.,data=fueldata))

Call:
lm(formula = Fuel ~ ., data = fueldata)

Residuals:
     Min       1Q   Median       3Q      Max
-163.145  -33.039    5.895   31.989  183.499

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  154.1928   194.9062   0.791 0.432938
Tax           -4.2280     2.0301  -2.083 0.042873 *
Dlic           0.4719     0.1285   3.672 0.000626 ***
Income        -6.1353     2.1936  -2.797 0.007508 **
logMiles      18.5453     6.4722   2.865 0.006259 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 64.89 on 46 degrees of freedom
Multiple R-squared: 0.5105,	Adjusted R-squared: 0.4679
F-statistic: 11.99 on 4 and 46 DF,  p-value: 9.33e-07

Page 25: Linear Regression using R

ANOVA


• In the analysis of variance, the mean function with all the terms,

$$E(Y \mid X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,$$

is compared with the mean function that includes only an intercept,

$$E(Y \mid X = x) = \beta_0.$$

• For the second case, $\hat{\beta}_0 = \bar{y}$ and the residual sum of squares is SYY.
• We have SSE ≤ SYY, and the difference between these two, SSreg = SYY − SSE, is the variability explained by the larger mean function that is not explained by the smaller mean function.

Page 26: Linear Regression using R

ANOVA


• By the F-test, we measure the goodness of fit of the regression model.

Source       df        SS      MS                    F           p-value
Regression   p         SSreg   MSreg = SSreg/p       MSreg/MSE
Residual     n-(p+1)   SSE     MSE = SSE/(n-(p+1))
Total        n-1       SYY

Page 27: Linear Regression using R

Example: ANOVA

> g1<-lm(Fuel~.-Tax,data=fueldata)
> g2<-lm(Fuel~.,data=fueldata)
> anova(g1,g2) # Reduced model vs. full model
Analysis of Variance Table

Model 1: Fuel ~ (Tax + Dlic + Income + logMiles) - Tax
Model 2: Fuel ~ Tax + Dlic + Income + logMiles
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     47 211964
2     46 193700  1     18264 4.3373 0.04287 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

• Since the p-value is 0.04 (< 0.05), we have modest evidence that the coefficient for Tax is different from 0. This is called a partial F-test.


Here the hypotheses are $H_0: \beta_{Tax} = 0$ vs. $H_1:$ not $H_0$.

Page 28: Linear Regression using R

Example: sequential ANOVA


> f0<-lm(Fuel~1,data=fueldata)
> f1<-lm(Fuel~Dlic,data=fueldata)
> f2<-lm(Fuel~Dlic+Tax,data=fueldata)
> f3<-lm(Fuel~Dlic+Tax+Income,data=fueldata)
> f4<-lm(Fuel~.,data=fueldata)

> anova(f0,f1,f2,f3,f4)
Analysis of Variance Table

Model 1: Fuel ~ 1
Model 2: Fuel ~ Dlic
Model 3: Fuel ~ Dlic + Tax
Model 4: Fuel ~ Dlic + Tax + Income
Model 5: Fuel ~ Tax + Dlic + Income + logMiles
  Res.Df    RSS Df Sum of Sq       F    Pr(>F)
1     50 395694
2     49 308840  1     86854 20.6262 4.019e-05 ***
3     48 289681  1     19159  4.5498 0.0382899 *
4     47 228273  1     61408 14.5833 0.0003997 ***
5     46 193700  1     34573  8.2104 0.0062592 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Page 29: Linear Regression using R

Variable Selection


• Usually, we don't expect every candidate predictor to be related to the response. We want to identify a useful subset of the variables (a sketch follows below):
 ▫ Forward selection
 ▫ Backward elimination
 ▫ Stepwise method
• By using the F-test, we can add variables to or remove variables from the model. The procedure ends when none of the candidate variables has a p-value smaller than the pre-specified value.
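A minimal sketch with R's built-in step(); note it is a variant of the procedure above, since step() compares models by AIC rather than by F-test p-values:

> full <- lm(Fuel ~ ., data=fueldata)
> step(full, direction="backward") # drops a term only when doing so lowers AIC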

Page 30: Linear Regression using R

Multicollinearity


• When the independent variables are correlated among themselves, multicollinearity among them is said to exist.
• Estimated regression coefficients vary widely when the independent variables are highly correlated: large changes occur when a predictor variable is added or deleted, or when an observation is altered or deleted.
• The Variance Inflation Factor (VIF) is a standard diagnostic for this instability.

Page 31: Linear Regression using R

Multicollinearity


• The variance inflation factor of the j-th predictor is

$$VIF_j = \frac{1}{1 - R_j^2},$$

where $R_j^2$ is the coefficient of determination when $X_j$ is regressed on the other X variables.
• VIFs measure how much the variances of the estimated regression coefficients are inflated compared to when the predictor variables are not linearly related.
• Generally, a maximum VIF value in excess of 5~10 is taken as an indication of multicollinearity.

> vif(lm(Fuel~.,data=fueldata))
     Tax     Dlic   Income logMiles
1.010786 1.040992 1.132311 1.099395
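A sketch verifying one value from the definition (the vif() call above is assumed to come from the car package):

> r2.tax <- summary(lm(Tax ~ Dlic + Income + logMiles, data=fueldata))$r.squared
> 1/(1 - r2.tax) # reproduces vif()'s 1.010786 for Tax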

Page 32: Linear Regression using R

Model Assumption


> par(mfrow=c(2,2))
> plot(lm(Fuel~.,data=fueldata))

Check the model assumptions:
1. Is the model linear?
2. Are the errors normally distributed?
3. Do the errors have equal variance?
4. Which values had a large influence on the fitted equation? => outlier detection

Page 33: Linear Regression using R

Residual Analysis


> result=lm(Fuel~.,data=fueldata)
> plot(resid(result)) # Residuals in observation order
> line1=sd(resid(result))*1.96
> line2=sd(resid(result))*-1.96
> abline(line1,0) # Horizontal reference lines at +/- 1.96 sd
> abline(line2,0)
> abline(0,0) # and at zero
> par(mfrow=c(1,2))
> boxplot(resid(result)) # Distribution of the residuals
> hist(resid(result))

Page 34: Linear Regression using R

Variable Transformation


• Box-Cox transformation: fit the model on a power-transformed response,

$$Y^{(\lambda)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon,$$

and select the $\lambda$ minimizing SSE. (Generally, it is between -2 and 2.)

> boxcox(lm(Fuel~.,data=fueldata))
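A sketch for reading off the selected λ numerically, assuming boxcox() here is MASS::boxcox (it reports the profile log-likelihood, whose maximum corresponds to the SSE-minimizing λ):

> library(MASS) # boxcox() lives in MASS
> bc <- boxcox(lm(Fuel~.,data=fueldata), plotit=FALSE)
> bc$x[which.max(bc$y)] # lambda with the largest profile log-likelihood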

Page 35: Linear Regression using R

Regression Procedure


• Preprocessing
▫ Data analysis
▫ Check the independence of the explanatory variables (e.g., via multicollinearity diagnostics)
▫ Check the normality of the error term
▫ Check the linearity of the data and the homoscedasticity of the errors via residual analysis
▫ Box-Cox transformation
▫ Check for outliers

• Regression analysis
▫ Examine the scatterplot and the covariance (or correlation) matrix
▫ Run a regression on the full model and t-test each variable
▫ Compare several variable selection methods and choose the best set of variables
▫ Interpret the model
▫ Present the final candidate model

• 회귀분석▫ Scatterplot 과 covariance ( 혹은 correlation) 행렬 조사▫ Full model 에 대한 회귀분석을 통해 각 변수의 t-test▫ 여러 가지 변수선택방법을 통해 상호 비교 후 최적 변수 선택▫ 모델 해석▫ 최종 후보 모델 제시