regression_3d


  • 8/7/2019 regression_3D

    1/50

    1

Simple Linear Regression and Correlation

    Chapter 17


17.2 The Model

The first-order linear model:  y = β0 + β1x + ε

y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line (β1 = Rise/Run)
ε = error variable

β0 and β1 are unknown population parameters; therefore, they are estimated from the data.


17.3 Estimating the Coefficients

The estimates are determined by
drawing a sample from the population of interest,
calculating sample statistics, and
producing a straight line that cuts into the data.

The question is: which straight line fits best?

The best line is the one that minimizes the sum of squared vertical differences between the points and the line.

Let us compare two lines for the points (1,2), (2,4), (3,1.5), (4,3.2).

Line 1: sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89

Line 2 (the horizontal line y = 2.5): sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data.
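The comparison above can be sketched in a few lines of Python. The points come from the slide; the equation of the first candidate line is not stated explicitly, so here it is assumed to be y = x, which reproduces the squared terms shown.

```python
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(points, line):
    """Sum of squared vertical differences between the points and a line."""
    return sum((y - line(x)) ** 2 for x, y in points)

# Candidate 1 (assumed): the line y = x, matching the terms (2-1)^2 + ... above.
sse1 = sum_sq_diff(points, lambda x: x)
# Candidate 2: the horizontal line y = 2.5.
sse2 = sum_sq_diff(points, lambda x: 2.5)

print(sse1)  # ≈ 7.89
print(sse2)  # ≈ 3.99
```

The horizontal line wins this comparison (3.99 < 7.89); least squares simply extends the same comparison over every possible line.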


To calculate the estimates of the coefficients that minimize the differences between the data points and the line, use the formulas:

b1 = cov(X,Y) / s_x²
b0 = ȳ − b1·x̄

The regression equation that estimates the equation of the first-order linear model is:

ŷ = b0 + b1x


Example 17.1: Relationship between odometer reading and a used car's selling price.

A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.

A random sample of 100 cars is selected, and the data recorded. Find the regression line.

Car   Odometer   Price
...   ...        ...
5     31705      5784
6     34010      5359
...   ...        ...

Independent variable: x (odometer). Dependent variable: y (price).


Solution: solving by hand

To calculate b0 and b1 we need to calculate several statistics first (where n = 100):

x̄ = 36,009.45
ȳ = 5,411.41
cov(X,Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1) = −1,356,256
s_x² = Σ(xᵢ − x̄)² / (n − 1) = 43,528,688

b1 = cov(X,Y) / s_x² = −1,356,256 / 43,528,688 = −.0312
b0 = ȳ − b1·x̄ = 5,411.41 − (−.0312)(36,009.45) = 6,533

ŷ = b0 + b1x = 6,533 − .0312x
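The hand calculation can be checked with a short script built only from the summary statistics quoted above (the data file itself is not reproduced here):

```python
# Least squares coefficients from the summary statistics of Example 17.1.
x_bar = 36_009.45      # sample mean of odometer readings
y_bar = 5_411.41       # sample mean of prices
cov_xy = -1_356_256    # sample covariance; negative: price falls as mileage rises
s_x2 = 43_528_688      # sample variance of x

b1 = cov_xy / s_x2       # slope
b0 = y_bar - b1 * x_bar  # intercept

print(round(b1, 4))  # -0.0312
print(round(b0))     # 6533
```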

Using the computer (see file Xm17-01.xls):

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.806308
R Square           0.650132
Adjusted R Square  0.646562
Standard Error     151.5688
Observations       100

ANOVA
            df    SS        MS        F         Significance F
Regression  1     4183528   4183528   182.1056  4.4435E-24
Residual    98    2251362   22973.09
Total       99    6434890

           Coefficients  Standard Error  t Stat    P-value
Intercept   6533.383     84.51232        77.30687  1.22E-89
Odometer   -0.03116      0.002309       -13.4947   4.44E-24

ŷ = 6,533 − .0312x

[Scatter plot of Price (4500 to 6000) against Odometer (19000 to 49000) with the fitted line.]

Excel: Tools > Data Analysis > Regression > [specify the y range and the x range] > OK.

This is the slope of the line: for each additional mile on the odometer, the price decreases by an average of $0.0312.

ŷ = 6,533 − .0312x

The intercept is b0 = 6533. Because there are no data in the region around x = 0, do not interpret the intercept as the price of cars that have not been driven.

[Scatter plot of Price against Odometer; the fitted line extrapolated back to the intercept at 6533.]

17.4 Error Variable: Required Conditions

The error ε is a critical part of the regression model. Four requirements involving the distribution of ε must be satisfied:

The probability distribution of ε is normal.
The mean of ε is zero: E(ε) = 0.
The standard deviation of ε is σ_ε for all values of x.
The errors associated with different values of y are all independent.

From the first three assumptions we have: y is normally distributed with mean E(y) = β0 + β1x and a constant standard deviation σ_ε.

[Figure: normal curves of y centered at E(y|x1) = β0 + β1x1, E(y|x2) = β0 + β1x2, E(y|x3) = β0 + β1x3. The standard deviation remains constant, but the mean value changes with x.]

17.5 Assessing the Model

The least squares method will produce a regression line whether or not there is a linear relationship between x and y. Consequently, it is important to assess how well the linear model fits the data.

Several methods are used to assess the model:
Testing and/or estimating the coefficients.
Using descriptive measurements.

Sum of squares for errors

This is the sum of squared differences between the points and the regression line. It can serve as a measure of how well the line fits the data. This statistic plays a role in every statistical technique we employ to assess the model.

SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Shortcut formula: SSE = (n − 1)(s_y² − [cov(X,Y)]² / s_x²)

Standard error of estimate

The mean error is equal to zero. If σ_ε is small the errors tend to be close to zero (close to the mean error), and the model fits the data well. Therefore, we can use σ_ε as a measure of the suitability of using a linear model.

An unbiased estimator of σ_ε² is given by s_ε².

Standard Error of Estimate:  s_ε = sqrt(SSE / (n − 2))

Example 17.2

Calculate the standard error of estimate for Example 17.1, and describe what it tells you about the model fit.

Solution

s_y² = Σ(yᵢ − ȳ)² / (n − 1) = 6,434,890 / 99 = 64,999   (calculated before)

SSE = (n − 1)(s_y² − [cov(X,Y)]² / s_x²) = 99(64,999 − (1,356,256)² / 43,528,688) = 2,252,363

Thus, s_ε = sqrt(SSE / (n − 2)) = sqrt(2,252,363 / 98) = 151.6

It is hard to assess the model based on s_ε even when compared with the mean value of y: s_ε = 151.6, ȳ = 5,411.4.

Testing the slope

When no linear relationship exists between two variables, the regression line should be horizontal.

Linear relationship: different inputs (x) yield different outputs (y); the slope is not equal to zero.
No linear relationship: different inputs (x) yield the same output (y); the slope is equal to zero.

We can draw inference about β1 from b1 by testing

H0: β1 = 0
H1: β1 ≠ 0 (or < 0, or > 0)

The test statistic is t = (b1 − β1) / s_b1, where s_b1 = s_ε / sqrt((n − 1)s_x²) is the standard error of b1.

If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n − 2.

Solution: solving by hand

To compute t we need the values of b1 and s_b1.

b1 = −.0312
s_b1 = s_ε / sqrt((n − 1)s_x²) = 151.6 / sqrt((99)(43,528,688)) = .00231
t = (b1 − β1) / s_b1 = −.0312 / .00231 = −13.49

Using the computer: the Excel coefficients table (shown earlier) gives t = −13.4947 for Odometer, with P-value 4.44E-24.

There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
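The same t computation can be sketched directly, here using the Excel value b1 = -0.03116 (the hand value -.0312 is a rounding of it):

```python
import math

b1 = -0.03116          # slope estimate (from the Excel output)
s_eps = 151.6          # standard error of estimate
n = 100
s_x2 = 43_528_688      # sample variance of x

s_b1 = s_eps / math.sqrt((n - 1) * s_x2)  # standard error of b1
t = (b1 - 0) / s_b1                       # test statistic under H0: beta1 = 0

print(round(s_b1, 5))  # 0.00231
print(round(t, 2))     # -13.49
```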

Coefficient of determination

When we want to measure the strength of the linear relationship, we use the coefficient of determination:

R² = [cov(X,Y)]² / (s_x² s_y²)     or     R² = 1 − SSE / Σ(yᵢ − ȳ)²

To understand the significance of this coefficient, note that the overall variability in y is decomposed into two parts: the part accounted for by the regression model, and the error.

Two data points (x1,y1) and (x2,y2) of a certain sample are shown.

Total variation in y:                       (y1 − ȳ)² + (y2 − ȳ)²
Variation explained by the regression line: (ŷ1 − ȳ)² + (ŷ2 − ȳ)²
Unexplained variation (error):              (y1 − ŷ1)² + (y2 − ŷ2)²

Total variation in y = Variation explained by the regression line + Unexplained variation (error)

R² measures the proportion of the variation in y that is explained by the variation in x:

R² = 1 − SSE / Σ(yᵢ − ȳ)² = [Σ(yᵢ − ȳ)² − SSE] / Σ(yᵢ − ȳ)² = SSR / Σ(yᵢ − ȳ)²

Variation in y = SSR + SSE

R² takes on any value between zero and one.
R² = 1: perfect match between the line and the data points.
R² = 0: there is no linear relationship between x and y.
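For Example 17.1, both forms give the same value, which can be checked from the quantities already computed (s_y² = 64,999; SSE taken from the Excel ANOVA table):

```python
# R^2 two ways, from the Example 17.1 quantities.
cov_xy = -1_356_256
s_x2 = 43_528_688
s_y2 = 64_999          # = 6,434,890 / 99
sse = 2_251_362        # residual SS from the Excel ANOVA table
ss_total = 6_434_890   # total SS = sum of (y_i - y_bar)^2

r2_from_cov = cov_xy ** 2 / (s_x2 * s_y2)
r2_from_sse = 1 - sse / ss_total

print(round(r2_from_cov, 3))  # 0.65
print(round(r2_from_sse, 3))  # 0.65
```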


17.6 Finance Application: Market Model

One of the most important applications of linear regression is the market model.

It is assumed that the rate of return on a stock (R) is linearly related to the rate of return on the overall market (Rm):

R = β0 + β1Rm + ε

R = rate of return on a particular stock; Rm = rate of return on some major stock index.

The beta coefficient (β1) measures how sensitive the stock's rate of return is to changes in the level of the overall market.

Example 17.5: The market model

Estimate the market model for Nortel, a stock traded on the Toronto Stock Exchange (TSE). Data consisted of monthly percentage returns for Nortel and monthly percentage returns for all the stocks.

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.560079
R Square           0.313688
Adjusted R Square  0.301855
Standard Error     0.063123
Observations       60

ANOVA
            df    SS        MS        F         Significance F
Regression  1     0.10563   0.10563   26.50969  3.27E-06
Residual    58    0.231105  0.003985
Total       59    0.336734

           Coefficients  Standard Error  t Stat    P-value
Intercept   0.012818     0.008223        1.558903  0.12446
TSE         0.887691     0.172409        5.148756  3.27E-06

The slope (beta) is a measure of the stock's market-related risk: in this sample, for each 1% increase in the TSE return, the average increase in Nortel's return is .8877%.

R² is a measure of the total risk embedded in the Nortel stock that is market-related: specifically, 31.37% of the variation in Nortel's returns is explained by the variation in the TSE's returns.

17.7 Using the Regression Equation

Before using the regression model, we need to assess how well it fits the data. If we are satisfied with how well the model fits the data, we can use it to make predictions for y.

Illustration: predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer (Example 17.1).

ŷ = 6533 − .0312x = 6533 − .0312(40,000) = 5,285

Prediction interval and confidence interval

Two intervals can be used to discover how closely the predicted value will match the true value of y:

Prediction interval - for a particular value of y.
Confidence interval - for the expected value of y.

The confidence interval:  ŷ ± t_{α/2} s_ε sqrt(1/n + (x_g − x̄)² / Σ(xᵢ − x̄)²)

The prediction interval:  ŷ ± t_{α/2} s_ε sqrt(1 + 1/n + (x_g − x̄)² / Σ(xᵢ − x̄)²)

The prediction interval is wider than the confidence interval.
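The two intervals differ only in the extra 1 under the square root. A sketch for Example 17.1 at x_g = 40,000, using quantities quoted on the surrounding slides (t_.025,98 ≈ 1.984):

```python
import math

y_hat = 6533 - 0.0312 * 40_000   # point prediction
t_crit = 1.984                   # t_.025,98 (approximate)
s_eps = 151.6
n = 100
x_bar = 36_009
sxx = 4_309_340_160              # sum of (x_i - x_bar)^2
x_g = 40_000

stretch = 1 / n + (x_g - x_bar) ** 2 / sxx
half_ci = t_crit * s_eps * math.sqrt(stretch)       # confidence half-width
half_pi = t_crit * s_eps * math.sqrt(1 + stretch)   # prediction half-width

print(round(y_hat))    # 5285
print(round(half_ci))  # 35
print(round(half_pi))  # 303
```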

Example 17.6: Interval estimates for the car auction price

Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer.

Solution: the dealer would like to predict the price of a single car, so the prediction interval (95%) is appropriate:

ŷ ± t_{α/2} s_ε sqrt(1 + 1/n + (x_g − x̄)² / Σ(xᵢ − x̄)²)
= [6533 − .0312(40,000)] ± 1.984(151.6) sqrt(1 + 1/100 + (40,000 − 36,009)² / 4,309,340,160)
= 5,285 ± 303

(t_{.025,98} ≈ 1.984)

The car dealer wants to bid on a lot of 250 Ford Tauruses, where each car has been driven for about 40,000 miles.

Solution: the dealer needs to estimate the mean price per car, so the confidence interval (95%) is appropriate:

ŷ ± t_{α/2} s_ε sqrt(1/n + (x_g − x̄)² / Σ(xᵢ − x̄)²)
= [6533 − .0312(40,000)] ± 1.984(151.6) sqrt(1/100 + (40,000 − 36,009)² / 4,309,340,160)
= 5,285 ± 35

The effect of the given value of x on the interval

As x_g moves away from x̄, the interval becomes longer. That is, the shortest interval is found at x_g = x̄.

ŷ = b0 + b1x_g

[Figure: confidence intervals plotted at x_g = x̄, x_g = x̄ ± 1, and x_g = x̄ ± 2; the interval widens as x_g moves away from x̄.]

17.8 Coefficient of Correlation

The coefficient of correlation is used to measure the strength of association between two variables. The coefficient values range between -1 and 1.

If r = -1 (negative association) or r = +1 (positive association), every point falls on the regression line.
If r = 0 there is no linear pattern.

The coefficient can be used to test for a linear relationship between two variables.

Testing the coefficient of correlation

When there is no linear relationship between two variables, ρ = 0.

The hypotheses are:
H0: ρ = 0
H1: ρ ≠ 0

The test statistic is t = r sqrt((n − 2) / (1 − r²)), where r is the sample coefficient of correlation, calculated by r = cov(X,Y) / (s_x s_y).

The statistic is Student t distributed with d.f. = n − 2, provided the variables are bivariate normally distributed.

Example 17.7: Testing for linear relationship

Test the coefficient of correlation to determine if a linear relationship exists in the data of Example 17.1.

Solution: we test H0: ρ = 0 against H1: ρ ≠ 0.

Solving by hand: the rejection region is |t| > t_{α/2,n−2} = t_{.025,98} ≈ 1.984.

The sample coefficient of correlation: r = cov(X,Y) / (s_x s_y) = −.806

The value of the t statistic: t = r sqrt((n − 2) / (1 − r²)) = −13.49

Conclusion: there is sufficient evidence at α = 5% to infer that there is a linear relationship between the two variables.
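A minimal check of the statistic from r and n alone:

```python
import math

r = -0.806   # sample coefficient of correlation, from the example
n = 100

t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))  # -13.48 (the slide reports -13.49, computed from an unrounded r)
```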

Spearman rank correlation coefficient

The Spearman rank test is used to test whether a relationship exists between variables in cases where
at least one variable is ranked, or
both variables are quantitative but the normality requirement is not satisfied.

The hypotheses are:
H0: ρ_s = 0
H1: ρ_s ≠ 0

The test statistic is r_s = cov(a,b) / (s_a s_b), where a and b are the ranks of the data.

For a large sample (n > 30), r_s is approximately normally distributed and z = r_s sqrt(n − 1).

Example 17.8

A production manager wants to examine the relationship between
the aptitude test score given prior to hiring, and
the performance rating three months after starting work.

A random sample of 20 production workers was selected. The test scores as well as the performance ratings were recorded.

Employee   Aptitude test   Performance rating
1          59              3
2          47              2
3          58              4
4          66              3
5          77              2
.          .               .
.          .               .

Aptitude test scores range from 0 to 100; performance ratings range from 1 to 5.

Solution

The problem objective is to analyze the relationship between two variables. Performance rating is ranked.

The hypotheses are:
H0: ρ_s = 0
H1: ρ_s ≠ 0

The test statistic is r_s, and the rejection region is |r_s| > r_critical (taken from the Spearman rank correlation table).

Solving by hand

Rank each variable separately; ties are broken by averaging the ranks.

[Partial table of ranks: aptitude ranks 9, 3, 8, 14, 2, ...; performance ranks 10.5, 3.5, 17, 10.5, 3.5, ...]

Calculate s_a = 5.92, s_b = 5.50, cov(a,b) = 12.34.

Thus r_s = cov(a,b) / (s_a s_b) = .379.

The critical value for α = .05 and n = 20 is .450.

Conclusion: do not reject the null hypothesis. At the 5% significance level there is insufficient evidence to infer that the two variables are related to one another.

Residual Analysis

Examining the residuals (or standardized residuals), we can identify violations of the required conditions.

Example 17.1 - continued: Nonnormality.

Use Excel to obtain the standardized residual histogram. Examine the histogram and look for a bell-shaped histogram with mean close to zero.

For each residual we calculate the standard deviation as follows:

s_{r_i} = s_ε sqrt(1 − h_i),   where h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)²

Standardized residual i = residual i / standard deviation of residual i

[Partial list of standardized residuals and the standardized-residual histogram.]

We can also apply the Lilliefors test or the χ² test of normality.

Heteroscedasticity

When the requirement of a constant variance is violated we have heteroscedasticity.

[Residual plot: residuals (y − ŷ) against ŷ; the spread increases with ŷ.]

When the requirement of a constant variance is not violated we have homoscedasticity.

[Residual plot against ŷ: the spread of the data points does not change much.]

When the requirement of a constant variance is not violated we have homoscedasticity.

[Residual plot against ŷ: as far as the even spread is concerned, this is a much better situation.]

Nonindependence of error variables

A time series is constituted if data were collected over time.

If the errors are independent, no pattern should be observed when the residuals are examined over time. When a pattern is detected, the errors are said to be autocorrelated.

Autocorrelation can be detected by graphing the residuals against time.

[Residual-vs-time plots: patterns in the appearance of the residuals over time indicate that autocorrelation exists. One plot shows runs of positive residuals replaced by runs of negative residuals; the other shows oscillating behavior of the residuals around zero.]

Outliers

An outlier is an observation that is unusually small or large.

Several possibilities need to be investigated when an outlier is observed:
There was an error in recording the value.
The point does not belong in the sample.
The observation is valid.

Identify outliers from the scatter diagram. It is customary to suspect that an observation is an outlier if its |standardized residual| > 2.

[Scatter plots: an outlier causes a shift in the regression line, and some outliers may be very influential. One panel shows an outlier; the other shows an influential observation.]