regression_3d
TRANSCRIPT
-
8/7/2019 regression_3D
1/50
1
Simple Linear Regression and Correlation
Chapter 17
17.2 The Model
The first order linear model:

y = β0 + β1x + ε

y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line (= rise/run)
ε = error variable

β0 and β1 are unknown; therefore, they are estimated from the data.
17.3 Estimating the Coefficients
The estimates are determined by
drawing a sample from the population of interest,
calculating sample statistics,
producing a straight line that runs through the data.
The question is: Which straight line fits best?
The best line is the one that minimizes the sum of squared vertical differences between the points and the line.

Let us compare two lines fitted to the four points (1,2), (2,4), (3,1.5), (4,3.2). The second line is horizontal, at height 2.5.

Line 1: sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
Line 2: sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data.
To calculate the estimates of the coefficients that minimize the differences between the data points and the line, use the formulas:

b1 = cov(X,Y) / sx²
b0 = ȳ − b1x̄

The regression equation that estimates the equation of the first order linear model is:

ŷ = b0 + b1x
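As a sketch (not part of the original slides), the two estimator formulas can be checked numerically with NumPy; the data values here are assumptions for illustration only:

```python
import numpy as np

# Hypothetical sample (assumed values, not from the text)
x = np.array([19.1, 25.0, 30.5, 36.0, 40.2, 45.8])
y = np.array([5.9, 5.7, 5.5, 5.4, 5.2, 5.0])

cov_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance cov(X, Y)
s2_x = np.var(x, ddof=1)              # sample variance s_x^2

b1 = cov_xy / s2_x                    # slope estimate b1 = cov(X,Y)/s_x^2
b0 = y.mean() - b1 * x.mean()         # intercept estimate b0 = y-bar - b1*x-bar

# Cross-check against NumPy's built-in least-squares fit
slope, intercept = np.polyfit(x, y, 1)
```

Both routes give the same line, since `np.polyfit` with degree 1 solves the same least-squares problem.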
Example 17.1 Relationship between odometer reading and a used car's selling price.
A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
A random sample of 100 cars is selected, and the data recorded.
Find the regression line.

A partial list of the data:

Car  Odometer  Price
…    …         …
5    31,705    5,784
6    34,010    5,359
…    …         …

Independent variable: x = odometer reading
Dependent variable: y = selling price
Solution: Solving by hand
To calculate b0 and b1 we need to calculate several statistics first:

x̄ = 36,009.45;  ȳ = 5,411.41;

cov(X,Y) = Σ(xi − x̄)(yi − ȳ) / (n − 1) = −1,356,256
sx² = Σ(xi − x̄)² / (n − 1) = 43,528,688

where n = 100.

b1 = cov(X,Y) / sx² = −1,356,256 / 43,528,688 = −.0312
b0 = ȳ − b1x̄ = 5,411.41 − (−.0312)(36,009.45) = 6,533

ŷ = b0 + b1x = 6,533 − .0312x
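The hand calculation above can be replayed directly from the quoted summary statistics (a quick check, using only numbers stated in the example):

```python
# Summary statistics quoted in Example 17.1
n = 100
x_bar = 36_009.45
y_bar = 5_411.41
cov_xy = -1_356_256
s2_x = 43_528_688

b1 = cov_xy / s2_x        # slope: about -.0312
b0 = y_bar - b1 * x_bar   # intercept: about 6,533
```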
Using the computer (see file Xm17-01.xls)

[Scatter plot of Price against Odometer with the regression line ŷ = 6,533 − .0312x]

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.806308
R Square           0.650132
Adjusted R Square  0.646562
Standard Error     151.5688
Observations       100

ANOVA
             df   SS        MS        F         Significance F
Regression    1   4183528   4183528   182.1056  4.4435E-24
Residual     98   2251362   22973.09
Total        99   6434890

            Coefficients  Standard Error  t Stat    P-value
Intercept   6533.383      84.51232        77.30687  1.22E-89
Odometer    -0.03116      0.002309        -13.4947  4.44E-24

Tools > Data Analysis > Regression > [Shade the y range and the x range] > OK
[Scatter plot of Price against Odometer with the regression line ŷ = 6,533 − .0312x]

This is the slope of the line: for each additional mile on the odometer, the price decreases by an average of $0.0312.

The intercept is b0 = 6,533. There are no data near x = 0, so do not interpret the intercept as the price of cars that have not been driven.
17.4 Error Variable: Required Conditions
The error ε is a critical part of the regression model.
Four requirements involving the distribution of ε must be satisfied:
The probability distribution of ε is normal.
The mean of ε is zero: E(ε) = 0.
The standard deviation of ε is σε, a constant, for all values of x.
The errors associated with different values of y are all independent.
From the first three assumptions we have: y is normally distributed with mean E(y) = β0 + β1x and a constant standard deviation σε.

[Figure: normal curves of y at x1, x2, x3, centered at E(y|x1) = β0 + β1x1, E(y|x2) = β0 + β1x2, E(y|x3) = β0 + β1x3]

The standard deviation remains constant, but the mean value changes with x.
17.5 Assessing the Model
The least squares method will produce a regression line whether or not there is a linear relationship between x and y.
Consequently, it is important to assess how well the linear model fits the data.
Several methods are used to assess the model:
Testing and/or estimating the coefficients.
Using descriptive measurements.
Sum of squares for errors
This is the sum of squared differences between the points and the regression line:

SSE = Σi=1..n (yi − ŷi)²

It can serve as a measure of how well the line fits the data.
A shortcut formula: SSE = (n − 1)(sY² − [cov(X,Y)]² / sx²)
This statistic plays a role in every statistical technique we employ to assess the model.
Standard error of estimate
The mean error is equal to zero.
If σε is small, the errors tend to be close to zero (close to the mean error); then the model fits the data well.
Therefore, we can use σε as a measure of the suitability of using a linear model.
An unbiased estimator of σε² is given by sε²:

sε = √(SSE / (n − 2))   (the standard error of estimate)
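The identity between the direct definition of SSE and the shortcut formula can be verified on a toy sample (the data here are assumed, for illustration only):

```python
import numpy as np

# Hypothetical sample (assumed values)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 1.5, 3.2])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)          # least-squares coefficients
y_hat = b0 + b1 * x

# SSE directly from the residuals
sse = np.sum((y - y_hat) ** 2)

# SSE from the shortcut: (n-1) * (s_y^2 - cov(X,Y)^2 / s_x^2)
cov_xy = np.cov(x, y, ddof=1)[0, 1]
sse_shortcut = (n - 1) * (np.var(y, ddof=1) - cov_xy ** 2 / np.var(x, ddof=1))

s_eps = np.sqrt(sse / (n - 2))        # standard error of estimate
```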
Example 17.2
Calculate the standard error of estimate for Example 17.1, and describe what it tells you about the model fit.

Solution
Calculated before: sY² = Σ(yi − ȳ)² / (n − 1) = 6,434,890 / 99 = 64,999

SSE = (n − 1)(sY² − [cov(X,Y)]² / sx²) = 99(64,999 − (−1,356,256)² / 43,528,688) = 2,251,363

Thus, sε = √(SSE / (n − 2)) = √(2,251,363 / 98) = 151.6

It is hard to assess the model based on sε even when compared with the mean value of y: sε = 151.6, ȳ = 5,411.4.
Testing the slope
When no linear relationship exists between two variables, the regression line should be horizontal.
Linear relationship: different inputs (x) yield different outputs (y); the slope is not equal to zero.
No linear relationship: different inputs (x) yield the same output (y); the slope is equal to zero.
We can draw inferences about β1 from b1 by testing
H0: β1 = 0
H1: β1 ≠ 0 (or < 0, or > 0)

The test statistic is

t = (b1 − β1) / s_b1

where s_b1 = sε / √((n − 1)sx²) is the standard error of b1.

If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n − 2.
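A minimal sketch of the test statistic, plugging in the quantities quoted for Example 17.1 (all numbers are taken from the slides):

```python
import math

# Quantities quoted in Example 17.1
n = 100
b1 = -0.0312
s_eps = 151.6
s2_x = 43_528_688

s_b1 = s_eps / math.sqrt((n - 1) * s2_x)  # standard error of b1, about .00231
t = (b1 - 0) / s_b1                       # test statistic for H0: beta1 = 0
```

The resulting t is about −13.5, matching the hand calculation on the next slide up to rounding of b1.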
Solution: Solving by hand
To compute t we need the values of b1 and s_b1.

b1 = −.0312
s_b1 = sε / √((n − 1)sx²) = 151.6 / √((99)(43,528,688)) = .00231
t = (b1 − β1) / s_b1 = (−.0312 − 0) / .00231 = −13.49

Using the computer:

            Coefficients  Standard Error  t Stat    P-value
Intercept   6533.383      84.51232        77.30687  1.22E-89
Odometer    -0.03116      0.002309        -13.4947  4.44E-24

There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
Coefficient of determination
When we want to measure the strength of the linear relationship, we use the coefficient of determination:

R² = [cov(X,Y)]² / (sx² sY²),  or equivalently  R² = 1 − SSE / Σ(yi − ȳ)²
To understand the significance of this coefficient, note the three sources of variation:
the overall variability in y,
the part explained by the regression model,
and the error.
Two data points (x1,y1) and (x2,y2) of a certain sample are shown.

Total variation in y:  (y1 − ȳ)² + (y2 − ȳ)²
Variation explained by the regression line:  (ŷ1 − ȳ)² + (ŷ2 − ȳ)²
Unexplained variation (error):  (y1 − ŷ1)² + (y2 − ŷ2)²

Total variation in y = Variation explained by the regression line + Unexplained variation (error)
R² measures the proportion of the variation in y that is explained by the variation in x:

R² = 1 − SSE / Σ(yi − ȳ)² = SSR / Σ(yi − ȳ)²

Variation in y = SSR + SSE

R² takes on any value between zero and one.
R² = 1: perfect match between the line and the data points.
R² = 0: there is no linear relationship between x and y.
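The two expressions for R² agree, as a quick numeric check on made-up data shows (values assumed for illustration):

```python
import numpy as np

# Hypothetical, nearly linear sample (assumed values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)     # unexplained variation
sst = np.sum((y - y.mean()) ** 2)  # total variation in y

r2_from_sse = 1 - sse / sst
r2_from_cov = np.cov(x, y, ddof=1)[0, 1] ** 2 / (np.var(x, ddof=1) * np.var(y, ddof=1))
```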
17.6 Finance Application: Market Model
One of the most important applications of linear regression is the market model.
It is assumed that the rate of return on a stock (R) is linearly related to the rate of return on the overall market:

R = β0 + β1Rm + ε

where R is the rate of return on a particular stock and Rm is the rate of return on some major stock index.
The beta coefficient measures how sensitive the stock's rate of return is to changes in the level of the overall market.
Example 17.5 The market model
Estimate the market model for Nortel, a stock traded on the Toronto Stock Exchange.
Data consisted of monthly percentage returns for Nortel and monthly percentage returns for all the stocks (the TSE index).

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.560079
R Square           0.313688
Adjusted R Square  0.301855
Standard Error     0.063123
Observations       60

ANOVA
             df  SS        MS        F         Significance F
Regression    1  0.10563   0.10563   26.50969  3.27E-06
Residual     58  0.231105  0.003985
Total        59  0.336734

            Coefficients  Standard Error  t Stat    P-value
Intercept   0.012818      0.008223        1.558903  0.12446
TSE         0.887691      0.172409        5.148756  3.27E-06

The slope is a measure of the stock's market-related risk. In this sample, for each 1% increase in the TSE return, the average increase in Nortel's return is .8877%.

R² is a measure of the total risk embedded in the Nortel stock that is market-related. Specifically, 31.37% of the variation in Nortel's returns is explained by the variation in the TSE's returns.
17.7 Using the Regression Equation
Before using the regression model, we need to assess how well it fits the data.
If we are satisfied with how well the model fits the data, we can use it to make predictions for y.

Illustration
Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer (Example 17.1):

ŷ = 6,533 − .0312x = 6,533 − .0312(40,000) = 5,285
Prediction interval and confidence interval
Two intervals can be used to discover how closely the predicted value will match the true value of y:
Prediction interval: for a particular value of y.
Confidence interval: for the expected value of y.

The prediction interval:
ŷ ± tα/2 sε √(1 + 1/n + (xg − x̄)² / Σ(xi − x̄)²)

The confidence interval:
ŷ ± tα/2 sε √(1/n + (xg − x̄)² / Σ(xi − x̄)²)

The prediction interval is wider than the confidence interval.
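Both interval half-widths can be sketched numerically with the quantities quoted in Examples 17.1 and 17.6 (t.025,98 = 1.984 is taken as given rather than computed):

```python
import math

# Quantities quoted in Examples 17.1 and 17.6
n = 100
x_bar = 36_009
s_eps = 151.6
sum_sq_x = 4_309_340_160      # sum of (x_i - x_bar)^2
t_crit = 1.984                # t_{.025, 98}

x_g = 40_000
y_hat = 6533 - 0.0312 * x_g   # point prediction, 5,285

d2 = (x_g - x_bar) ** 2 / sum_sq_x

half_pi = t_crit * s_eps * math.sqrt(1 + 1 / n + d2)  # prediction interval half-width
half_ci = t_crit * s_eps * math.sqrt(1 / n + d2)      # confidence interval half-width
```

The extra "1" under the square root is exactly what makes the prediction interval wider than the confidence interval.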
Example 17.6 Interval estimates for the car auction price
Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer.

Solution
The dealer would like to predict the price of a single car.
The prediction interval (95%):

ŷ ± t.025,98 sε √(1 + 1/n + (xg − x̄)² / Σ(xi − x̄)²)
= [6533 − .0312(40,000)] ± 1.984(151.6)√(1 + 1/100 + (40,000 − 36,009)² / 4,309,340,160)
= 5,285 ± 303
The car dealer wants to bid on a lot of 250 Ford Tauruses, where each car has been driven for about 40,000 miles.

Solution
The dealer needs to estimate the mean price per car.
The confidence interval (95%):

ŷ ± t.025,98 sε √(1/n + (xg − x̄)² / Σ(xi − x̄)²)
= [6533 − .0312(40,000)] ± 1.984(151.6)√(1/100 + (40,000 − 36,009)² / 4,309,340,160)
= 5,285 ± 35
The effect of the given value of x on the interval
As xg moves away from x̄ the interval becomes longer; that is, the shortest interval is found at xg = x̄.

ŷ = b0 + b1xg

The confidence interval when xg = x̄:
  ŷ ± tα/2 sε √(1/n)
The confidence interval when xg = x̄ ± 1:
  ŷ ± tα/2 sε √(1/n + 1² / Σ(xi − x̄)²)
The confidence interval when xg = x̄ ± 2:
  ŷ ± tα/2 sε √(1/n + 2² / Σ(xi − x̄)²)
17.8 Coefficient of Correlation
The coefficient of correlation is used to measure the strength of association between two variables.
The coefficient values range between −1 and 1.
If r = −1 (negative association) or r = +1 (positive association), every point falls on the regression line.
If r = 0 there is no linear pattern.
The coefficient can be used to test for a linear relationship between two variables.
Testing the coefficient of correlation
When there is no linear relationship between two variables, ρ = 0.
The hypotheses are:
H0: ρ = 0
H1: ρ ≠ 0
The test statistic is:

t = r √((n − 2) / (1 − r²))

where r is the sample coefficient of correlation, calculated by r = cov(X,Y) / (sx sy).
The statistic has a Student t distribution with d.f. = n − 2, provided the variables are bivariate normally distributed.
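A one-line check of the statistic with the Example 17.1 correlation (r = −.806, as quoted later in Example 17.7):

```python
import math

# Values quoted in Example 17.7
r = -0.806
n = 100

t = r * math.sqrt((n - 2) / (1 - r ** 2))  # about -13.5
```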
Example 17.7 Testing for linear relationship
Test the coefficient of correlation to determine if a linear relationship exists in the data of Example 17.1.

Solution
We test H0: ρ = 0 vs. H1: ρ ≠ 0.
Solving by hand:
The rejection region is |t| > tα/2,n−2 = t.025,98 ≈ 1.984.
The sample coefficient of correlation: r = cov(X,Y) / (sx sy) = −.806
The value of the t statistic is t = r √((n − 2) / (1 − r²)) = −13.49
Conclusion: There is sufficient evidence at α = 5% to infer that there is a linear relationship between the two variables.
Spearman rank correlation coefficient
The Spearman rank test is used to test whether a relationship exists between variables in cases where
at least one variable is ranked, or
both variables are quantitative but the normality requirement is not satisfied.
The hypotheses are:
H0: ρs = 0
H1: ρs ≠ 0
The test statistic is

rs = cov(a,b) / (sa sb)

where a and b are the ranks of the data.
For a large sample (n > 30), rs is approximately normally distributed, and

z = rs √(n − 1)
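The rank-based statistic can be sketched in pure Python (the helper names here are my own; ties are averaged, as the next example requires):

```python
def avg_ranks(values):
    """Rank values from 1 to n, averaging the ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend the group while the next sorted value is tied with this one
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """r_s = cov(a, b) / (s_a * s_b), computed on the ranks a and b."""
    a, b = avg_ranks(x), avg_ranks(y)
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (n - 1)
    sa = (sum((ai - ma) ** 2 for ai in a) / (n - 1)) ** 0.5
    sb = (sum((bi - mb) ** 2 for bi in b) / (n - 1)) ** 0.5
    return cov / (sa * sb)

# Any perfectly monotone pairing gives r_s = 1
rs = spearman([10, 20, 30, 40], [1.5, 2.0, 4.0, 9.0])
```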
Example 17.8
A production manager wants to examine the relationship between
the aptitude test score given prior to hiring, and
the performance rating three months after starting work.
A random sample of 20 production workers was selected. The test scores as well as the performance ratings were recorded.
Employee  Aptitude test  Performance rating
1         59             3
2         47             2
3         58             4
4         66             3
5         77             2
…         …              …

Aptitude test scores range from 0 to 100; performance ratings range from 1 to 5.

Solution
The problem objective is to analyze the relationship between two variables.
Performance rating is ranked.
The hypotheses are:
H0: ρs = 0
H1: ρs ≠ 0
The test statistic is rs, and the rejection region is |rs| > r_critical (taken from the Spearman rank correlation table).
Employee  Aptitude test  Rank a  Performance rating  Rank b
1         59             9       3                   10.5
2         47             3       2                   3.5
3         58             8       4                   17
4         66             14      3                   10.5
5         77             20      2                   3.5
…         …              …       …                   …

Ties are broken by averaging the ranks.

Solving by hand
Rank each variable separately.
Calculate sa = 5.92; sb = 5.5; cov(a,b) = 12.34
Thus rs = cov(a,b) / (sa sb) = .379.
The critical value for α = .05 and n = 20 is .450.

Conclusion: Do not reject the null hypothesis. At the 5% significance level there is insufficient evidence to infer that the two variables are related to one another.
Residual Analysis
Examining the residuals (or standardized residuals), we can identify violations of the required conditions.

Example 17.1 - continued: Nonnormality
Use Excel to obtain the standardized residual histogram.
Examine the histogram and look for a bell-shaped diagram with mean close to zero.
For each residual we calculate the standard deviation as follows:

s(ri) = sε √(1 − hi),  where  hi = 1/n + (xi − x̄)² / Σ(xj − x̄)²

Standardized residual i = Residual i / Standard deviation of residual i

[Output: a partial list of the standardized residuals, and a histogram of the standardized residuals that is roughly bell shaped with mean close to zero]

We can also apply the Lilliefors test or the χ² test of normality.
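The leverage-adjusted standard deviation of each residual can be sketched as follows (the data are assumed, for illustration only):

```python
import numpy as np

# Hypothetical sample (assumed values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.8, 8.3, 9.7, 12.2])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s_eps = np.sqrt(np.sum(resid ** 2) / (n - 2))  # standard error of estimate

# Leverage: h_i = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

std_resid = resid / (s_eps * np.sqrt(1 - h))   # standardized residuals
```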
Heteroscedasticity
When the requirement of a constant variance is violated, we have heteroscedasticity.

[Figure: plot of the residuals against the predicted values ŷ; the spread of the residuals increases with ŷ]
When the requirement of a constant variance is not violated, we have homoscedasticity.

[Figure: plot of the residuals against ŷ; the spread of the data points does not change much]
When the requirement of a constant variance is not violated, we have homoscedasticity.

[Figure: residual plot with an even spread; as far as the even spread goes, this is a much better situation]
Nonindependence of error variables
A time series is constituted if data were collected over time.
Examining the residuals over time, no pattern should be observed if the errors are independent.
When a pattern is detected, the errors are said to be autocorrelated.
Autocorrelation can be detected by graphing the residuals against time.
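One simple numeric complement to the residual-vs-time plot is the lag-1 sample autocorrelation of the residuals (a sketch on made-up residuals; values near zero suggest independence, values far from zero suggest autocorrelation):

```python
# Hypothetical residual series over time (assumed values);
# note the run of positives followed by a run of negatives
resid = [1.2, 0.8, 0.9, 0.4, -0.3, -0.7, -1.1, -0.6, 0.2, 0.9]
n = len(resid)
mean = sum(resid) / n

# Lag-1 sample autocorrelation of the residuals
num = sum((resid[t] - mean) * (resid[t - 1] - mean) for t in range(1, n))
den = sum((r - mean) ** 2 for r in resid)
r1 = num / den
```

For this series r1 is clearly positive, consistent with the runs visible in the numbers.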
[Figure: two plots of residuals against time]

Patterns in the appearance of the residuals over time indicate that autocorrelation exists:
Note the runs of positive residuals, replaced by runs of negative residuals.
Note the oscillating behavior of the residuals around zero.
Outliers
An outlier is an observation that is unusually small or large.
Several possibilities need to be investigated when an outlier is observed:
There was an error in recording the value.
The point does not belong in the sample.
The observation is valid.
Identify outliers from the scatter diagram.
It is customary to suspect an observation is an outlier if its |standard residual| > 2.
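The |standard residual| > 2 rule of thumb is a one-liner to apply (the residual values here are assumed, for illustration only):

```python
# Hypothetical standardized residuals (assumed values)
std_resid = [0.3, -1.1, 2.6, 0.8, -2.2, 1.4]

# Flag observations whose |standardized residual| exceeds 2
outliers = [i for i, r in enumerate(std_resid) if abs(r) > 2]
```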
[Figure: scatter diagrams showing an outlier and an influential observation]

The outlier causes a shift in the regression line; some outliers may be very influential.