linear regression
DESCRIPTION
Linear Regression. Hypothesis testing and Estimation. Assume that we have collected data on two variables X and Y. Let (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).
![Page 1: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/1.jpg)
Linear Regression
Hypothesis testing and Estimation
![Page 2: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/2.jpg)
Assume that we have collected data on two variables X and Y. Let
(x1, y1), (x2, y2), (x3, y3), …, (xn, yn)
denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).
![Page 3: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/3.jpg)
The Statistical Model
![Page 4: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/4.jpg)
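Each $y_i$ is assumed to be randomly generated from a normal distribution with mean $\mu_i = \alpha + \beta x_i$ and standard deviation $\sigma$ ($\alpha$, $\beta$ and $\sigma$ are unknown).
[Figure: the line $Y = \alpha + \beta X$ with slope $\beta$; at $x_i$ the observation $y_i$ is scattered about the height $\alpha + \beta x_i$.]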
![Page 5: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/5.jpg)
The Data and the Linear Regression Model
• The data falls roughly about a straight line.
[Figure: scatter plot of the data about the line $Y = \alpha + \beta X$, which is unseen.]
![Page 6: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/6.jpg)
The Least Squares Line
Fitting the best straight line
to “linear” data
![Page 7: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/7.jpg)
Let
$$Y = a + bX$$
denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y.
For example, if $X = x_i$ (as for the ith case) then the predicted value of Y is:
$$\hat{y}_i = a + b x_i$$
![Page 8: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/8.jpg)
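The residual
$$r_i = y_i - \hat{y}_i = y_i - (a + b x_i)$$
can be computed for each case in the sample:
$$r_1 = y_1 - \hat{y}_1,\quad r_2 = y_2 - \hat{y}_2,\quad \ldots,\quad r_n = y_n - \hat{y}_n$$
The residual sum of squares (RSS)
$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
is a measure of the “goodness of fit” of the line Y = a + bX to the data.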
![Page 9: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/9.jpg)
The optimal choice of a and b will result in the residual sum of squares
$$RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
attaining a minimum.
If this is the case then the line
Y = a + bX
is called the Least Squares Line.
![Page 10: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/10.jpg)
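The equation for the least squares line
Let
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$$
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$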
![Page 11: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/11.jpg)
![Page 12: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/12.jpg)
![Page 13: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/13.jpg)
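Computing Formulae:
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$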
![Page 14: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/14.jpg)
Then the slope of the least squares line can be shown to be:
$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
![Page 15: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/15.jpg)
and the intercept of the least squares line can be shown to be:
$$a = \bar{y} - b\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}$$
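As a rough illustration of the slope and intercept formulas, the Python sketch below computes $S_{xx}$, $S_{xy}$, b and a directly from their definitions. The function name `least_squares_line` and the four data points are made up for this illustration and do not come from the slides.

```python
def least_squares_line(xs, ys):
    """Return (a, b): intercept and slope of the least squares line Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross products about the means
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx          # slope b = Sxy / Sxx
    a = y_bar - b * x_bar    # intercept a = y_bar - b * x_bar
    return a, b


# Hypothetical points lying exactly on Y = 1 + 2X
a, b = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```

Because the made-up points lie exactly on a line, the fit should recover that line (intercept 1, slope 2) up to floating-point rounding.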
![Page 16: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/16.jpg)
The residual sum of Squares
$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
Computing formula:
$$RSS = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$
![Page 17: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/17.jpg)
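Estimating $\sigma$, the standard deviation in the regression model:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - a - b x_i)^2}{n-2}}$$
Computing formula:
$$s = \sqrt{\frac{1}{n-2}\left(S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right)}$$
This estimate of $\sigma$ is said to be based on n – 2 degrees of freedom.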
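The sketch below, in Python, checks numerically that the computing formula $S_{yy} - S_{xy}^2/S_{xx}$ agrees with the residual sum of squares obtained directly from the residuals, and then forms the estimate s on n – 2 degrees of freedom. The function name and the five data points are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math


def residual_std_dev(xs, ys):
    """Return (s, rss_direct, rss_formula) for the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    # RSS from the residuals y_i - (a + b x_i)
    rss_direct = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # RSS from the computing formula
    rss_formula = s_yy - s_xy ** 2 / s_xx
    s = math.sqrt(rss_formula / (n - 2))   # n - 2 degrees of freedom
    return s, rss_direct, rss_formula


# Hypothetical data: Sxx = 10, Syy = 6, Sxy = 6, so RSS = 6 - 36/10 = 2.4
s, rss_direct, rss_formula = residual_std_dev([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```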
![Page 18: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/18.jpg)
Sampling distributions of the estimators
![Page 19: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/19.jpg)
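The sampling distribution of the slope of the least squares line:
$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
It can be shown that b has a normal distribution with mean and standard deviation
$$\mu_b = \beta \quad\text{and}\quad \sigma_b = \frac{\sigma}{\sqrt{S_{xx}}} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$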
![Page 20: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/20.jpg)
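Thus
$$z = \frac{b - \mu_b}{\sigma_b} = \frac{b - \beta}{\sigma / \sqrt{S_{xx}}}$$
has a standard normal distribution, and
$$t = \frac{b - \mu_b}{s_b} = \frac{b - \beta}{s / \sqrt{S_{xx}}}$$
has a t distribution with df = n – 2.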
![Page 21: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/21.jpg)
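$(1 - \alpha)100\%$ Confidence Limits for slope $\beta$:
$$\hat{\beta} \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$
where $\hat{\beta} = b$ is the least squares slope and $t_{\alpha/2}$ is the critical value for the t-distribution with n – 2 degrees of freedom.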
![Page 22: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/22.jpg)
Testing the slope
$$H_0: \beta = \beta_0 \quad\text{vs}\quad H_A: \beta \neq \beta_0$$
The test statistic is:
$$t = \frac{b - \beta_0}{s / \sqrt{S_{xx}}}$$
- has a t distribution with df = n – 2 if H0 is true.
![Page 23: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/23.jpg)
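The Critical Region
Reject $H_0: \beta = \beta_0$ in favour of $H_A: \beta \neq \beta_0$ if
$$t = \frac{b - \beta_0}{s / \sqrt{S_{xx}}} < -t_{\alpha/2} \quad\text{or}\quad t > t_{\alpha/2}$$
df = n – 2
This is a two-tailed test. One-tailed tests are also possible.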
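A minimal Python sketch of the two-tailed test for the slope follows. The helper names are hypothetical, the inputs (b = 0.6, s² = 0.8, Sxx = 10, n = 5) are made-up numbers, and the critical value 3.182 is the standard table value of $t_{0.025}$ with 3 degrees of freedom; in practice the critical value would be read from a t-table for n – 2 degrees of freedom.

```python
import math


def slope_t_statistic(b, beta0, s, s_xx):
    """t = (b - beta0) / (s / sqrt(Sxx)); has n - 2 df when H0 is true."""
    return (b - beta0) / (s / math.sqrt(s_xx))


def reject_two_tailed(t, t_crit):
    """Two-tailed critical region: reject H0 if t < -t_crit or t > t_crit."""
    return abs(t) > t_crit


# Hypothetical summary values for a sample of n = 5 cases
t = slope_t_statistic(b=0.6, beta0=0.0, s=math.sqrt(0.8), s_xx=10.0)
reject = reject_two_tailed(t, t_crit=3.182)   # t_{0.025}, 3 df
print(round(t, 3), reject)
```

With these made-up values the statistic falls inside the acceptance region, so H0 would not be rejected at the 5% level.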
![Page 24: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/24.jpg)
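The sampling distribution of the intercept of the least squares line:
$$a = \bar{y} - b\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}$$
It can be shown that a has a normal distribution with mean and standard deviation
$$\mu_a = \alpha \quad\text{and}\quad \sigma_a = \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$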
![Page 25: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/25.jpg)
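Thus
$$z = \frac{a - \mu_a}{\sigma_a} = \frac{a - \alpha}{\sigma\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}$$
has a standard normal distribution and
$$t = \frac{a - \mu_a}{s_a} = \frac{a - \alpha}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}$$
has a t distribution with df = n – 2.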
![Page 26: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/26.jpg)
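$(1 - \alpha)100\%$ Confidence Limits for intercept $\alpha$:
$$\hat{\alpha} \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$
where $\hat{\alpha} = a$ is the least squares intercept and $t_{\alpha/2}$ is the critical value for the t-distribution with n – 2 degrees of freedom.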
![Page 27: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/27.jpg)
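Testing the intercept
$$H_0: \alpha = \alpha_0 \quad\text{vs}\quad H_A: \alpha \neq \alpha_0$$
The test statistic is:
$$t = \frac{a - \alpha_0}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}$$
- has a t distribution with df = n – 2 if H0 is true.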
![Page 28: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/28.jpg)
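The Critical Region
Reject $H_0: \alpha = \alpha_0$ in favour of $H_A: \alpha \neq \alpha_0$ if
$$t = \frac{a - \alpha_0}{s_a} < -t_{\alpha/2} \quad\text{or}\quad t > t_{\alpha/2}$$
df = n – 2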
![Page 29: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/29.jpg)
Example
![Page 30: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/30.jpg)
The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950.

TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.

| Country (i) | Xi | Yi |
|---|---|---|
| Australia | 48 | 18 |
| Canada | 50 | 15 |
| Denmark | 38 | 17 |
| Finland | 110 | 35 |
| Great Britain | 110 | 46 |
| Holland | 49 | 24 |
| Iceland | 23 | 6 |
| Norway | 25 | 9 |
| Sweden | 30 | 11 |
| Switzerland | 51 | 25 |
| USA | 130 | 20 |
![Page 31: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/31.jpg)
[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, with each point labelled by country.]
![Page 32: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/32.jpg)
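Fitting the Least Squares Line
$$\sum_{i=1}^{n} x_i = 664, \qquad \sum_{i=1}^{n} y_i = 226$$
$$\sum_{i=1}^{n} x_i^2 = 54{,}404, \qquad \sum_{i=1}^{n} y_i^2 = 6{,}018, \qquad \sum_{i=1}^{n} x_i y_i = 16{,}914$$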
![Page 33: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/33.jpg)
Fitting the Least Squares Line
First compute the following three quantities:
$$S_{xx} = 54404 - \frac{(664)^2}{11} = 14322.55$$
$$S_{yy} = 6018 - \frac{(226)^2}{11} = 1374.73$$
$$S_{xy} = 16914 - \frac{(664)(226)}{11} = 3271.82$$
![Page 34: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/34.jpg)
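Computing Estimates of Slope ($\beta$), Intercept ($\alpha$) and standard deviation ($\sigma$):
$$b = \frac{S_{xy}}{S_{xx}} = \frac{3271.82}{14322.55} = 0.228$$
$$a = \bar{y} - b\bar{x} = \frac{226}{11} - (0.228)\left(\frac{664}{11}\right) = 6.756$$
$$s = \sqrt{\frac{1}{n-2}\left(S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right)} = 8.35$$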
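The arithmetic for this example can be re-checked with a short Python sketch that starts from the n = 11 data pairs in the table and applies the computing formulae. The variable names are illustrative; the slope comes out at about 0.228, in line with the fitted line Y = 6.756 + (0.228)X reported with the plot of the data.

```python
import math

# (X, Y) pairs copied from the table of n = 11 countries
data = {
    "Australia": (48, 18), "Canada": (50, 15), "Denmark": (38, 17),
    "Finland": (110, 35), "Great Britain": (110, 46), "Holland": (49, 24),
    "Iceland": (23, 6), "Norway": (25, 9), "Sweden": (30, 11),
    "Switzerland": (51, 25), "USA": (130, 20),
}
xs = [x for x, _ in data.values()]
ys = [y for _, y in data.values()]
n = len(xs)

# Computing formulae for Sxx, Syy and Sxy
s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

b = s_xy / s_xx                                      # slope estimate
a = sum(ys) / n - b * sum(xs) / n                    # intercept estimate
s = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))   # estimate of sigma, 9 df

print(round(s_xx, 2), round(s_yy, 2), round(s_xy, 2))
print(round(b, 3), round(a, 3), round(s, 2))
```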
![Page 35: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/35.jpg)
95% Confidence Limits for slope $\beta$:
$$\hat{\beta} \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$
$$0.228 \pm 2.262\,\frac{8.35}{\sqrt{14322.55}}$$
0.0706 to 0.3862
$t_{.025} = 2.262$ is the critical value for the t-distribution with 9 degrees of freedom.
![Page 36: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/36.jpg)
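95% Confidence Limits for intercept $\alpha$:
$$\hat{\alpha} \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$
$$6.756 \pm 2.262\,(8.35)\sqrt{\frac{1}{11} + \frac{(664/11)^2}{14322.55}}$$
-4.34 to 17.85
$t_{.025} = 2.262$ is the critical value for the t-distribution with 9 degrees of freedom.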
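As a check on the two sets of 95% limits, the Python sketch below plugs the summary quantities of this example (n = 11, Σx = 664, Σy = 226, Sxx = 14322.55, Sxy = 3271.82, s = 8.35) and the tabulated critical value 2.262 into the confidence limit formulas. The variable names are illustrative only.

```python
import math

n = 11
x_bar = 664 / n
y_bar = 226 / n
s_xx = 14322.55
s_xy = 3271.82
s = 8.35
t_crit = 2.262          # t_{0.025} with 9 df, as quoted in the example

b = s_xy / s_xx         # slope estimate
a = y_bar - b * x_bar   # intercept estimate

# Half-widths of the confidence intervals
half_slope = t_crit * s / math.sqrt(s_xx)
half_intercept = t_crit * s * math.sqrt(1 / n + x_bar ** 2 / s_xx)

slope_limits = (b - half_slope, b + half_slope)
intercept_limits = (a - half_intercept, a + half_intercept)
print([round(v, 4) for v in slope_limits])
print([round(v, 2) for v in intercept_limits])
```

The results should agree with the limits quoted above to within rounding of the intermediate values.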
![Page 37: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/37.jpg)
[Figure: scatter plot of death rates from lung cancer (1950) against per capita consumption of cigarettes, points labelled by country, with the fitted least squares line.]
Y = 6.756 + (0.228)X
95% confidence limits for slope: 0.0706 to 0.3862
95% confidence limits for intercept: -4.34 to 17.85
![Page 38: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/38.jpg)
Testing for a positive slope
$$H_0: \beta = 0 \quad\text{vs}\quad H_A: \beta > 0$$
The test statistic is:
$$t = \frac{b - 0}{s / \sqrt{S_{xx}}}$$
![Page 39: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/39.jpg)
The Critical Region
Reject $H_0: \beta = 0$ in favour of $H_A: \beta > 0$ if
$$t = \frac{b - 0}{s / \sqrt{S_{xx}}} > t_{0.05} = 1.833$$
df = 11 – 2 = 9
A one-tailed test
![Page 40: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/40.jpg)
Since
$$t = \frac{b - 0}{s / \sqrt{S_{xx}}} = \frac{0.228}{8.35 / \sqrt{14322.55}} = 3.27 > 1.833$$
we reject $H_0: \beta = 0$ and conclude $H_A: \beta > 0$.
![Page 41: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/41.jpg)
Confidence Limits for Points on the Regression Line
• The intercept $\alpha$ is a specific point on the regression line.
• It is the y-coordinate of the point on the regression line when x = 0.
• It is the predicted value of y when x = 0.
• We may also be interested in other points on the regression line, e.g. when $x = x_0$.
• In this case the y-coordinate of the point on the regression line when $x = x_0$ is $\alpha + \beta x_0$.
![Page 42: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/42.jpg)
[Figure: the regression line $y = \alpha + \beta x$, with the point at $x = x_0$ at height $\alpha + \beta x_0$.]
![Page 43: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/43.jpg)
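$(1 - \alpha)100\%$ Confidence Limits for $\alpha + \beta x_0$:
$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
$t_{\alpha/2}$ is the $\alpha/2$ critical value for the t-distribution with n – 2 degrees of freedom.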
![Page 44: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/44.jpg)
Prediction Limits for new values of the Dependent variable y
• An important application of the regression line is prediction.
• Knowing the value of x ($x_0$), what is the value of y?
• The predicted value of y when $x = x_0$ is:
$$\hat{y} = \alpha + \beta x_0$$
• This in turn can be estimated by:
$$\hat{y} = \hat{\alpha} + \hat{\beta} x_0 = a + b x_0$$
![Page 45: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/45.jpg)
The predictor
$$\hat{y} = \hat{\alpha} + \hat{\beta} x_0 = a + b x_0$$
• Gives only a single value for y.
• A more appropriate piece of information would be a range of values.
• A range of values that has a fixed probability of capturing the value for y.
• A $(1 - \alpha)100\%$ prediction interval for y.
![Page 46: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/46.jpg)
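$(1 - \alpha)100\%$ Prediction Limits for y when $x = x_0$:
$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
$t_{\alpha/2}$ is the $\alpha/2$ critical value for the t-distribution with n – 2 degrees of freedom.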
![Page 47: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/47.jpg)
Example
In this example we are studying building fires in a city and are interested in the relationship between:
1. X = the distance (in miles) between the closest fire hall and the building that raises the alarm
and
2. Y = the cost of the damage (in $1000s)
The data was collected on n = 15 fires.
![Page 48: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/48.jpg)
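The Data

| Fire | Distance | Damage |
|---|---|---|
| 1 | 3.4 | 26.2 |
| 2 | 1.8 | 17.8 |
| 3 | 4.6 | 31.3 |
| 4 | 2.3 | 23.1 |
| 5 | 3.1 | 27.5 |
| 6 | 5.5 | 36.0 |
| 7 | 0.7 | 14.1 |
| 8 | 3.0 | 22.3 |
| 9 | 2.6 | 19.6 |
| 10 | 4.3 | 31.3 |
| 11 | 2.1 | 24.0 |
| 12 | 1.1 | 17.3 |
| 13 | 6.1 | 43.2 |
| 14 | 4.8 | 36.4 |
| 15 | 3.8 | 26.1 |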
![Page 49: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/49.jpg)
[Figure: Scatter Plot of Damage ($1000s) against Distance (miles).]
![Page 50: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/50.jpg)
Computations
From the data on the n = 15 fires (X = Distance, Y = Damage):
$$\sum_{i=1}^{n} x_i = 49.2, \qquad \sum_{i=1}^{n} y_i = 396.2$$
$$\sum_{i=1}^{n} x_i^2 = 196.16, \qquad \sum_{i=1}^{n} y_i^2 = 11376.5, \qquad \sum_{i=1}^{n} x_i y_i = 1470.65$$
![Page 51: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/51.jpg)
Computations Continued
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{49.2}{15} = 3.28$$
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{396.2}{15} = 26.4133$$
![Page 52: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/52.jpg)
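Computations Continued
$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 196.16 - \frac{(49.2)^2}{15} = 34.784$$
$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 11376.5 - \frac{(396.2)^2}{15} = 911.517$$
$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 1470.65 - \frac{(49.2)(396.2)}{15} = 171.114$$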
![Page 53: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/53.jpg)
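Computations Continued
$$b = \hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{171.114}{34.784} = 4.92$$
$$a = \hat{\alpha} = \bar{y} - b\bar{x} = 26.4133 - 4.919\,(3.28) = 10.28$$
$$s = \sqrt{\frac{S_{yy} - \dfrac{S_{xy}^2}{S_{xx}}}{n-2}} = \sqrt{\frac{911.517 - \dfrac{(171.114)^2}{34.784}}{13}} = 2.316$$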
![Page 54: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/54.jpg)
95% Confidence Limits for slope $\beta$:
$$\hat{\beta} \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$
4.07 to 5.77
$t_{.025} = 2.160$ is the critical value for the t-distribution with 13 degrees of freedom.
![Page 55: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/55.jpg)
95% Confidence Limits for intercept $\alpha$:
$$\hat{\alpha} \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$
7.21 to 13.35
$t_{.025} = 2.160$ is the critical value for the t-distribution with 13 degrees of freedom.
![Page 56: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/56.jpg)
[Figure: Least Squares Line, Damage ($1000s) against Distance (miles).]
y = 4.92x + 10.28
![Page 57: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/57.jpg)
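$(1 - \alpha)100\%$ Confidence Limits for $\alpha + \beta x_0$:
$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
$t_{\alpha/2}$ is the $\alpha/2$ critical value for the t-distribution with n – 2 degrees of freedom.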
![Page 58: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/58.jpg)
95% Confidence Limits for $\alpha + \beta x_0$:

| x0 | lower | upper |
|---|---|---|
| 1 | 12.87 | 17.52 |
| 2 | 18.43 | 21.80 |
| 3 | 23.72 | 26.35 |
| 4 | 28.53 | 31.38 |
| 5 | 32.93 | 36.82 |
| 6 | 37.15 | 42.44 |
![Page 59: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/59.jpg)
[Figure: 95% Confidence Limits for $\alpha + \beta x_0$, Damage ($1000s) against Distance (miles), showing the least squares line and the confidence limits.]
![Page 60: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/60.jpg)
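$(1 - \alpha)100\%$ Prediction Limits for y when $x = x_0$:
$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
$t_{\alpha/2}$ is the $\alpha/2$ critical value for the t-distribution with n – 2 degrees of freedom.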
![Page 61: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/61.jpg)
95% Prediction Limits for y when $x = x_0$:

| x0 | lower | upper |
|---|---|---|
| 1 | 9.68 | 20.71 |
| 2 | 14.84 | 25.40 |
| 3 | 19.86 | 30.21 |
| 4 | 24.75 | 35.16 |
| 5 | 29.51 | 40.24 |
| 6 | 34.13 | 45.45 |
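The two sets of limits for the fire data differ only by the extra "1 +" under the square root. A hedged Python sketch, starting from the 15 (distance, damage) pairs and the tabulated value $t_{.025} = 2.160$, is given below; the function name `limits` and its `prediction` flag are hypothetical conveniences, not part of the slides.

```python
import math

# Distance (miles) and damage ($1000s) for the n = 15 fires
distance = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]
damage = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3, 19.6, 31.3, 24.0, 17.3,
          43.2, 36.4, 26.1]
T_CRIT = 2.160   # t_{0.025} with 13 df, as quoted in the example

n = len(distance)
x_bar = sum(distance) / n
y_bar = sum(damage) / n
s_xx = sum((x - x_bar) ** 2 for x in distance)
s_yy = sum((y - y_bar) ** 2 for y in damage)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(distance, damage))
b = s_xy / s_xx
a = y_bar - b * x_bar
s = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))


def limits(x0, prediction=False):
    """95% limits at x0: for the line itself, or for a new y if prediction=True."""
    extra = 1.0 if prediction else 0.0   # the additional "1 +" for a new observation
    half = T_CRIT * s * math.sqrt(extra + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    centre = a + b * x0
    return centre - half, centre + half


for x0 in range(1, 7):
    ci = limits(x0)
    pi = limits(x0, prediction=True)
    print(x0, [round(v, 2) for v in ci], [round(v, 2) for v in pi])
```

The prediction limits are always wider than the confidence limits at the same $x_0$, because they also account for the scatter of a single new observation about the line.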
![Page 62: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/62.jpg)
[Figure: 95% Prediction Limits for y when $x = x_0$, Damage ($1000s) against Distance (miles), showing the least squares line and the prediction limits.]
![Page 63: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/63.jpg)
Linear Regression Summary
Hypothesis testing and Estimation
![Page 64: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/64.jpg)
$(1 - \alpha)100\%$ Confidence Limits for slope $\beta$:
$$\hat{\beta} \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$
$t_{\alpha/2}$ is the critical value for the t-distribution with n – 2 degrees of freedom.
![Page 65: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/65.jpg)
Testing the slope
$$H_0: \beta = \beta_0 \quad\text{vs}\quad H_A: \beta \neq \beta_0$$
The test statistic is:
$$t = \frac{b - \beta_0}{s / \sqrt{S_{xx}}}$$
- has a t distribution with df = n – 2 if H0 is true.
![Page 66: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/66.jpg)
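$(1 - \alpha)100\%$ Confidence Limits for intercept $\alpha$:
$$\hat{\alpha} \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$
$t_{\alpha/2}$ is the critical value for the t-distribution with n – 2 degrees of freedom.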
![Page 67: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/67.jpg)
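Testing the intercept
$$H_0: \alpha = \alpha_0 \quad\text{vs}\quad H_A: \alpha \neq \alpha_0$$
The test statistic is:
$$t = \frac{a - \alpha_0}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}$$
- has a t distribution with df = n – 2 if H0 is true.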
![Page 68: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/68.jpg)
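$(1 - \alpha)100\%$ Confidence Limits for $\alpha + \beta x_0$:
$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
$t_{\alpha/2}$ is the $\alpha/2$ critical value for the t-distribution with n – 2 degrees of freedom.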
![Page 69: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/69.jpg)
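$(1 - \alpha)100\%$ Prediction Limits for y when $x = x_0$:
$$a + b x_0 \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
$t_{\alpha/2}$ is the $\alpha/2$ critical value for the t-distribution with n – 2 degrees of freedom.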
![Page 70: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/70.jpg)
Correlation
![Page 71: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/71.jpg)
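Definition
The statistic:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
is called Pearson's correlation coefficient.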
![Page 72: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/72.jpg)
Properties
1. $-1 \le r \le 1$, $|r| \le 1$, $r^2 \le 1$
2. $|r| = 1$ (r = +1 or -1) if the points (x1, y1), (x2, y2), …, (xn, yn) lie along a straight line (positive slope for +1, negative slope for -1).
![Page 73: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/73.jpg)
The test for independence (zero correlation)
H0: X and Y are independent
HA: X and Y are correlated
The test statistic:
$$t = \sqrt{n-2}\,\frac{r}{\sqrt{1 - r^2}}$$
The Critical region
Reject H0 if $|t| > t_{\alpha/2}$ (df = n – 2)
This is a two-tailed critical region; the critical region could also be one-tailed.
![Page 74: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/74.jpg)
Example
In this example we are again studying building fires in a city and are interested in the relationship between:
1. X = the distance (in miles) between the closest fire hall and the building that raises the alarm
and
2. Y = the cost of the damage (in $1000s)
The data was collected on n = 15 fires. The data, the scatter plot and the computations are the same as in the regression example for these fires, which gave
$$\bar{x} = 3.28, \qquad \bar{y} = 26.4133$$
$$S_{xx} = 34.784, \qquad S_{yy} = 911.517, \qquad S_{xy} = 171.114$$
![Page 80: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/80.jpg)
The correlation coefficient
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{171.114}{\sqrt{(34.784)(911.517)}} = 0.961$$
The test for independence (zero correlation)
The test statistic:
$$t = \sqrt{n-2}\,\frac{r}{\sqrt{1 - r^2}} = \sqrt{13}\,\frac{0.961}{\sqrt{1 - (0.961)^2}} = 12.525$$
We reject H0: independence, if $|t| > t_{0.025} = 2.160$.
H0: independence, is rejected.
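The correlation coefficient and its test statistic can be re-derived from the three summary quantities of the fire data. The Python sketch below does this; the variable names are illustrative and the critical value 2.160 is the tabulated $t_{0.025}$ with 13 degrees of freedom quoted above.

```python
import math

# Summary quantities for the n = 15 fires
n = 15
s_xx = 34.784
s_yy = 911.517
s_xy = 171.114
t_crit = 2.160   # t_{0.025}, 13 df

r = s_xy / math.sqrt(s_xx * s_yy)                  # Pearson's correlation coefficient
t = math.sqrt(n - 2) * r / math.sqrt(1 - r ** 2)   # test statistic for independence
reject_h0 = abs(t) > t_crit

print(round(r, 3), round(t, 2), reject_h0)
```

Small differences in the last decimal of t can arise depending on whether r is rounded to 0.961 before it is substituted into the formula.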
![Page 81: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/81.jpg)
Relationship between Regression and Correlation
![Page 82: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/82.jpg)
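Recall
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
Also
$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}\sqrt{\frac{S_{yy}}{S_{xx}}} = r\sqrt{\frac{S_{yy}}{S_{xx}}} = r\,\frac{s_y}{s_x}$$
since
$$s_x = \sqrt{\frac{S_{xx}}{n-1}} \quad\text{and}\quad s_y = \sqrt{\frac{S_{yy}}{n-1}}$$
Thus the slope of the least squares line is simply the ratio of the standard deviations × the correlation coefficient.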
![Page 83: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/83.jpg)
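The test for independence (zero correlation)
H0: X and Y are independent
HA: X and Y are correlated
Uses the test statistic:
$$t = \sqrt{n-2}\,\frac{r}{\sqrt{1 - r^2}}$$
Note:
$$\hat{\beta} = r\sqrt{\frac{S_{yy}}{S_{xx}}} \quad\text{and}\quad r = \hat{\beta}\sqrt{\frac{S_{xx}}{S_{yy}}}$$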
![Page 84: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/84.jpg)
The two tests
1. The test for independence (zero correlation)
H0: X and Y are independent
HA: X and Y are correlated
2. The test for zero slope
H0: $\beta = 0$
HA: $\beta \neq 0$
are equivalent.
![Page 85: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/85.jpg)
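1. the test statistic for independence:
$$t = \sqrt{n-2}\,\frac{r}{\sqrt{1 - r^2}} = \sqrt{n-2}\,\frac{\dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}}{\sqrt{1 - \dfrac{S_{xy}^2}{S_{xx} S_{yy}}}} = \sqrt{n-2}\,\frac{\dfrac{S_{xy}}{\sqrt{S_{xx}}}}{\sqrt{S_{yy} - \dfrac{S_{xy}^2}{S_{xx}}}}$$
Thus
$$t = \frac{\dfrac{S_{xy}}{S_{xx}}}{\sqrt{\dfrac{1}{n-2}\left(S_{yy} - \dfrac{S_{xy}^2}{S_{xx}}\right)} \Big/ \sqrt{S_{xx}}} = \frac{\hat{\beta}}{s / \sqrt{S_{xx}}}$$
the same statistic as for testing for zero slope.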
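The equivalence of the two statistics can also be checked numerically. The following Python sketch computes t once from the correlation coefficient and once from the slope and compares the two; the five data points are made up for this check and the variable names are illustrative.

```python
import math

# Hypothetical data, not taken from the slides
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Statistic for the test of independence (zero correlation)
r = s_xy / math.sqrt(s_xx * s_yy)
t_from_r = math.sqrt(n - 2) * r / math.sqrt(1 - r ** 2)

# Statistic for the test of zero slope
b = s_xy / s_xx
s = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))
t_from_slope = b / (s / math.sqrt(s_xx))

print(t_from_r, t_from_slope)
```

Up to floating-point rounding the two printed values coincide, as the algebra above says they must.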
![Page 86: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/86.jpg)
Regression (in general)
![Page 87: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/87.jpg)
In many experiments we would have collected data on a single variable Y (the dependent variable) and on p (say) other variables X1, X2, X3, ..., Xp (the independent variables). One is interested in determining a model that describes the relationship between Y (the response (dependent) variable) and X1, X2, …, Xp (the predictor (independent) variables).
This model can be used for
– Prediction
– Controlling Y by manipulating X1, X2, …, Xp
![Page 88: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/88.jpg)
The Model:
is an equation of the form
$$Y = f(X_1, X_2, \ldots, X_p \mid \theta_1, \theta_2, \ldots, \theta_q) + \varepsilon$$
where $\theta_1, \theta_2, \ldots, \theta_q$ are unknown parameters of the function f and $\varepsilon$ is a random disturbance (usually assumed to have a normal distribution with mean 0 and standard deviation $\sigma$).
![Page 89: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/89.jpg)
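Examples:
1. Y = Blood Pressure, X = age
The model
$Y = \alpha + \beta X + \varepsilon$, thus $\theta_1 = \alpha$ and $\theta_2 = \beta$ are the unknown parameters of f.
This model is called:
the simple Linear Regression Model
[Figure: scatter plot of Y against X with the line $Y = \alpha + \beta X$.]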
![Page 90: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/90.jpg)
2. Y = average of five best times for running the 100m, X = the year
The model
$Y = \alpha e^{-\beta X} + \gamma + \varepsilon$, thus $\theta_1 = \alpha$, $\theta_2 = \beta$ and $\theta_3 = \gamma$.
This model is called:
the exponential Regression Model
[Figure: plot of Y (between 8 and 12.5) against year (1930 to 2010) with the curve $Y = \alpha e^{-\beta X} + \gamma$.]
![Page 91: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/91.jpg)
3. Y = gas mileage (mpg) of a car brand
X1 = engine size
X2 = horsepower
X3 = weight
The model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$$
This model is called:
the Multiple Linear Regression Model
![Page 92: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/92.jpg)
The Multiple Linear Regression Model
![Page 93: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/93.jpg)
In Multiple Linear Regression we assume the following model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$$
This model is called the Multiple Linear Regression Model.
Again $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are unknown parameters of the model and $\varepsilon$ is a random disturbance assumed to have a normal distribution with mean 0 and standard deviation $\sigma$.
![Page 94: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/94.jpg)
The importance of the Linear model
1. It is the simplest form of a model in which each independent variable has some effect on the dependent variable Y.
– When fitting models to data one tries to find the simplest form of a model that still adequately describes the relationship between the dependent variable and the independent variables.
– The linear model is sometimes the first model to be fitted and is only abandoned if it turns out to be inadequate.
![Page 95: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/95.jpg)
2. In many instances a linear model is the most appropriate model to describe the dependence relationship between the dependent variable and the independent variables.
– This will be true if the dependent variable increases at a constant rate as any of the independent variables is increased while holding the other independent variables constant.
![Page 96: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/96.jpg)
3. Many non-linear models can be linearized (put into the form of a linear model by appropriately transforming the dependent variable and/or any or all of the independent variables).
– This important fact (i.e. the fact that many non-linear models are linearizable) ensures the wide utility of the linear model.
![Page 97: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/97.jpg)
An Example
The following data come from an experiment investigating the source from which corn plants in various soils obtain their phosphorous.
– The concentration of inorganic phosphorous (X1) and the concentration of organic phosphorous (X2) were measured in the soil of n = 18 test plots.
– In addition the phosphorous content (Y) of corn grown in the soil was also measured. The data is displayed below:
![Page 98: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/98.jpg)
| Inorganic Phosphorous X1 | Organic Phosphorous X2 | Plant Available Phosphorous Y |
|---|---|---|
| 0.4 | 53 | 64 |
| 0.4 | 23 | 60 |
| 3.1 | 19 | 71 |
| 0.6 | 34 | 61 |
| 4.7 | 24 | 54 |
| 1.7 | 65 | 77 |
| 9.4 | 44 | 81 |
| 10.1 | 31 | 93 |
| 11.6 | 29 | 93 |
| 12.6 | 58 | 51 |
| 10.9 | 37 | 76 |
| 23.1 | 46 | 96 |
| 23.1 | 50 | 77 |
| 21.6 | 44 | 93 |
| 23.1 | 56 | 95 |
| 1.9 | 36 | 54 |
| 26.8 | 58 | 168 |
| 29.9 | 51 | 99 |
![Page 99: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/99.jpg)
Coefficients
Intercept 56.2510241 ($\hat{\beta}_0$)
X1 1.78977412 ($\hat{\beta}_1$)
X2 0.08664925 ($\hat{\beta}_2$)
Equation:
Y = 56.2510241 + 1.78977412 X1 + 0.08664925 X2
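For two predictors the least squares estimates can be written in closed form using sums of squares and cross products about the means. The Python sketch below applies this to the 18 test plots; it is one possible way of obtaining coefficients like those above (the slides do not say which software produced them), and the function name `fit_two_predictors` is hypothetical.

```python
# Data for the n = 18 test plots
x1 = [0.4, 0.4, 3.1, 0.6, 4.7, 1.7, 9.4, 10.1, 11.6,
      12.6, 10.9, 23.1, 23.1, 21.6, 23.1, 1.9, 26.8, 29.9]
x2 = [53, 23, 19, 34, 24, 65, 44, 31, 29,
      58, 37, 46, 50, 44, 56, 36, 58, 51]
y = [64, 60, 71, 61, 54, 77, 81, 93, 93,
     51, 76, 96, 77, 93, 95, 54, 168, 99]


def fit_two_predictors(x1, x2, y):
    """Least squares estimates (b0, b1, b2) for Y = b0 + b1*X1 + b2*X2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Corrected sums of squares and cross products
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    # Solve the two normal equations for the slopes by Cramer's rule
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


b0, b1, b2 = fit_two_predictors(x1, x2, y)
print(round(b0, 3), round(b1, 4), round(b2, 4))
```

For more than two predictors the same normal equations are usually solved with a general linear-algebra routine rather than by hand-written formulas.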
![Page 100: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/100.jpg)
![Page 101: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/101.jpg)
![Page 103: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/103.jpg)
Summary of the Statistics used in
Multiple Regression
![Page 104: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/104.jpg)
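The Least Squares Estimates:
$\hat\beta_0, \hat\beta_1, \hat\beta_2, \ldots, \hat\beta_p$
- the values that minimize
$$RSS = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n} \left(y_i - \hat\beta_0 - \hat\beta_1 x_{1i} - \hat\beta_2 x_{2i} - \cdots - \hat\beta_p x_{pi}\right)^2$$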
![Page 105: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/105.jpg)
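The Analysis of Variance Table Entries
a) Adjusted Total Sum of Squares (SSTotal)
$$SS_{\text{Total}} = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2, \qquad \text{d.f.} = n - 1$$
b) Residual Sum of Squares (SSError)
$$RSS = SS_{\text{Error}} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2, \qquad \text{d.f.} = n - p - 1$$
c) Regression Sum of Squares (SSReg)
$$SS_{\text{Reg}} = SS(\beta_1, \beta_2, \ldots, \beta_p) = \sum_{i=1}^{n} \left(\hat{y}_i - \bar{y}\right)^2, \qquad \text{d.f.} = p$$
Note:
$$\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 = \sum_{i=1}^{n} \left(\hat{y}_i - \bar{y}\right)^2 + \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$
i.e. SSTotal = SSReg + SSError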
![Page 106: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/106.jpg)
The Analysis of Variance Table
| Source | Sum of Squares | d.f. | Mean Square | F |
|---|---|---|---|---|
| Regression | SSReg | p | MSReg = SSReg/p | MSReg/s² |
| Error | SSError | n - p - 1 | MSError = s² = SSError/(n - p - 1) | |
| Total | SSTotal | n - 1 | | |
![Page 107: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/107.jpg)
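Uses:
1. To estimate $\sigma^2$ (the error variance).
- Use $s^2 = MS_{\text{Error}}$ to estimate $\sigma^2$.
2. To test the Hypothesis
$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$.
Use the test statistic
$$F = \frac{MS_{\text{Reg}}}{MS_{\text{Error}}} = \frac{MS_{\text{Reg}}}{s^2} = \frac{SS_{\text{Reg}}/p}{SS_{\text{Error}}/(n - p - 1)}$$
- Reject $H_0$ if $F > F_{\alpha}(p, n - p - 1)$.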
![Page 108: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/108.jpg)
3. To compute other statistics that are useful in describing the relationship between Y (the dependent variable) and X1, X2, ... , Xp (the independent variables).
a) R² = the coefficient of determination
$$R^2 = \frac{SS_{\text{Reg}}}{SS_{\text{Total}}} = \frac{\sum_{i=1}^{n} \left(\hat{y}_i - \bar{y}\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}$$
= the proportion of variance in Y explained by X1, X2, ... , Xp
1 - R² = the proportion of variance in Y that is left unexplained by X1, X2, ... , Xp
= SSError/SSTotal.
![Page 109: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/109.jpg)
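b) $R_a^2$ = "R² adjusted" for degrees of freedom
= 1 - [the proportion of variance in Y that is left unexplained by X1, X2, ... , Xp, adjusted for d.f.]
$$R_a^2 = 1 - \frac{MS_{\text{Error}}}{MS_{\text{Total}}} = 1 - \frac{SS_{\text{Error}}/(n - p - 1)}{SS_{\text{Total}}/(n - 1)} = 1 - \frac{n - 1}{n - p - 1}\,\frac{SS_{\text{Error}}}{SS_{\text{Total}}} = 1 - \frac{n - 1}{n - p - 1}\left(1 - R^2\right)$$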
![Page 110: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/110.jpg)
c) $R = \sqrt{R^2} = \sqrt{SS_{\text{Reg}}/SS_{\text{Total}}}$ = the Multiple correlation coefficient of Y with X1, X2, ... , Xp
= the maximum correlation between Y and a linear combination of X1, X2, ... , Xp
Comment: The statistics F, R², $R_a^2$ and R are equivalent statistics: for fixed n and p, each is an increasing function of the others.
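The summary statistics above can be collected in one small routine. The sketch below uses NumPy (an assumption, not the package used in the slides); the function name `regression_summary` and the demonstration data are made up for illustration.

```python
import numpy as np

def regression_summary(X, y):
    """ANOVA quantities for a least squares fit with an intercept.

    X is an (n, p) array of independent variables (no column of ones),
    y is a length-n array of the dependent variable.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ beta

    ss_total = np.sum((y - y.mean()) ** 2)      # d.f. n - 1
    ss_error = np.sum((y - y_hat) ** 2)         # d.f. n - p - 1
    ss_reg = np.sum((y_hat - y.mean()) ** 2)    # d.f. p

    ms_reg = ss_reg / p
    s2 = ss_error / (n - p - 1)                 # MSError, estimates sigma^2
    r2 = ss_reg / ss_total
    return {
        "SSTotal": ss_total, "SSReg": ss_reg, "SSError": ss_error,
        "F": ms_reg / s2, "s2": s2,
        "R2": r2,
        "R2_adj": 1 - (n - 1) / (n - p - 1) * (1 - r2),
        "R": np.sqrt(r2),
    }

# Small made-up data set, for illustration only.
X_demo = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]])
y_demo = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])
summary = regression_summary(X_demo, y_demo)
print(summary)
```

Because the fit includes an intercept, the returned values satisfy SSTotal = SSReg + SSError up to floating-point error.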
![Page 111: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/111.jpg)
Using Statistical Packages
To perform Multiple Regression
![Page 112: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/112.jpg)
Using SPSS
Note: The use of another statistical package such as Minitab is similar to using SPSS
![Page 113: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/113.jpg)
After starting the SPSS program, the following dialogue box appears:
![Page 114: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/114.jpg)
If you select "Opening an existing file" and press OK, the following dialogue box appears:
![Page 115: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/115.jpg)
The following dialogue box appears:
![Page 116: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/116.jpg)
If the variable names are in the file, ask it to read the names. If you do not specify the Range, the program will identify the Range.
Once you click OK, two windows will appear:
![Page 117: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/117.jpg)
One that will contain the output:
![Page 118: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/118.jpg)
The other containing the data:
![Page 119: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/119.jpg)
To perform any statistical Analysis select the Analyze menu:
![Page 120: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/120.jpg)
Then select Regression and Linear.
![Page 121: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/121.jpg)
The following Regression dialogue box appears
![Page 122: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/122.jpg)
Select the Dependent variable Y.
![Page 123: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/123.jpg)
Select the Independent variables X1, X2, etc.
![Page 124: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/124.jpg)
Select Enter as the Method.
![Page 125: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/125.jpg)
With the Enter method, all variables will be put into the equation.
There are also several other methods that can be used:
1. Forward selection
2. Backward Elimination
3. Stepwise Regression
![Page 126: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/126.jpg)
![Page 127: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/127.jpg)
Forward selection
1. This method starts with no variables in the equation
2. Carries out statistical tests on variables not in the equation to see which have a significant effect on the dependent variable.
3. Adds the most significant.
4. Continues until all variables not in the equation have no significant effect on the dependent variable.
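The four steps above can be sketched in a few lines. This is an illustrative sketch only, not SPSS's implementation: it uses NumPy, a partial F statistic as the significance test, and a fixed "F to enter" threshold of 4.0, all of which are assumptions. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def rss(X_cols, y):
    """Residual sum of squares of a least squares fit with intercept."""
    design = np.column_stack([np.ones(len(y))] + X_cols)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta) ** 2))

def forward_selection(X, y, f_to_enter=4.0):
    """Return column indices of X chosen by forward selection.

    f_to_enter is an assumed threshold for the partial F statistic;
    statistical packages usually phrase this as a p-value instead.
    """
    n, k = X.shape
    selected = []
    current_rss = rss([], y)                 # step 1: intercept-only model
    while len(selected) < k:
        best = None
        for j in range(k):                   # step 2: test each variable not yet in
            if j in selected:
                continue
            cols = [X[:, i] for i in selected + [j]]
            new_rss = rss(cols, y)
            df_error = n - len(cols) - 1
            # Partial F for adding variable j to the current equation.
            f_stat = (current_rss - new_rss) / (new_rss / df_error)
            if best is None or f_stat > best[1]:
                best = (j, f_stat, new_rss)
        if best[1] < f_to_enter:             # step 4: nothing significant is left
            break
        selected.append(best[0])             # step 3: add the most significant
        current_rss = best[2]
    return selected

# Synthetic illustration: y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(size=200)
chosen = forward_selection(X_demo, y_demo)
print(chosen)
```

Backward elimination and stepwise regression follow the same pattern, with a removal test applied to variables already in the equation.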
![Page 128: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/128.jpg)
Backward Elimination
1. This method starts with all variables in the equation
2. Carries out statistical tests on variables in the equation to see which have no significant effect on the dependent variable.
3. Deletes the least significant.
4. Continues until all variables in the equation have a significant effect on the dependent variable.
![Page 129: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/129.jpg)
Stepwise Regression (uses both forward and backward techniques)
1. This method starts with no variables in the equation
2. Carries out statistical tests on variables not in the equation to see which have a significant effect on the dependent variable.
3. It then adds the most significant.
4. After a variable is added it checks to see if any variables added earlier can now be deleted.
5. Continues until all variables not in the equation have no significant effect on the dependent variable.
![Page 130: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/130.jpg)
All of these methods are procedures for attempting to find the best equation.
The best equation is the one that is the simplest (not containing variables that are not important) yet adequate (containing the variables that are important).
![Page 131: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/131.jpg)
Once the dependent variable, the independent variables and the Method have been selected, press OK and the analysis will be performed.
![Page 132: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/132.jpg)
The output will contain the following table
Model Summary
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .822 (a) | .676 | .673 | 4.46 |

a. Predictors: (Constant), WEIGHT, HORSE, ENGINE
R² and adjusted R² measure the proportion of variance in Y that is explained by X1, X2, X3, etc. (67.6% and 67.3%).
R is the Multiple correlation coefficient (the maximum correlation between Y and a linear combination of X1, X2, X3, etc.).
![Page 133: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/133.jpg)
The next table is the Analysis of Variance Table
The F test tests whether the regression coefficients of the predictor variables are all zero, i.e. whether none of the independent variables X1, X2, X3, etc. has any effect on Y.
ANOVA (b)
| Model 1 | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 16098.158 | 3 | 5366.053 | 269.664 | .000 (a) |
| Residual | 7720.836 | 388 | 19.899 | | |
| Total | 23818.993 | 391 | | | |

a. Predictors: (Constant), WEIGHT, HORSE, ENGINE
b. Dependent Variable: MPG
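As a check, the Model Summary entries can be re-derived from the sums of squares and degrees of freedom in the ANOVA table. The sketch below uses plain Python; the variable names are illustrative, and n = 392 is inferred from the total d.f. of 391.

```python
import math

ss_reg, ss_error = 16098.158, 7720.836   # from the ANOVA table
p, n = 3, 392                            # regression d.f.; total d.f. + 1
ss_total = ss_reg + ss_error

ms_reg = ss_reg / p                      # Mean Square, Regression
s2 = ss_error / (n - p - 1)              # Mean Square, Residual
f_stat = ms_reg / s2
r2 = ss_reg / ss_total
r2_adj = 1 - (n - 1) / (n - p - 1) * (1 - r2)

print(round(ms_reg, 3), round(s2, 3), round(f_stat, 2))
print(round(r2, 3), round(r2_adj, 3),
      round(math.sqrt(r2), 3), round(math.sqrt(s2), 2))
```

The results should match the reported F, R Square, Adjusted R Square, R and Std. Error of the Estimate up to rounding.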
![Page 134: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/134.jpg)
The final table in the output gives the estimates of the regression coefficients, their standard errors and the t test for testing whether they are zero.
Note: Engine size has no significant effect on Mileage.
Coefficients (a)
| Model 1 | B (Unstandardized) | Std. Error (Unstandardized) | Beta (Standardized) | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 44.015 | 1.272 | | 34.597 | .000 |
| ENGINE | -5.53E-03 | .007 | -.074 | -.786 | .432 |
| HORSE | -5.56E-02 | .013 | -.273 | -4.153 | .000 |
| WEIGHT | -4.62E-03 | .001 | -.504 | -6.186 | .000 |

a. Dependent Variable: MPG
![Page 135: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/135.jpg)
The estimated equation from the Coefficients table above is:
$$\text{Mileage} = 44.0 - \frac{5.53}{1000}\,\text{Engine} - \frac{5.56}{100}\,\text{Horse} - \frac{4.62}{1000}\,\text{Weight} + \text{Error}$$
![Page 136: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/136.jpg)
Note the equation is:
$$\text{Mileage} = 44.0 - \frac{5.53}{1000}\,\text{Engine} - \frac{5.56}{100}\,\text{Horse} - \frac{4.62}{1000}\,\text{Weight} + \text{Error}$$
Mileage decreases with:
1. increases in Engine Size (not significant, p = 0.432)
2. increases in Horsepower (significant, p < 0.001)
3. increases in Weight (significant, p < 0.001)
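To see what the coefficients mean in practice, the fitted equation can be evaluated directly. This is a sketch only: the function name and the example car are hypothetical, and the slides do not state the units of Engine, Horse or Weight.

```python
def predicted_mileage(engine, horse, weight):
    """Fitted MPG using the B column of the Coefficients table."""
    return 44.015 - 5.53e-3 * engine - 5.56e-2 * horse - 4.62e-3 * weight

# Hypothetical car; input values chosen only for illustration.
base = predicted_mileage(engine=200, horse=100, weight=3000)
heavier = predicted_mileage(engine=200, horse=100, weight=3100)
print(base, heavier - base)
```

Holding Engine and Horse fixed, each additional 100 units of Weight lowers the predicted mileage by 100 × 0.00462 = 0.462.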
![Page 137: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/137.jpg)
Logistic regression
![Page 138: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/138.jpg)
Recall the simple linear regression model:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where we are trying to predict a continuous dependent variable y from a continuous independent variable x.
This model can be extended to the Multiple linear regression model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$
Here we are trying to predict a continuous dependent variable y from several continuous independent variables x1, x2, … , xp.
![Page 139: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/139.jpg)
Now suppose the dependent variable y is binary.
It takes on two values: “Success” (1) or “Failure” (0).
This is the situation in which Logistic Regression is used.
We are interested in predicting y from a continuous independent variable x.
![Page 140: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/140.jpg)
Example
We are interested in how the success (y) of a new antibiotic cream in curing “acne problems” depends on the amount (x) that is applied daily.
The values of y are 1 (Success) or 0 (Failure).
The values of x range over a continuum.
![Page 141: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/141.jpg)
The Logistic Regression Model
Let p denote P[y = 1] = P[Success].
This quantity will increase with the value of x.
The ratio $\frac{p}{1-p}$ is called the odds ratio (more precisely, the odds of success).
This quantity will also increase with the value of x, ranging from zero to infinity.
The quantity $\ln\left(\frac{p}{1-p}\right)$ is called the log odds ratio.
![Page 142: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/142.jpg)
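Example: odds ratio, log odds ratio
Suppose a die is rolled: Success = “roll a six”, p = 1/6
The odds ratio:
$$\frac{p}{1-p} = \frac{1/6}{1 - 1/6} = \frac{1/6}{5/6} = \frac{1}{5} = 0.2$$
The log odds ratio:
$$\ln\left(\frac{p}{1-p}\right) = \ln\left(\frac{1}{5}\right) = \ln(0.2) = -1.60944$$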
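The die example is easy to verify numerically. A minimal sketch in plain Python follows; the helper names `odds` and `log_odds` are illustrative.

```python
import math

def odds(p):
    """p / (1 - p) for a probability 0 < p < 1."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds."""
    return math.log(odds(p))

p_six = 1 / 6                 # probability of rolling a six
print(odds(p_six), log_odds(p_six))
```

Note that the log odds is negative whenever p < 0.5 and positive whenever p > 0.5.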
![Page 143: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/143.jpg)
The Logistic Regression Model
Assumes the log odds ratio is linearly related to x, i.e.:
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$
In terms of the odds ratio:
$$\frac{p}{1-p} = e^{\beta_0 + \beta_1 x}$$
![Page 144: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/144.jpg)
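The Logistic Regression Model
$$\frac{p}{1-p} = e^{\beta_0 + \beta_1 x}$$
Solving for p in terms of x:
$$p = e^{\beta_0 + \beta_1 x}\,(1 - p)$$
or
$$p + p\,e^{\beta_0 + \beta_1 x} = e^{\beta_0 + \beta_1 x}$$
$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$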
![Page 145: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/145.jpg)
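Interpretation of the parameter $\beta_0$
(determines the intercept)
[Figure: plot of p against x (x from 0 to 10, p from 0 to 1); at x = 0 the curve meets the vertical axis at $p = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$]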
![Page 146: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/146.jpg)
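Interpretation of the parameter $\beta_1$
(determines when p is 0.50 (along with $\beta_0$))
[Figure: plot of p against x (x from 0 to 10, p from 0 to 1), marking the point where p = 0.50]
$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + 1} = \frac{1}{2}$$
when $\beta_0 + \beta_1 x = 0$, or $x = -\dfrac{\beta_0}{\beta_1}$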
![Page 147: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/147.jpg)
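Also
$$\frac{dp}{dx} = \frac{d}{dx}\left[\frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}\right] = \frac{e^{\beta_0 + \beta_1 x}\,\beta_1\left(1 + e^{\beta_0 + \beta_1 x}\right) - e^{\beta_0 + \beta_1 x}\,\beta_1\,e^{\beta_0 + \beta_1 x}}{\left(1 + e^{\beta_0 + \beta_1 x}\right)^2}$$
$$= \frac{\beta_1\,e^{\beta_0 + \beta_1 x}}{\left(1 + e^{\beta_0 + \beta_1 x}\right)^2} = \frac{\beta_1}{4} \quad \text{when } x = -\frac{\beta_0}{\beta_1}$$
$\dfrac{\beta_1}{4}$ is the rate of increase in p with respect to x when p = 0.50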
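The two results above (p = 0.50 at x = -β0/β1, and slope β1/4 at that point) can be checked numerically. The sketch below uses plain Python with a central finite difference; the parameter values are arbitrary and chosen only for illustration.

```python
import math

def logistic(x, b0, b1):
    """p as a function of x under the logistic regression model."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

b0, b1 = -3.0, 1.5          # arbitrary illustrative values
x_half = -b0 / b1           # the x at which p = 0.50

# Central finite-difference approximation to dp/dx at x_half.
h = 1e-6
slope = (logistic(x_half + h, b0, b1) - logistic(x_half - h, b0, b1)) / (2 * h)
print(logistic(x_half, b0, b1), slope, b1 / 4)
```

The numerical slope should agree with β1/4 to within the accuracy of the finite difference.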
![Page 148: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/148.jpg)
Interpretation of the parameter $\beta_1$
(determines slope when p is 0.50)
[Figure: plot of p against x (x from 0 to 10, p from 0 to 1); the slope of the curve at p = 0.50 is $\beta_1/4$]
![Page 149: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/149.jpg)
The data
For each case the data will consist of
1. a value for x, the continuous independent variable
2. a value for y (1 or 0) (Success or Failure)
There is a total of n = 250 cases.
![Page 150: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/150.jpg)
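Cases 1 to 20:

| case | x | y |
|---|---|---|
| 1 | 0.8 | 0 |
| 2 | 2.3 | 1 |
| 3 | 2.5 | 0 |
| 4 | 2.8 | 1 |
| 5 | 3.5 | 1 |
| 6 | 4.4 | 1 |
| 7 | 0.5 | 0 |
| 8 | 4.5 | 1 |
| 9 | 4.4 | 1 |
| 10 | 0.9 | 0 |
| 11 | 3.3 | 1 |
| 12 | 1.1 | 0 |
| 13 | 2.5 | 1 |
| 14 | 0.3 | 1 |
| 15 | 4.5 | 1 |
| 16 | 1.8 | 0 |
| 17 | 2.4 | 1 |
| 18 | 1.6 | 0 |
| 19 | 1.9 | 1 |
| 20 | 4.6 | 1 |

Cases 230 to 250:

| case | x | y |
|---|---|---|
| 230 | 4.7 | 1 |
| 231 | 0.3 | 0 |
| 232 | 1.4 | 0 |
| 233 | 4.5 | 1 |
| 234 | 1.4 | 1 |
| 235 | 4.5 | 1 |
| 236 | 3.9 | 0 |
| 237 | 0.0 | 0 |
| 238 | 4.3 | 1 |
| 239 | 1.0 | 0 |
| 240 | 3.9 | 1 |
| 241 | 1.1 | 0 |
| 242 | 3.4 | 1 |
| 243 | 0.6 | 0 |
| 244 | 1.6 | 0 |
| 245 | 3.9 | 0 |
| 246 | 0.2 | 0 |
| 247 | 2.5 | 0 |
| 248 | 4.1 | 1 |
| 249 | 4.2 | 1 |
| 250 | 4.9 | 1 |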
![Page 151: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/151.jpg)
Estimation of the parameters
The parameters are estimated by Maximum Likelihood estimation. The estimates have no closed-form solution and must be computed iteratively, so in practice a statistical package such as SPSS is used.
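To give an idea of what a package does internally, here is a minimal Newton-Raphson sketch of maximum likelihood for the one-variable logistic model, written with NumPy. It is an assumption that this resembles the package's algorithm; SPSS's actual routine is not described in the slides. Only the first 20 of the 250 cases are listed in full, so the sketch uses those 20 cases and its estimates will differ from the SPSS results for the full data set.

```python
import numpy as np

# First 20 of the 250 cases from the data listing.
x = np.array([0.8, 2.3, 2.5, 2.8, 3.5, 4.4, 0.5, 4.5, 4.4, 0.9,
              3.3, 1.1, 2.5, 0.3, 4.5, 1.8, 2.4, 1.6, 1.9, 4.6])
y = np.array([0, 1, 0, 1, 1, 1, 0, 1, 1, 0,
              1, 0, 1, 1, 1, 0, 1, 0, 1, 1], dtype=float)

X = np.column_stack([np.ones_like(x), x])   # columns: intercept, x
beta = np.zeros(2)                          # start at beta0 = beta1 = 0

for _ in range(25):
    p = 1 / (1 + np.exp(-(X @ beta)))       # fitted P[y = 1] for each case
    score = X.T @ (y - p)                   # gradient of the log-likelihood
    info = X.T @ (X * (p * (1 - p))[:, None])   # information matrix
    step = np.linalg.solve(info, score)     # Newton-Raphson update
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:        # stop once the update is negligible
        break

b0_hat, b1_hat = beta
print(b0_hat, b1_hat)
```

At the maximum the gradient of the log-likelihood is zero, which is what the iteration drives towards.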
![Page 152: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/152.jpg)
Using SPSS to perform Logistic regression
Open the data file:
![Page 153: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/153.jpg)
Choose from the menu:
Analyze -> Regression -> Binary Logistic
![Page 154: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/154.jpg)
The following dialogue box appears
Select the dependent variable (y) and the independent variable (x) (covariate).
Press OK.
![Page 155: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/155.jpg)
Here is the output
The Estimates and their S.E.
![Page 156: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/156.jpg)
The parameter Estimates
| | Estimate | SE |
|---|---|---|
| X | 1.0309 | 0.1334 |
| Constant | -2.0475 | 0.332 |

$\hat\beta_1 = 1.0309$, $\hat\beta_0 = -2.0475$
![Page 157: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/157.jpg)
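Interpretation of the parameter $\beta_0$
(determines the intercept)
$$\text{intercept} = \frac{e^{\beta_0}}{1 + e^{\beta_0}} = \frac{e^{-2.0475}}{1 + e^{-2.0475}} = 0.1143$$
Interpretation of the parameter $\beta_1$
(determines when p is 0.50 (along with $\beta_0$))
$$x = -\frac{\beta_0}{\beta_1} = \frac{2.0475}{1.0309} = 1.986$$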
![Page 158: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/158.jpg)
Another interpretation of the parameter $\beta_1$
$\dfrac{\beta_1}{4}$ is the rate of increase in p with respect to x when p = 0.50
$$\frac{\beta_1}{4} = \frac{1.0309}{4} = 0.258$$
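The three interpretive quantities can be recomputed from the two estimates. A short sketch in plain Python, with illustrative variable names:

```python
import math

b0, b1 = -2.0475, 1.0309     # SPSS estimates (Constant and X)

intercept = math.exp(b0) / (1 + math.exp(b0))   # p when x = 0
x_half = -b0 / b1                               # x at which p = 0.50
slope_half = b1 / 4                             # dp/dx when p = 0.50

print(round(intercept, 4), round(x_half, 3), round(slope_half, 3))
# prints 0.1143 1.986 0.258
```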
![Page 159: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/159.jpg)
The Logistic Regression Model
The dependent variable y is binary.
It takes on two values: “Success” (1) or “Failure” (0).
We are interested in predicting y from a continuous independent variable x.
![Page 160: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/160.jpg)
The Logistic Regression Model
Let p denote P[y = 1] = P[Success].
This quantity will increase with the value of x.
The ratio $\frac{p}{1-p}$ is called the odds ratio (more precisely, the odds of success).
This quantity will also increase with the value of x, ranging from zero to infinity.
The quantity $\ln\left(\frac{p}{1-p}\right)$ is called the log odds ratio.
![Page 161: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/161.jpg)
The Logistic Regression Model
Assumes the log odds ratio is linearly related to x, i.e.:
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$
In terms of the odds ratio:
$$\frac{p}{1-p} = e^{\beta_0 + \beta_1 x}$$
![Page 162: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/162.jpg)
The Logistic Regression Model
In terms of p:
$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$
![Page 163: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/163.jpg)
The graph of p vs x
[Figure: plot of p against x (x from 0 to 10, p from 0 to 1) for the curve $p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$]
![Page 164: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/164.jpg)
The Multiple Logistic Regression model
![Page 165: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/165.jpg)
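Here we attempt to predict the outcome of a binary response variable Y from several independent variables X1, X2, … etc.
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$
or
$$p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$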
![Page 166: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/166.jpg)
Multiple Logistic Regression: an example
In this example we are interested in determining the risk of infants (who were born prematurely) of developing BPD (bronchopulmonary dysplasia)
More specifically we are interested in developing a predictive model which will determine the probability of developing BPD from
X1 = gestational Age and X2 = Birthweight
![Page 167: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/167.jpg)
For n = 223 infants in a prenatal ward the following measurements were determined:
1. X1 = gestational Age (weeks),
2. X2 = Birth weight (grams) and
3. Y = presence of BPD
![Page 168: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/168.jpg)
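The data

| case | Gestational Age | Birthweight | presence of BPD |
|---|---|---|---|
| 1 | 28.6 | 1119 | 1 |
| 2 | 31.5 | 1222 | 0 |
| 3 | 30.3 | 1311 | 1 |
| 4 | 28.9 | 1082 | 0 |
| 5 | 30.3 | 1269 | 0 |
| 6 | 30.5 | 1289 | 0 |
| 7 | 28.5 | 1147 | 0 |
| 8 | 27.9 | 1136 | 1 |
| 9 | 30 | 972 | 0 |
| 10 | 31 | 1252 | 0 |
| 11 | 27.4 | 818 | 0 |
| 12 | 29.4 | 1275 | 0 |
| 13 | 30.8 | 1231 | 0 |
| 14 | 30.4 | 1112 | 0 |
| 15 | 31.1 | 1353 | 1 |
| 16 | 26.7 | 1067 | 1 |
| 17 | 27.4 | 846 | 1 |
| 18 | 28 | 1013 | 0 |
| 19 | 29.3 | 1055 | 0 |
| 20 | 30.4 | 1226 | 0 |
| 21 | 30.2 | 1237 | 0 |
| 22 | 30.2 | 1287 | 0 |
| 23 | 30.1 | 1215 | 0 |
| 24 | 27 | 929 | 1 |
| 25 | 30.3 | 1159 | 0 |
| 26 | 27.4 | 1046 | 1 |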
![Page 169: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/169.jpg)
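The results
Variables in the Equation
| Step 1 (a) | B | S.E. | Wald | df | Sig. | Exp(B) |
|---|---|---|---|---|---|---|
| Birthweight | -.003 | .001 | 4.885 | 1 | .027 | .998 |
| GestationalAge | -.505 | .133 | 14.458 | 1 | .000 | .604 |
| Constant | 16.858 | 3.642 | 21.422 | 1 | .000 | 2.1E+07 |

a. Variable(s) entered on step 1: Birthweight, GestationalAge.
$$\ln\left(\frac{p}{1-p}\right) = 16.858 - .003\,BW - .505\,GA$$
$$\frac{p}{1-p} = e^{16.858 - .003\,BW - .505\,GA}$$
$$p = \frac{e^{16.858 - .003\,BW - .505\,GA}}{1 + e^{16.858 - .003\,BW - .505\,GA}}$$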
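The fitted equation can be turned into a small risk calculator. This is a sketch in plain Python; the function name `bpd_risk` is hypothetical, and the coefficients are used exactly as rounded in the SPSS table (the birth weight coefficient is shown to one significant digit only), so the resulting probabilities are approximate.

```python
import math

def bpd_risk(birth_weight, gest_age):
    """Estimated P[BPD] from the fitted equation.

    birth_weight in grams, gest_age in weeks; coefficients as rounded
    in the Variables in the Equation table.
    """
    z = 16.858 - 0.003 * birth_weight - 0.505 * gest_age
    return math.exp(z) / (1 + math.exp(z))

# Case 1 in the data listing: gestational age 28.6 weeks, birth weight 1119 g.
risk_case1 = bpd_risk(1119, 28.6)
print(round(risk_case1, 3))
```

Both coefficients are negative, so the estimated risk falls as either gestational age or birth weight increases.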
![Page 170: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/170.jpg)
Graph: Showing Risk of BPD vs GA and BrthWt
[Figure: risk of BPD (vertical axis, 0 to 1) against birth weight (horizontal axis, 700 to 1700), with one curve for each of GA = 27, 28, 29, 30, 31 and 32]
![Page 171: Linear Regression](https://reader037.vdocuments.us/reader037/viewer/2022110102/56813e11550346895da7f160/html5/thumbnails/171.jpg)
Non-Parametric Statistics