regression04: 1 an introduction to regression and correlation
Post on 21-Dec-2015
Regression04: 1
An Introduction to
REGRESSION AND CORRELATION
Regression04: 2
How do we measure the association of 2 continuous, numeric scale variables?
Example:
Observations are available on
• a sample of 30 individuals
• systolic blood pressure (SBP)
• age
We are interested in
• the relationship between SBP and age
• for these patients (descriptive)
• and for the population which they represent (inferential).
Regression04: 3
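Data on 30 individuals:

| Individual (i) | SBP (Y) | AGE (x) | Individual (i) | SBP (Y) | AGE (x) |
|---|---|---|---|---|---|
| 1 | 144 | 39 | 16 | 130 | 48 |
| 2 | 220 | 47 | 17 | 135 | 45 |
| 3 | 138 | 45 | 18 | 114 | 17 |
| 4 | 145 | 47 | 19 | 116 | 20 |
| 5 | 162 | 65 | 20 | 124 | 19 |
| 6 | 142 | 46 | 21 | 136 | 36 |
| 7 | 170 | 67 | 22 | 142 | 50 |
| 8 | 124 | 42 | 23 | 120 | 39 |
| 9 | 158 | 67 | 24 | 120 | 21 |
| 10 | 154 | 56 | 25 | 160 | 44 |
| 11 | 162 | 64 | 26 | 158 | 53 |
| 12 | 150 | 56 | 27 | 144 | 63 |
| 13 | 140 | 59 | 28 | 130 | 29 |
| 14 | 110 | 34 | 29 | 125 | 25 |
| 15 | 128 | 42 | 30 | 175 | 69 |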
Regression04: 4
Note:
We have 30 pairs of observations which we can denote as:
$(x_1, y_1) = (39, 144)$
$(x_2, y_2) = (47, 220)$
…
$(x_{30}, y_{30}) = (69, 175)$
where
• $x_i$ refers to age for the $i$th subject
• $y_i$ refers to SBP for the $i$th subject
Regression04: 5
• These data pairs may be considered as points in two dimensional space, so that we may plot them on a graph.
[Figure: scatter diagram of age and systolic blood pressure. x-axis: AGE in years; y-axis: SBP (mm Hg).]
Note:
• age and SBP seem to be related:
• Younger subjects tend to have lower SBP
• older subjects higher SBP.
Regression04: 6
How can this relationship be measured?
[Figure: three scatter plots of y against x]
• No relationship between x and y: the spread is even in all directions.
• Linear relationship: a line indicates the main direction of the spread of points.
• Non-linear relationship between x and y: a curve best describes the relationship.
Regression04: 7
Math Review: Equation for a Line
$$y = \beta_0 + \beta_1 x$$
• $\beta_0$ = y-intercept = value of y when x = 0
• $\beta_1$ = slope = $\Delta y / \Delta x$ = (change in y)/(change in x)
Regression04: 8
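[Figure: lines $y = \beta_0 + \beta_1 x$ with positive, zero, and negative slope]
• Slope > 0: positive slope (as x increases, y increases)
• Slope = 0: horizontal line (y does not change with x)
• Slope < 0: negative slope (as x increases, y decreases)
$\beta_1$ = "slope" = $\Delta y / \Delta x$ = (change in y) / (change in x)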
Regression04: 9
Now, given a set of data, how can we get the line that best fits or best represents the data?
When it is appropriate to predict one variable (y) from another variable (x), that is, when there is some directionality in the relationship:
• We commonly use a technique known as Least Squares Regression to estimate
the intercept: $\beta_0$
the slope: $\beta_1$
• We denote the estimates $\hat\beta_0$ and $\hat\beta_1$, respectively
(referred to as beta-nought-hat and beta-one-hat)
Regression04: 10
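[Figure: fitted line $\hat{y} = \hat\beta_0 + \hat\beta_1 x$, with vertical distances $d_i$ from the data points to the line]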
We are looking for that line which minimizes the vertical distances to the data points.
For each observed value $x_i$, we have
• an observed $y_i$, and the
• "predicted" value $\hat{y}_i$ on the line: $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$
The vertical distances are: $d_i = y_i - \hat{y}_i$.
Regression04: 11
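[Figure: an observed point $(x_i, y_i)$ and the corresponding predicted point $(x_i, \hat{y}_i)$ on the fitted line $\hat{y} = \hat\beta_0 + \hat\beta_1 x$]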
That is, we have:
$x_i$ = observed x for the $i$th subject
$y_i$ = observed y for the $i$th subject
$\hat{y}_i$ = predicted y for the $i$th subject
Regression04: 12
The squared distances are:
$d_i^2 = (y_i - \hat{y}_i)^2$
– and the sum of squared deviations from the line
– (sound familiar?) is
$\sum d_i^2 = \sum (y_i - \hat{y}_i)^2$
We want the line such that
$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$$
is minimized.
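To make the criterion concrete, here is a minimal Python sketch that evaluates the sum of squared vertical distances for a few candidate lines. The four data points and the helper name `sum_sq_dist` are made up for illustration; they are not part of the SBP example.

```python
# Hypothetical toy data (NOT the SBP data from the slides).
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

def sum_sq_dist(b0, b1, xs, ys):
    """Sum of squared vertical distances d_i = y_i - (b0 + b1 * x_i)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Compare a few candidate lines y = b0 + b1 * x.
candidates = {
    "y = 1 + 1.5x": sum_sq_dist(1.0, 1.5, xs, ys),
    "y = 0 + 2.0x": sum_sq_dist(0.0, 2.0, xs, ys),
    "y = 0 + 1.9x": sum_sq_dist(0.0, 1.9, xs, ys),  # least squares line for this toy data
}
best = min(candidates, key=candidates.get)
for label, value in candidates.items():
    print(f"{label}: {value:.2f}")
print("smallest sum of squares:", best)
```

Of the three candidates, the least squares line has the smallest sum of squares; any other choice of intercept and slope would give a larger value for these points.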
Regression04: 13
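The unbiased estimates of $\beta_0$ and $\beta_1$, which are
• the least squares estimates and
• the minimum variance estimates,
are:
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$$
These are obtained by using calculus to minimize the sum of squares in the previous equations.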
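The calculus step can be sketched as follows. This is the standard textbook derivation, written out as a reading aid rather than taken from the slides; $S(\beta_0, \beta_1)$ is a name introduced here for the sum of squares being minimized.

```latex
\begin{align*}
S(\beta_0,\beta_1) &= \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \\
\frac{\partial S}{\partial \beta_0} &= -2\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
  \;\Rightarrow\; \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \\
\frac{\partial S}{\partial \beta_1} &= -2\sum_{i=1}^{n} x_i\,(y_i - \beta_0 - \beta_1 x_i) = 0 \\
\text{substituting } \hat\beta_0:\quad
0 &= \sum_{i=1}^{n} x_i \left[(y_i - \bar{y}) - \hat\beta_1 (x_i - \bar{x})\right]
  \;\Rightarrow\;
  \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```

The last step uses $\sum (y_i - \bar{y}) = 0$ and $\sum (x_i - \bar{x}) = 0$, so $x_i$ can be replaced by $(x_i - \bar{x})$ in both sums without changing their values.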
Regression04: 14
Example:
Using the data on 30 individuals where we measured
• AGE (x)
• SBP (y)
$n = 30$, $\bar{y} = 142.53$, $\bar{x} = 45.13$
We get:
$\hat\beta_1 = 0.97$
$\hat\beta_0 = 98.7$
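These numbers can be reproduced directly from the data table. The following is a minimal Python sketch, assuming the 30 (age, SBP) pairs are typed in as two lists in subject order; the variable names are chosen here for illustration.

```python
# Data on the 30 individuals, in subject order (i = 1..30).
age = [39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42,
       48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69]
sbp = [144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128,
       130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175]

n = len(age)
x_bar = sum(age) / n   # mean age
y_bar = sum(sbp) / n   # mean SBP

# Least squares estimates of the slope and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, sbp))
s_xx = sum((x - x_bar) ** 2 for x in age)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

print(round(x_bar, 2), round(y_bar, 2))  # 45.13 142.53
print(round(b1, 2), round(b0, 1))        # 0.97 98.7
```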
Regression04: 15
Thus, the equation for this straight line is given by
$$\hat{y} = 98.7 + 0.97x$$
[Figure: scatter diagram of SBP against AGE with the fitted line]
Regression04: 16
Now,
$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = SSE = \text{Sum of Squared Error}$$
If $y_i = \hat{y}_i$ for all i, then SSE = 0: perfect fit to the line
As the fit gets worse, SSE gets larger
• SSE serves as a measure of fit to the line
Regression04: 17
One of the assumptions for regression analysis is that of homoscedasticity:
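• the variance of y is the same for any x
• that is, the spread of values for y at each level of x remains approximately constant:
$$\sigma^2_{y|x_1} = \sigma^2_{y|x_2} = \sigma^2_{y|x_3} = \sigma^2$$
[Figure: scatter of y against x at levels $x_1$, $x_2$, $x_3$, contrasting the spread of y|x with the spread of y ignoring x]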
Regression04: 18
An estimate of $\sigma^2$ is given by:
$$s^2_{y|x} = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n-2}\,SSE$$
Lose 2 df: for estimating $\beta_0$ and $\beta_1$
The standard error, $s_{y|x}$,
• is a measure of the spread of y
• around its predicted value $\hat{y}$
• for each value of x.
Regression04: 19
In our example:
$$s^2_{y|x} = 299.77$$
And the estimated standard error is:
$$s_{y|x} = 17.31$$
That is,
• for any given age x,
• the standard error of SBP is estimated as 17.31 mm Hg.
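As a check on these values, the following Python sketch recomputes SSE, $s^2_{y|x}$ and $s_{y|x}$ from the data table. It refits the line first so that it runs on its own; the variable names are illustrative.

```python
import math

# Data on the 30 individuals, in subject order (i = 1..30).
age = [39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42,
       48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69]
sbp = [144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128,
       130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(sbp) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(age, sbp))
      / sum((x - x_bar) ** 2 for x in age))
b0 = y_bar - b1 * x_bar

# Sum of squared errors around the fitted line, then divide by n - 2 df.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, sbp))
s2_yx = sse / (n - 2)
s_yx = math.sqrt(s2_yx)

print(round(sse, 1), round(s2_yx, 2), round(s_yx, 2))  # 8393.4 299.77 17.31
```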
Regression04: 20
$$\hat{y} = 98.7 + 0.97x$$
[Figure: scatter diagram of SBP against AGE with the fitted line]
To address the question of association of x and y
• We want to know if the slope is zero:
• Ho: $\beta_1 = 0$
• Ha: $\beta_1 \neq 0$
Regression04: 21
Now, if we assume
• that for any fixed value of x
• y is normally distributed
Then we can show that:
$$\hat\beta_1 \sim N\!\left(\beta_1,\; \frac{\sigma^2_{y|x}}{s_x^2\,(n-1)}\right)$$
In practice, since $\sigma^2$ is unknown
• Use $s^2_{y|x}$ in place of $\sigma^2$
• Use the t-distribution, with n-2 df
• For hypothesis testing and CI
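As an illustration of the CI use, here is a short Python sketch that computes the estimated standard error of the slope and a 95% confidence interval from the summary values in this deck. The critical value 2.048 (the 0.975 quantile of the t-distribution with 28 df) is taken from standard t tables and is an assumption of this sketch, as are the variable names.

```python
import math

n = 30
b1 = 0.97      # estimated slope
s_yx = 17.31   # estimated standard error s_{y|x}
s_x = 15.29    # sample standard deviation of age
t_crit = 2.048 # assumption: 0.975 quantile of t with n - 2 = 28 df, from t tables

# Estimated standard error of the slope: s_{y|x} / (s_x * sqrt(n - 1)).
se_b1 = s_yx / (s_x * math.sqrt(n - 1))

# 95% confidence interval for the slope.
ci_low = b1 - t_crit * se_b1
ci_high = b1 + t_crit * se_b1

print(round(se_b1, 4))                    # 0.2102
print(f"{ci_low:.2f} to {ci_high:.2f}")   # 0.54 to 1.40
```

The interval excludes zero, which agrees with the conclusion of the hypothesis test on the slope.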
Regression04: 22
With these assumptions, to test
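• Ho: $\beta_1 = 0$
• Ha: $\beta_1 \neq 0$
Test statistic:
$$t_{n-2} = \frac{\hat\beta_1 - (\beta_1)_0}{s_{y|x} \big/ \left(s_x \sqrt{n-1}\right)}$$
where $(\beta_1)_0$ is the value of $\beta_1$ under Ho (here 0).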
Regression04: 23
In our example:
$$t_{28} = \frac{\hat\beta_1 - (\beta_1)_0}{s_{y|x} \big/ \left(s_x \sqrt{n-1}\right)} = \frac{0.97 - 0}{17.31 \big/ \left(15.29\sqrt{29}\right)} = 4.62$$
The achieved significance is then:
$$p = 2\Pr[t_{28} > 4.62] \approx 0$$
With p < .05,
reject Ho and conclude that age (x) provides
significant information for predicting SBP (y).
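The test statistic and its two-sided p-value can be sketched in plain Python. The tail probability below is obtained by numerically integrating the t density with Simpson's rule, a stand-in chosen here to avoid external libraries; the function names `t_pdf` and `upper_tail` and the integration limits are assumptions of this sketch.

```python
import math

def t_pdf(u, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def upper_tail(t, df, upper=100.0, steps=20000):
    """Pr[T > t] by Simpson's rule on [t, upper]; the tail beyond `upper` is negligible here."""
    h = (upper - t) / steps  # steps must be even for Simpson's rule
    total = t_pdf(t, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(t + k * h, df)
    return total * h / 3

n = 30
b1, s_yx, s_x = 0.97, 17.31, 15.29        # rounded values from the slides
t_stat = (b1 - 0) / (s_yx / (s_x * math.sqrt(n - 1)))
p_value = 2 * upper_tail(t_stat, n - 2)   # two-sided

print(round(t_stat, 2))   # about 4.61 with these rounded inputs (the slide reports 4.62)
print(p_value < 0.001)    # True
```

The small difference from 4.62 comes from using the rounded slope 0.97 rather than 0.9709.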
Regression04: 24
In Minitab, enter the data in 2 columns, for SBP and AGE, and select: Stat > Regression > Regression
Response is Y variable
Predictor is X variable
Regression04: 25
Regression Analysis: spb versus age
The regression equation is
spb = 98.7 + 0.971 age
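| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 98.71 | 10.00 | 9.87 | 0.000 |
| age | 0.9709 | 0.2102 | 4.62 | 0.000 |

S = 17.31 R-Sq = 43.2% R-Sq(adj) = 41.2%

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 1 | 6394.0 | 6394.0 | 21.33 | 0.000 |
| Error | 28 | 8393.4 | 299.8 | | |
| Total | 29 | 14787.5 | | | |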
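The entries in this output are tied together by a few identities, which the Python sketch below checks using only the numbers printed above. The variable names are illustrative, and small discrepancies are expected because the printed values are rounded.

```python
import math

# Values copied from the Analysis of Variance table.
ss_reg, ss_err, ss_tot = 6394.0, 8393.4, 14787.5
df_reg, df_err = 1, 28

ms_reg = ss_reg / df_reg
ms_err = ss_err / df_err            # estimate of sigma^2, i.e. s^2_{y|x}
f_stat = ms_reg / ms_err            # F = MS(Regression) / MS(Error)
r_sq = ss_reg / ss_tot              # R-Sq
r_sq_adj = 1 - ms_err / (ss_tot / (df_reg + df_err))  # R-Sq(adj)

print(round(ms_err, 1), round(math.sqrt(ms_err), 2))  # 299.8 17.31  (MS Error and S)
print(round(f_stat, 2), round(math.sqrt(f_stat), 2))  # 21.33 4.62   (F and the T for age)
print(round(100 * r_sq, 1), round(100 * r_sq_adj, 1)) # 43.2 41.2
```

With a single predictor, F is the square of the T statistic for the slope, so the two tests always agree.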
Regression04: 26
You'll note that a significance test is also provided for $\beta_0$:
Ho: $\beta_0 = 0$ vs. Ha: $\beta_0 \neq 0$
T = 9.87, P = 0.000
We are rarely interested in tests of $\beta_0$.
• x = 0 is often outside of the range of the data (e.g., here the youngest age is ~20)
• In this case $\beta_0$ can be interpreted as the predicted SBP at age = 0, which is not meaningful.
• It is inappropriate to interpret regression relationships outside the range of the observed data.
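One way to respect this in practice is to have the prediction step refuse to extrapolate. The following is a minimal Python sketch; the function name `predict_sbp` and the choice to raise an error are assumptions made for illustration, while the range 17 to 69 is the smallest and largest age in the data table.

```python
AGE_MIN, AGE_MAX = 17, 69   # smallest and largest ages in the data table

def predict_sbp(age, b0=98.7, b1=0.97):
    """Predicted SBP (mm Hg) from the fitted line, refusing to extrapolate."""
    if not AGE_MIN <= age <= AGE_MAX:
        raise ValueError(f"age {age} is outside the observed range {AGE_MIN}-{AGE_MAX}")
    return b0 + b1 * age

print(round(predict_sbp(50), 1))   # 147.2

try:
    predict_sbp(0)                 # the intercept corresponds to age = 0
except ValueError as err:
    print("refused:", err)
```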
Regression04: 27
[Figure: two scatter plots of y against x]
• Linear model fits better than $\hat{y} = \bar{y}$
• A better model might exist (e.g., one with a curvilinear term), but there is a linear component. Here, a curve would provide a better fit.
Regression04: 28
Note: if Ho: $\beta_1 = 0$ is not rejected, it means either
• x provides little or no help in predicting y
or
• the true relationship between x and y is not linear.
[Figure: two scatter plots, one with no pattern and one with a curved pattern]
Note: even when Ho: $\beta_1 = 0$ is rejected, some other non-linear model may be better
Regression04: 29
Part 2: The Correlation Coefficient
• Provides a measure of how 2 random variables are associated, without assuming any direction to the association (i.e., no sense that x is predictive of y, just that they are related)
• Also a measure of the strength of the straight-line relationship between X and Y
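$$r = \frac{\widehat{\mathrm{cov}}(x, y)}{s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
• It can also be shown that:
$$r = \hat\beta_1 \frac{s_x}{s_y}, \quad \text{or equivalently} \quad \hat\beta_1 = r\,\frac{s_y}{s_x}$$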
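Both formulas can be checked numerically on the SBP data. The Python sketch below computes r from the sums of squares and cross-products, then confirms that rescaling the slope by $s_x / s_y$ gives the same number; the variable names are illustrative.

```python
import math

# Data on the 30 individuals, in subject order (i = 1..30).
age = [39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42,
       48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69]
sbp = [144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128,
       130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(sbp) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, sbp))
s_xx = sum((x - x_bar) ** 2 for x in age)
s_yy = sum((y - y_bar) ** 2 for y in sbp)

# Correlation coefficient from the defining formula.
r = s_xy / math.sqrt(s_xx * s_yy)

# Same value via the slope and the two sample standard deviations.
b1 = s_xy / s_xx
s_x = math.sqrt(s_xx / (n - 1))
s_y = math.sqrt(s_yy / (n - 1))
r_from_slope = b1 * s_x / s_y

print(round(r, 3), round(r_from_slope, 3))  # 0.658 0.658
```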
Regression04: 30
Characteristics of correlation coefficient r:
• $-1 \le r \le 1$
• -1 implies perfect negative correlation
• 0 implies no correlation
• 1 implies perfect positive correlation
• r is dimensionless – it is independent of units of x or y
• r always has same sign as slope
• r is the sample estimator of the population correlation $\rho$
Regression04: 31
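[Figure: scatter plot divided into quadrants I to IV by lines at $\bar{x}$ and $\bar{y}$, with the signs (+/-) of $(x_i - \bar{x})$ and $(y_i - \bar{y})$ marked in each quadrant]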
If we
• divide the data into 4 quadrants by lines at the means of x and y
• and for each point, examine the direction of the deviation from these means:
for $(x_i, y_i)$, examine the sign (+/-) of $(x_i - \bar{x})$ and $(y_i - \bar{y})$ for each quadrant …
Regression04: 32
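[Figure: scatter plot divided into quadrants I to IV by lines at $\bar{x}$ and $\bar{y}$]

| Quadrant | $(x_i - \bar{x})$ | $(y_i - \bar{y})$ | $(x_i - \bar{x})(y_i - \bar{y})$ |
|---|---|---|---|
| I | + | + | + |
| II | - | + | - |
| III | - | - | + |
| IV | + | - | - |

Covariance between x and y:
$$\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$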
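Applied to the SBP data, the quadrant idea can be sketched in Python as follows. The code counts the points in each quadrant around the sample means and sums the cross-products; the dictionary layout and variable names are illustrative choices.

```python
# Data on the 30 individuals, in subject order (i = 1..30).
age = [39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42,
       48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69]
sbp = [144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128,
       130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(sbp) / n

counts = {"I": 0, "II": 0, "III": 0, "IV": 0}
cross_products = 0.0
for x, y in zip(age, sbp):
    dx, dy = x - x_bar, y - y_bar   # deviations from the means (never exactly 0 here)
    cross_products += dx * dy
    if dx > 0 and dy > 0:
        counts["I"] += 1            # + and +: positive product
    elif dx < 0 and dy > 0:
        counts["II"] += 1           # - and +: negative product
    elif dx < 0 and dy < 0:
        counts["III"] += 1          # - and -: positive product
    else:
        counts["IV"] += 1           # + and -: negative product

print(counts)                    # {'I': 11, 'II': 2, 'III': 13, 'IV': 4}
print(round(cross_products, 1))  # 6585.9
```

Most points fall in quadrants I and III, so the positive products dominate and the sum of cross-products (and hence the covariance) is positive.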
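Regression04: 33
Correlation between x and y:
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
Now, if points look like:
[Figure: scatter plot with an upward trend]
Since most points are in QI and QIII: $\sigma_{xy} > 0$, $\rho > 0$, $\beta_1 > 0$
[Figure: scatter plot with a downward trend]
Since most points are in QII and QIV: $\sigma_{xy} < 0$, $\rho < 0$, $\beta_1 < 0$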
Regression04: 34
[Figure: scatter plot with no trend]
Since points are spread evenly over all 4 quadrants: $\sigma_{xy} = 0$, $\rho = 0$, $\beta_1 = 0$
Regression04: 35
[Figure: two scatter plots, (a) and (b), each with a fitted line]
Correlation, r , in (a) is greater than r in (b), since points are closer to line in (a)
This is true, even when the slopes are the same.
Regression04: 36
Testing Hypotheses on Correlation:
To test
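• Ho: $\rho = 0$ vs. Ha: $\rho \neq 0$
• Use:
$$t_{n-2} = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}} = \frac{\hat\beta_1}{s_{y|x} \big/ \left(s_x \sqrt{n-1}\right)}$$
• It is identical to testing for $\beta_1 = 0$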
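A quick numerical check of this equivalence, as a Python sketch using only the rounded values reported in this deck (r = 0.658, n = 30); the variable names are illustrative.

```python
import math

n = 30
r = 0.658   # Pearson correlation of SBP and age (rounded)

# Test statistic for Ho: rho = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(t_stat, 2))   # 4.62, the same value as the t statistic for the slope
```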
Regression04: 37
In Minitab: Stat > Basic Stats > Correlation
Correlations: sbp, age
Pearson correlation of sbp and age = 0.658
P-Value = 0.000
Regression04: 38
Note that the Regression Analysis results provide a value for $r^2$ (see slide 25):
R-Sq = 43.2%
Use this to compute $r = \sqrt{0.432} = 0.657$
We also have the significance test for zero correlation:
Ho: $\rho = 0$ vs. Ha: $\rho \neq 0$
Since it is identical to the test of zero slope:
T = 4.62, P = 0.000
Regression04: 39
• Regression and Correlation Analysis are closely related
• Correlation evaluates the strength of a linear association
• Does not impose any directionality on the relationship
• Regression evaluates strength of a linear relationship (slope of line)
• Direction is imposed (e.g., age → SBP rather than the reverse)
• Significance test on slope, $\hat\beta_1$, is equivalent to significance test on correlation r