regression04: 1 an introduction to regression and correlation

Regression04: 1

An Introduction to

REGRESSION AND CORRELATION

Regression04: 2

How do we measure the association of 2 continuous, numeric scale variables?

Example:

Observations are available on

• a sample of 30 individuals

• systolic blood pressure (SBP)

• age

We are interested in

• the relationship between SBP and age

• for these patients (descriptive)

• and for the population which they represent (inferential).

Regression04: 3

Data on 30 individuals:

individual  SBP  AGE    individual  SBP  AGE
   (i)      (y)  (x)       (i)      (y)  (x)
    1       144   39        16      130   48
    2       220   47        17      135   45
    3       138   45        18      114   17
    4       145   47        19      116   20
    5       162   65        20      124   19
    6       142   46        21      136   36
    7       170   67        22      142   50
    8       124   42        23      120   39
    9       158   67        24      120   21
   10       154   56        25      160   44
   11       162   64        26      158   53
   12       150   56        27      144   63
   13       140   59        28      130   29
   14       110   34        29      125   25
   15       128   42        30      175   69

Regression04: 4

Note:

We have 30 pairs of observations which we can denote as:

(x1,y1) = (39,144)

(x2,y2 ) = (47, 220)

…

(x30,y30) = (69, 175)

Where

• xi refers to age for the i th subject

• yi to SBP for the i th subject

Regression04: 5

• These data pairs may be considered as points in two dimensional space, so that we may plot them on a graph.

[Figure: Scatter diagram of age and systolic blood pressure. Horizontal axis: AGE in years; vertical axis: SBP (mm Hg).]

Note:

• age and SBP seem to be related:

• Younger subjects tend to have lower SBP

• older subjects higher SBP.

Regression04: 6

How can this relationship be measured?

[Figure: three scatter plots of y against x]

• No relationship between x and y. Spread is even in all directions.

• Linear relationship: A line indicates the main direction of the spread of points.

• Non-linear relationship between x and y. A curve best describes the relationship.

Regression04: 7

Math Review: Equation for a Line

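[Figure: the line $y = \beta_0 + \beta_1 x$ drawn on x and y axes]

$$y = \beta_0 + \beta_1 x$$

$\beta_0$ = y-intercept = value of y when x = 0

$\beta_1$ = slope = $\Delta y / \Delta x$ = (change in y)/(change in x)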

Regression04: 8

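[Figure: lines $y = \beta_0 + \beta_1 x$ with different slopes, drawn on x and y axes]

• Slope > 0: positive slope (as x increases, y increases)

• Slope = 0

• Slope < 0: negative slope (as x increases, y decreases)

$\beta_1$ = “slope” = $\Delta y / \Delta x$ = (change in y) / (change in x)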

Regression04: 9

Now, given a set of data, how can we get the line that best fits or best represents the data?

When it is appropriate to predict one variable (y) from another variable (x), that is, when there is some directionality in the relationship, then:

• Commonly use a technique known as Least Squares Regression to estimate

intercept: $\beta_0$

slope: $\beta_1$

• denote the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, respectively

(referred to as beta-nought-hat and beta-one-hat)

Regression04: 10

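[Figure: data points scattered about the line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, with the vertical distances $d_i$ from the points to the line marked]

We are looking for that line which minimizes the vertical distances to the data points.

For each observed value $x_i$, we have

• an observed $y_i$, and the

• “predicted” value $\hat{y}_i$ on the line: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

The vertical distances are: $d_i = (y_i - \hat{y}_i)$.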

Regression04: 11

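[Figure: the line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ on x and y axes, showing at $x_i$ the observed point $(x_i, y_i)$ and the predicted point $(x_i, \hat{y}_i)$ on the line]

That is, we have:

$x_i$ = observed x for ith subject

$y_i$ = observed y for ith subject

$\hat{y}_i$ = predicted y for ith subject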

Regression04: 12

The squared distances are:

$$d_i^2 = (y_i - \hat{y}_i)^2$$

and the sum of squared deviations from the line (sound familiar?) is

$$\sum d_i^2 = \sum (y_i - \hat{y}_i)^2$$

We want the line such that

$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$

is minimized.

Regression04: 13

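The unbiased estimates of $\beta_0$ and $\beta_1$ which are

• the least squares estimates and

• the minimum variance estimates

are:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

These are obtained by using calculus to minimize the sum of squares in the previous equations.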
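One way to carry out the calculus step is sketched below. The shorthand Q for the sum of squared deviations is introduced only for this sketch; it does not appear on the slides.

```latex
\begin{align*}
Q(\beta_0, \beta_1) &= \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \\
\frac{\partial Q}{\partial \beta_0} &= -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
  \;\Rightarrow\; \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\
\frac{\partial Q}{\partial \beta_1} &= -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \\
&\Rightarrow\; \sum_{i=1}^{n} x_i (y_i - \bar{y}) - \hat{\beta}_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}) = 0
  \quad \text{(substituting } \hat{\beta}_0 \text{)} \\
&\Rightarrow\; \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```

The last step uses $\sum \bar{x}(y_i - \bar{y}) = 0$ and $\sum \bar{x}(x_i - \bar{x}) = 0$, so $x_i$ can be replaced by $(x_i - \bar{x})$ in both sums.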

Regression04: 14

Example:

Using the data on 30 individuals where we measured

• AGE (x)

• SBP (y)

n = 30, $\bar{y}$ = 142.53, $\bar{x}$ = 45.13

We get:

$\hat{\beta}_1$ = 0.97

$\hat{\beta}_0$ = 98.7
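As a check on these values, the least squares formulas can be applied directly to the 30 data pairs. This is a minimal sketch in plain Python; the names AGE, SBP and least_squares are chosen here for illustration and are not part of the slides.

```python
# Ages (x, years) and systolic blood pressures (y, mm Hg) for the 30
# individuals, in subject order 1..30, copied from the data table.
AGE = [39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42,
       48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69]
SBP = [144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128,
       130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175]


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = least_squares(AGE, SBP)
print(f"intercept = {b0:.2f}, slope = {b1:.4f}")
```

Rounded to the precision used on the slide, the result agrees with the intercept of 98.7 and slope of 0.97.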

Regression04: 15

Thus, the equation for this straight line is given by

$$\hat{y} = 98.7 + 0.97x$$

[Figure: the fitted line drawn on the scatter diagram of SBP against AGE]

Regression04: 16

Now,

$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = SSE = \text{Sum of Squared Error}$$

If $\hat{y}_i = y_i$ for all i, then SSE = 0: perfect fit to line

As the fit gets worse, SSE gets larger

• SSE serves as measure of fit to line

Regression04: 17

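One of the assumptions for regression analysis is that of homoscedasticity:

• the variance of y is the same for any x; that is, the spread of values for y at each level of x remains approximately constant

$$\sigma^2_{y|x_1} = \sigma^2_{y|x_2} = \sigma^2_{y|x_3} = \sigma^2$$

[Figure: distributions of y at three values $x_1$, $x_2$, $x_3$ along the x-axis, contrasting the spread of y|x with the spread of y ignoring x]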

Regression04: 18

An estimate of $\sigma^2$ is given by:

$$s^2_{y|x} = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n-2}\, SSE$$

Lose 2 df: for estimating $\beta_0$ and $\beta_1$

The standard error, $s_{y|x}$

• is a measure of the spread of y

• around its predicted value $\hat{y}$

• for each value of x.

Regression04: 19

In our example:

$$s^2_{y|x} = 299.77$$

And the estimated standard error is:

$$s_{y|x} = 17.31$$

That is,

• for any given age x,

• the standard error of SBP is estimated as 17.31 mm Hg.
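These two numbers can be reproduced from the residuals of the fitted line. The sketch below uses plain Python and refits the line itself so that it is self-contained; the variable names are illustrative only.

```python
import math

# Ages (x, years) and systolic blood pressures (y, mm Hg), subjects 1..30.
AGE = [39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42,
       48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69]
SBP = [144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128,
       130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175]

n = len(AGE)
x_bar = sum(AGE) / n
y_bar = sum(SBP) / n

# Least squares slope and intercept.
s_xx = sum((x - x_bar) ** 2 for x in AGE)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(AGE, SBP)) / s_xx
b0 = y_bar - b1 * x_bar

# Vertical distances d_i = y_i - yhat_i and their sum of squares (SSE).
residuals = [y - (b0 + b1 * x) for x, y in zip(AGE, SBP)]
sse = sum(d ** 2 for d in residuals)

# Two degrees of freedom are lost for estimating the intercept and slope.
s2_yx = sse / (n - 2)
s_yx = math.sqrt(s2_yx)

print(f"SSE = {sse:.1f}, s2_y|x = {s2_yx:.2f}, s_y|x = {s_yx:.2f}")
```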

Regression04: 20

[Figure: scatter diagram of SBP against AGE with the fitted line $\hat{y} = 98.7 + 0.97x$]

To address the question of association of x and y

• We want to know if the slope is zero:

• $H_0$: $\beta_1 = 0$

• $H_a$: $\beta_1 \neq 0$

Regression04: 21

Now, if we assume

• that for any fixed value of x

• y is normally distributed

Then we can show that:

$$\hat{\beta}_1 \sim N\left(\beta_1,\ \frac{\sigma^2_{y|x}}{s_x^2\,(n-1)}\right)$$

In practice, since $\sigma^2$ is unknown

• Use $s^2_{y|x}$ in place of $\sigma^2$

• Use the t-distribution, with n-2 df

• For hypothesis testing and CI

Regression04: 22

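With these assumptions, to test

• $H_0$: $\beta_1 = 0$

• $H_a$: $\beta_1 \neq 0$

Test statistic:

$$t_{n-2} = \frac{\hat{\beta}_1 - (\beta_1)_0}{s_{y|x} \big/ \left(s_x \sqrt{n-1}\right)}$$

where $(\beta_1)_0$ is the value of the slope under $H_0$ (here, 0).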

Regression04: 23

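In our example:

$$t_{n-2} = \frac{\hat{\beta}_1 - (\beta_1)_0}{s_{y|x} \big/ \left(s_x \sqrt{n-1}\right)} = \frac{0.97 - 0}{17.31 \big/ \left(15.29 \sqrt{29}\right)} = 4.62$$

The achieved significance is then:

$$p = (2)\Pr[t_{28} \geq 4.62] \approx 0$$

With p < .05,

Reject $H_0$ and conclude that age (x) provides significant information for predicting SBP (y).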
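The test statistic can be recomputed from the raw data as a check. This is a sketch in plain Python: the standard library has no t-distribution CDF, so instead of a p-value it compares |t| with 2.048, the tabulated two-sided 5% critical value for 28 df (taken from a standard t table, not from the slides). Variable names are illustrative.

```python
import math
import statistics

# Ages (x, years) and systolic blood pressures (y, mm Hg), subjects 1..30.
AGE = [39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42,
       48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69]
SBP = [144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128,
       130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175]

n = len(AGE)
x_bar = sum(AGE) / n
y_bar = sum(SBP) / n
s_x = statistics.stdev(AGE)  # sample standard deviation of age (divisor n - 1)

# Least squares fit.
s_xx = sum((x - x_bar) ** 2 for x in AGE)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(AGE, SBP)) / s_xx
b0 = y_bar - b1 * x_bar

# Standard error of y about the line, with n - 2 degrees of freedom.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(AGE, SBP))
s_yx = math.sqrt(sse / (n - 2))

# Standard error of the slope and the t statistic for H0: beta_1 = 0.
se_b1 = s_yx / (s_x * math.sqrt(n - 1))
t_stat = (b1 - 0.0) / se_b1

T_CRIT_28 = 2.048  # assumed: two-sided 5% critical value of t with 28 df, from a t table
reject_h0 = abs(t_stat) > T_CRIT_28

print(f"s_x = {s_x:.2f}, SE(b1) = {se_b1:.4f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```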

Regression04: 24

In Minitab, enter the data in 2 columns, for SBP and AGE, and select: Stat > Regression > Regression

Response is Y variable

Predictor is X variable

Regression04: 25

Regression Analysis: spb versus age

The regression equation is

spb = 98.7 + 0.971 age

Predictor    Coef      SE Coef   T      P
Constant     98.71     10.00     9.87   0.000
age          0.9709    0.2102    4.62   0.000

S = 17.31   R-Sq = 43.2%   R-Sq(adj) = 41.2%

Analysis of Variance

Source       DF   SS        MS       F       P
Regression    1    6394.0   6394.0   21.33   0.000
Error        28    8393.4    299.8
Total        29   14787.5
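The entries of the Analysis of Variance table can be rebuilt from the data by splitting the total sum of squares into a regression part and an error part. The sketch below is plain Python, not Minitab's own procedure, and its variable names are illustrative.

```python
# Ages (x, years) and systolic blood pressures (y, mm Hg), subjects 1..30.
AGE = [39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42,
       48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69]
SBP = [144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128,
       130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175]

n = len(AGE)
x_bar = sum(AGE) / n
y_bar = sum(SBP) / n

# Least squares fit.
s_xx = sum((x - x_bar) ** 2 for x in AGE)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(AGE, SBP)) / s_xx
b0 = y_bar - b1 * x_bar

# Sums of squares: total, regression (fitted values about the mean), error.
sst = sum((y - y_bar) ** 2 for y in SBP)
ssr = sum((b0 + b1 * x - y_bar) ** 2 for x in AGE)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(AGE, SBP))

# Mean squares use 1 df for regression and n - 2 df for error.
ms_reg = ssr / 1
ms_err = sse / (n - 2)
f_stat = ms_reg / ms_err

# R-Sq and R-Sq(adj) as proportions.
r_sq = ssr / sst
r_sq_adj = 1 - ms_err / (sst / (n - 1))

print(f"SSR = {ssr:.1f}, SSE = {sse:.1f}, SST = {sst:.1f}")
print(f"F = {f_stat:.2f}, R-Sq = {100 * r_sq:.1f}%, R-Sq(adj) = {100 * r_sq_adj:.1f}%")
```

With a single predictor, F equals the square of the t statistic for the slope, which is why both tests report the same P.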

Regression04: 26

You’ll note that a significance test is also provided for $\beta_0$:

$H_0$: $\beta_0 = 0$ vs. $H_a$: $\beta_0 \neq 0$

T      P
9.87   0.000

We are rarely interested in tests of $\beta_0$.

• The point x = 0 is often outside of the range of the data (e.g., here the youngest age is ~20)

• In this case it can be interpreted as the predicted SBP at age=0 – not meaningful.

• It is inappropriate to interpret regression relationships outside the range of the observed data.

Regression04: 27

A better model might exist

• (e.g., one with a curvilinear term)

• but there is a linear component.

[Figure: two scatter plots of y against x. In one, the linear model fits better than $\hat{y} = \bar{y}$. In the other, a curve would provide a better fit.]

Regression04: 28

Note: if $H_0$: $\beta_1 = 0$ is not rejected it means either

• x provides little or no help in predicting y

or

• The true relationship between x and y is not linear.

[Figure: two scatter plots, one illustrating each case]

Note: even when $H_0$: $\beta_1 = 0$ is rejected, some other non-linear model may be better

Regression04: 29

Part 2: The Correlation Coefficient

• Provides a measure of how 2 random variables are associated, without assuming any direction to the association (i.e., no sense that x is predictive of y, just that they are related)

• Also a measure of the strength of the straight-line relationship between X and Y

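$$r = \frac{\widehat{\text{cov}}(x, y)}{s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

• It can also be shown that:

$$r = \frac{s_x}{s_y}\, \hat{\beta}_1$$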
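Both expressions for r can be checked on the SBP and age data. This is a minimal sketch in plain Python with illustrative variable names; it computes r from the sums of squares and cross-products, then again from the slope.

```python
import math
import statistics

# Ages (x, years) and systolic blood pressures (y, mm Hg), subjects 1..30.
AGE = [39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42,
       48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69]
SBP = [144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128,
       130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175]

n = len(AGE)
x_bar = sum(AGE) / n
y_bar = sum(SBP) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - x_bar) ** 2 for x in AGE)
s_yy = sum((y - y_bar) ** 2 for y in SBP)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(AGE, SBP))

# Correlation coefficient from its definition.
r = s_xy / math.sqrt(s_xx * s_yy)

# The same value from the least squares slope: r = (s_x / s_y) * b1.
b1 = s_xy / s_xx
s_x = statistics.stdev(AGE)
s_y = statistics.stdev(SBP)
r_from_slope = (s_x / s_y) * b1

print(f"r = {r:.3f}, (s_x / s_y) * b1 = {r_from_slope:.3f}")
```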

Regression04: 30

Characteristics of correlation coefficient r:

• $-1 \leq r \leq 1$

• -1 implies perfect negative correlation

• 0 implies no correlation

• 1 implies perfect positive correlation

• r is dimensionless – it is independent of units of x or y

• r always has same sign as slope

• r is the sample estimator of the population correlation $\rho$

Regression04: 31

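If we

• divide the data into 4 quadrants by lines at the means of x and y

• and for each point, examine the direction of the deviation from these means:

for $(x_i, y_i)$, examine sign (+/-) of $(x_i - \bar{x})$ and $(y_i - \bar{y})$ for each quadrant …

[Figure: scatter plot divided into quadrants I (upper right), II (upper left), III (lower left) and IV (lower right) by a vertical line at $\bar{x}$ and a horizontal line at $\bar{y}$, with the sign of each deviation marked in each quadrant]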

Regression04: 32

[Figure: the quadrant diagram from the previous slide, repeated]

Quadrant   $x_i - \bar{x}$   $y_i - \bar{y}$   $(x_i - \bar{x})(y_i - \bar{y})$
I                +                 +                       +
II               -                 +                       -
III              -                 -                       +
IV               +                 -                       -

Covariance between x and y:

$$\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$
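The quadrant argument can be tried on the SBP and age data. The sketch below is plain Python with illustrative names: it counts the points in each quadrant and computes the covariance with divisor N, as in the formula above. It uses the sample means in place of $\mu_x$ and $\mu_y$, which is an assumption of this sketch; a sample covariance would divide by n - 1 instead.

```python
# Ages (x, years) and systolic blood pressures (y, mm Hg), subjects 1..30.
AGE = [39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42,
       48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69]
SBP = [144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128,
       130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175]

n = len(AGE)
x_bar = sum(AGE) / n
y_bar = sum(SBP) / n

# Count points by the signs of their deviations from the two means.
# A point lying exactly on a mean line would not be counted; none do here.
counts = {"I": 0, "II": 0, "III": 0, "IV": 0}
for x, y in zip(AGE, SBP):
    dx, dy = x - x_bar, y - y_bar
    if dx > 0 and dy > 0:
        counts["I"] += 1      # product of deviations is positive
    elif dx < 0 and dy > 0:
        counts["II"] += 1     # product is negative
    elif dx < 0 and dy < 0:
        counts["III"] += 1    # product is positive
    elif dx > 0 and dy < 0:
        counts["IV"] += 1     # product is negative

# Covariance as the average product of deviations (divisor N).
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(AGE, SBP)) / n

print(counts)
print(f"covariance = {cov_xy:.2f}")
```

Most points fall in quadrants I and III, so the positive products dominate and the covariance is positive.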

Regression04: 33

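Correlation between x and y:

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

Now, if points look like:

[Figures: example scatter plots]

Since most points are in QI and QIII: $\sigma_{xy} > 0$, so $\rho > 0$, $\beta_1 > 0$

Since most points are in QII and QIV: $\sigma_{xy} < 0$, so $\rho < 0$, $\beta_1 < 0$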

Regression04: 34

Since points are in all 4 quadrants: $\sigma_{xy} = 0$, so $\rho = 0$, $\beta_1 = 0$

Regression04: 35

[Figure: two scatter plots, (a) and (b)]

Correlation, r, in (a) is greater than r in (b), since points are closer to line in (a)

This is true, even when the slopes are the same.

Regression04: 36

Testing Hypotheses on Correlation:

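To test

• $H_0$: $\rho = 0$ vs. $H_a$: $\rho \neq 0$

• Use:

$$t_{n-2} = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}} = \frac{\hat{\beta}_1}{s_{y|x} \big/ \left(s_x \sqrt{n-1}\right)}$$

• It is identical to testing for $\beta_1 = 0$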
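The equality of the two test statistics can be confirmed numerically on the SBP and age data. This sketch is plain Python with illustrative names; it computes t once from r and once from the slope and its standard error.

```python
import math

# Ages (x, years) and systolic blood pressures (y, mm Hg), subjects 1..30.
AGE = [39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42,
       48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69]
SBP = [144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128,
       130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175]

n = len(AGE)
x_bar = sum(AGE) / n
y_bar = sum(SBP) / n
s_xx = sum((x - x_bar) ** 2 for x in AGE)
s_yy = sum((y - y_bar) ** 2 for y in SBP)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(AGE, SBP))

# t statistic from the correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# t statistic from the slope: b1 divided by its standard error.
# Note that s_x * sqrt(n - 1) is the same quantity as sqrt(s_xx).
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(AGE, SBP))
s_yx = math.sqrt(sse / (n - 2))
t_from_slope = b1 / (s_yx / math.sqrt(s_xx))

print(f"t from r = {t_from_r:.2f}, t from slope = {t_from_slope:.2f}")
```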

Regression04: 37

In Minitab: Stat > Basic Stats > Correlation

Correlations: sbp, age

Pearson correlation of sbp and age = 0.658

P-Value = 0.000

Regression04: 38

Note that the Regression Analysis results provide a value for $r^2$ (see slide 25):

R-Sq = 43.2%

Use this to compute $r = \sqrt{.432} = .657$ (taking the positive root, since the slope is positive)

We also have the significance test for zero correlation:

$H_0$: $\rho = 0$ vs. $H_a$: $\rho \neq 0$

Since it is identical to the test of zero slope:

T      P
4.62   0.000

Regression04: 39

• Regression and Correlation Analysis are closely related

• Correlation evaluates the strength of a linear association

• Does not impose any directionality on the relationship

• Regression evaluates strength of a linear relationship (slope of line)

• Direction is imposed (e.g., age → SBP rather than the reverse)

• Significance test on slope, $\hat{\beta}_1$, is equivalent to significance test on correlation r
