
Linear Correlation Analysis

Spring 2005

• Superstitions
  – Walking under a ladder
  – Opening an umbrella indoors
• Empirical Evidence
  – Consumption of ice cream and drownings are generally positively correlated. Can we reduce the number of drownings if we prohibit ice cream sales in the summer?

3 kinds of relationships between variables

• Association / Correlation / Covariation

– Both variables tend to be high or low (positive relationship) or one tends to be high when the other is low (negative relationship). Variables do not have independent & dependent roles.

• Prediction

– Variables are assigned independent and dependent roles. Both variables are observed. There is a weak causal implication that the independent predictor variable is the cause and the dependent variable is the effect.

• Causal

– Variables are assigned independent and dependent roles. The independent variable is manipulated and the dependent variable is observed. Strong causal statements are allowed.

General Overview of Correlational Analysis

• The purpose is to measure the strength of a linear relationship between 2 variables.

• A correlation coefficient does not ensure “causation” (i.e., that a change in X causes a change in Y)

• X is typically the Input, Measured, or Independent variable.

• Y is typically the Output, Predicted, or Dependent variable.

• If, as X increases, there is a predictable shift in the values of Y, a correlation exists.

General Properties of Correlation Coefficients

• Values can range between +1 and -1
• The value of the correlation coefficient represents the scatter of points on a scatterplot
• You should be able to look at a scatterplot and estimate what the correlation would be
• You should be able to look at a correlation coefficient and visualize the scatterplot

Perfect Linear Correlation

• Occurs when all the points in a scatterplot fall exactly along a straight line.

Positive Correlation (Direct Relationship)

• As the value of X increases, the value of Y also increases

• Larger values of X tend to be paired with larger values of Y (and consequently, smaller values of X and Y tend to be paired)

Negative Correlation (Inverse Relationship)

• As the value of X increases, the value of Y decreases
• Small values of X tend to be paired with large values of Y (and vice versa).

Non-Linear Correlation

• As the value of X increases, the value of Y changes in a non-linear manner

No Correlation

• As the value of X changes, Y does not change in a predictable manner.
• Large values of X seem just as likely to be paired with small values of Y as with large values of Y

Interpretation

• Depends on what the purpose of the study is… but here is a “general guideline”…

• Value = magnitude of the relationship

• Sign = direction of the relationship

Some of the many Types of Correlation Coefficients (there are lots more…)

Name           X variable               Y variable
Pearson r      Interval/Ratio           Interval/Ratio
Spearman rho   Ordinal                  Ordinal
Kendall's Tau  Ordinal                  Ordinal
Phi            Dichotomous              Dichotomous
Intraclass R   Interval/Ratio (Test)    Interval/Ratio (Retest)

Some of the many Types of Correlation Coefficients (there are lots more… these are the ones we will focus on this semester)

The table is the same as above; of these, Pearson r, Spearman rho, and Kendall's Tau are included in SPSS's “Bivariate Correlation” procedure.

The Pearson Product-Moment Correlation (r)

• Named after Karl Pearson (1857-1936)
• Both X and Y measured at the Interval/Ratio level
• Most widely used coefficient in the literature

The Pearson Product-Moment Correlation (r)

• A measure of the extent to which paired scores occupy the same or opposite positions within their own distributions

From: Pagano (1994)

Computing Pearson r: Hand Calculation
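The hand-calculation slide itself is not reproduced in this transcript; a standard definitional (deviation-score) form of Pearson r, consistent with the example that follows, is:

```latex
r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\;\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
```

Equivalently, r is the mean product of the z-scores of X and Y (dividing by n − 1 when sample standard deviations are used).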

Computing Pearson r in EXCEL

• Step #1 [screenshot not reproduced in the transcript]
• Step #2: Insert Function (PEARSON)
• Step #3: Select X and Y data
• Step #4: Format output

Subject   X   Y
A         1   2
B         3   5
C         4   3
D         6   7
E         7   5

Pearson r = 0.73
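For readers working outside Excel or SPSS, here is a minimal sketch in Python (assuming NumPy and SciPy are installed; not part of the original slides) that reproduces r ≈ 0.73 for this five-subject example, along with the two-tailed p-value that appears in the SPSS output below:

```python
# Sketch: Pearson r for the 5-subject example (X, Y taken from the slide).
import numpy as np
from scipy import stats

x = np.array([1, 3, 4, 6, 7])
y = np.array([2, 5, 3, 7, 5])

r, p = stats.pearsonr(x, y)          # r ~= 0.73, p ~= 0.16 (two-tailed)
print(f"r = {r:.2f}, p = {p:.3f}")

# The same r via the deviation-score formula shown above:
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
print(f"r (by hand) = {r_manual:.2f}")
```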

Computing Pearson r in SPSS

• Step #1 [screenshot not reproduced in the transcript]
• Step #2: Analyze-Correlate-Bivariate
• Step #3: Select X and Y data
• Step #4: Means + SD's

Computing Pearson r in SPSS: Output #1

Descriptive Statistics
        Mean   Std. Deviation   N
VARX    4.20   2.387            5
VARY    4.40   1.949            5

Output #2: Correlations

                               VARX    VARY
VARX   Pearson Correlation     1       .731
       Sig. (2-tailed)         .       .161
       N                       5       5
VARY   Pearson Correlation     .731    1
       Sig. (2-tailed)         .161    .
       N                       5       5

Interpretation

• r = 0.73, p = .161

The researchers found a moderate, but non-significant, relationship between X and Y.

SAMPLE SIZE: One of the many issues involved with the interpretation of correlation coefficients

Descriptive Statistics
        Mean   Std. Deviation   N
VARX    4.20   2.179            25
VARY    4.40   1.780            25

Correlations

                               VARX     VARY
VARX   Pearson Correlation     1        .731**
       Sig. (2-tailed)         .        .000
       N                       25       25
VARY   Pearson Correlation     .731**   1
       Sig. (2-tailed)         .000     .
       N                       25       25

**. Correlation is significant at the 0.01 level (2-tailed).

Interpretation

• r = 0.73, p = .000 (i.e., p < .001)

The researchers found a significant moderate relationship between X and Y.

How can this be?

• The distribution of Pearson r is not symmetrically shaped as r approaches ±1 (see http://davidmlane.com/hyperstat/A98696.html for more information)
• Examine the 95% confidence interval for r (see the sketch below)
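As a rough illustration of the sample-size effect, the sketch below (not from the original slides; it assumes SciPy and uses the usual t-statistic for r together with the Fisher z approximation for the confidence interval) shows why r = .731 is non-significant at n = 5 but highly significant at n = 25:

```python
# Sketch: how sample size changes the p-value and 95% CI for a fixed r.
# Uses t = r*sqrt(n-2)/sqrt(1-r^2) and the Fisher z approximation;
# the printed values are approximate.
import numpy as np
from scipy import stats

r = 0.731
for n in (5, 25):
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)          # two-tailed p-value

    z = np.arctanh(r)                              # Fisher z-transform
    se = 1 / np.sqrt(n - 3)
    lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
    print(f"n = {n:2d}: p = {p:.3f}, 95% CI for r = ({lo:.2f}, {hi:.2f})")

# n =  5: p ~ .16, CI roughly (-0.43, 0.98)  -> interval includes 0
# n = 25: p < .001, CI roughly ( 0.47, 0.87) -> interval excludes 0
```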

An additional way to Interpret Pearson r

• Coefficient of Determination
  – r²
  – The proportion of the variability of Y accounted for by X
  – For example, with r = 0.73, r² ≈ 0.53, so about 53% of the variability in Y is accounted for by X.

[Venn diagram: a circle for the variability of Y overlapping a circle for X; the area of overlap represents the proportion of the variability of Y accounted for by X (value is expressed as a %).]

Correlation Identification Practice

• Let's see if you can identify the value for the correlation coefficient from a scatterplot…
• Click to begin

[Practice scatterplots: Variable X vs. Variable Y]

Outliers

• Observations that clearly appear to be out of range of the other observations.

[Two example scatterplots: r = 0.97 and r = 0.72]
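A small sketch (not from the slides; synthetic data, assuming NumPy/SciPy) of how a single out-of-range point can move r by this kind of margin:

```python
# Sketch: a single outlier can noticeably change Pearson r.
# Synthetic data for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2 * x + rng.normal(scale=1.5, size=x.size)    # strong linear trend

r_clean, _ = stats.pearsonr(x, y)

# Add one wild point far below the trend.
x_out = np.append(x, 9.0)
y_out = np.append(y, -15.0)
r_out, _ = stats.pearsonr(x_out, y_out)

print(f"without outlier:  r = {r_clean:.2f}")     # close to 1
print(f"with one outlier: r = {r_out:.2f}")       # noticeably lower
```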

What to do with Outliers

You are stuck with them unless…

• Check to see if there has been a data entry error. If so, fix the data.
• Check to see if these values are plausible. Is this score within the minimum and maximum score possible? If values are impossible, delete the data. Report how many scores were deleted.
• Examine other variables for these subjects to see if you can find an explanation for these scores being so different from the rest. You might be able to delete them if your reasoning is sound.

Correlation & Attenuation

• Restricting the range of scores can have a large impact on a correlation coefficient (see the sketch below).
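A minimal sketch of range restriction (not from the slides; synthetic correlated data, assuming NumPy/SciPy): the correlation computed within a narrow slice of X is typically weaker than the correlation over the full range:

```python
# Sketch: restriction of range attenuates Pearson r.
# Synthetic bivariate-normal data for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
cov = [[1.0, 0.7], [0.7, 1.0]]                    # population r = 0.7
x, y = rng.multivariate_normal([0, 0], cov, size=2000).T

r_full, _ = stats.pearsonr(x, y)

low = x < np.percentile(x, 33)                    # keep only the LOW third of X
r_low, _ = stats.pearsonr(x[low], y[low])

print(f"full range: r = {r_full:.2f}")            # near 0.70
print(f"LOW X only: r = {r_low:.2f}")             # noticeably smaller
```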

[Scatterplot: Variable X vs. Variable Y, with LOW, MEDIUM, and HIGH subgroups marked]

r = 0.72 (all data)

Low Group: r = 0.55

[Scatterplot of the LOW subgroup: Variable X vs. Variable Y]

Medium Group: r = 0.86

[Scatterplot of the MEDIUM subgroup: Variable X vs. Variable Y]

High Group: r = 0.67

[Scatterplot of the HIGH subgroup: Variable X vs. Variable Y]

Summary: LOW r = 0.55, MEDIUM r = 0.86, HIGH r = 0.67. Using all of the data… r = 0.72.

Here's another problem with interpreting Correlation Coefficients that you should watch out for…

[Scatterplot: X variable vs. Y variable, with Men and Women plotted as separate groups]

• Men: r = -0.21
• Women: r = +0.22
• All data combined: r = +0.89
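The slide's point is that pooling distinct groups can manufacture (or reverse) a correlation that does not exist within either group. A minimal sketch with synthetic data (not the slide's data; assumes NumPy/SciPy):

```python
# Sketch: combining two groups with different means can create a strong
# pooled correlation even when the within-group correlations are weak.
# Synthetic data for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Group 1: low X and Y, essentially uncorrelated within the group.
x1 = rng.normal(30, 8, 100)
y1 = rng.normal(30, 8, 100)

# Group 2: high X and Y, also essentially uncorrelated within the group.
x2 = rng.normal(90, 8, 100)
y2 = rng.normal(90, 8, 100)

r1, _ = stats.pearsonr(x1, y1)
r2, _ = stats.pearsonr(x2, y2)
r_all, _ = stats.pearsonr(np.concatenate([x1, x2]), np.concatenate([y1, y2]))

print(f"group 1 alone: r = {r1:+.2f}")     # near 0
print(f"group 2 alone: r = {r2:+.2f}")     # near 0
print(f"combined:      r = {r_all:+.2f}")  # strongly positive
```

Always check whether a sample mixes subgroups with different means before interpreting a pooled r.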

Reporting a set of Correlation Coefficients in a table

• Complete correlation matrix. Notice redundancy.
• Lower triangular correlation matrix. Values are not repeated. (There is also an upper triangular matrix!) See the sketch below.
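A small sketch (assuming pandas/NumPy; hypothetical variable names and random data, not the course's data) of producing a complete matrix and then masking it down to the lower triangle for reporting:

```python
# Sketch: full vs. lower-triangular correlation matrix for reporting.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["A", "B", "C", "D"])

corr = df.corr()                      # complete matrix (both halves, redundant)

# Keep only the lower triangle (and the diagonal); blank out the rest.
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
lower = corr.mask(mask)

print(corr.round(2))
print(lower.round(2).fillna(""))
```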

Spearman Rho (rs)

• Named after Charles E. Spearman (1863-1945)
• Assumptions:
  – Data consist of a random sample of n pairs of numeric or non-numeric observations that can be ranked.
  – Each pair of observations represents two measurements taken on the same object or individual.

Photo from: http://www.york.ac.uk/depts/maths/histstat/people/sources.htm

Why choose Spearman rho instead of a Pearson r?

• Both X and Y are measured at the ordinal level
• Sample size is small
• X and Y are measured at the interval/ratio level, but are not normally distributed (e.g. are severely skewed)
• X and Y do not follow a bivariate normal distribution (see the sketch after this list)
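A minimal sketch (not from the slides; synthetic skewed data, assuming NumPy/SciPy) of the skewness point: a monotone but non-linear distortion of Y leaves Spearman rho essentially unchanged while pulling Pearson r down:

```python
# Sketch: severe skew / non-linearity hurts Pearson r more than Spearman rho.
# Synthetic data for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(size=500)
y = np.exp(2 * x + rng.normal(scale=0.5, size=500))   # strongly right-skewed

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)

print(f"Pearson r    = {r:.2f}")     # attenuated by the skew
print(f"Spearman rho = {rho:.2f}")   # close to the rank-order association
```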

What is a Bivariate Normal Distribution?

• A joint distribution in which X and Y are each normally distributed and every linear combination of X and Y is also normally distributed; its density is a bell-shaped surface whose equal-density contours are ellipses.

Sample Problem

• Pincherle and Robinson (1974) note a marked inter-observer variation in blood pressure readings. They found that doctors who read high on systolic tended to read high on diastolic. Table 1 shows the mean systolic and diastolic blood pressure readings by 14 doctors.
• Research question: What is the strength of the relationship between the two variables?

Pincherle, G. & Robinson, D. (1974). Mean blood pressure and its relation to other factors determined at a routine executive health examination. J. Chronic Dis., 27, 245-260.

Table 1. Mean blood pressure readings, millimeters mercury, by doctor.

Doctor ID   Systolic   Diastolic
1           141.8      89.7
2           140.2      74.4
3           131.8      83.5
4           132.5      77.8
5           135.7      85.8
6           141.2      86.5
7           143.9      89.4
8           140.2      89.3
9           140.8      88.0
10          131.7      82.2
11          130.8      84.6
12          135.6      84.4
13          143.6      86.3
14          133.2      85.9

Research question: What is the strength of the relationship between the two variables?

• Option #1: Compute a Pearson r
• If you do not feel these data meet the assumptions of the Pearson r… then
• Option #2: Convert the data to ranks and then compute a Spearman rho

We will be going over how to check the assumptions on Wednesday when we talk about Regression.

Computation of Spearman Rho: Step #1

• Rank each X relative to all other observed values of X from smallest to largest in order of magnitude. The rank of the ith value of X is denoted by R(Xi), and R(Xi) = 1 if Xi is the smallest observed value of X.
• Follow the same procedure for the Y variable. (A sketch of this ranking step follows.)
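A sketch of this ranking step in Python (assuming SciPy; not part of the slides). scipy.stats.rankdata assigns tied observations the average of the ranks they occupy, which is how the two 140.2 systolic readings end up with rank 8.5 in the tables below:

```python
# Sketch: Step #1 of Spearman rho -- rank X and Y separately,
# giving tied observations the average of the ranks they occupy.
from scipy.stats import rankdata

systolic = [141.8, 140.2, 131.8, 132.5, 135.7, 141.2, 143.9,
            140.2, 140.8, 131.7, 130.8, 135.6, 143.6, 133.2]
diastolic = [89.7, 74.4, 83.5, 77.8, 85.8, 86.5, 89.4,
             89.3, 88.0, 82.2, 84.6, 84.4, 86.3, 85.9]

r_sys = rankdata(systolic)    # the two 140.2 values share rank 8.5
r_dia = rankdata(diastolic)

print(r_sys)   # 12, 8.5, 3, 4, 7, 11, 14, 8.5, 10, 2, 1, 6, 13, 5
print(r_dia)   # 14, 1, 4, 2, 7, 10, 13, 12, 11, 3, 6, 5, 9, 8
```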


Table 1, with systolic ranks added (rows sorted by systolic reading).

Doctor ID   Systolic   Diastolic   R(systolic)
11          130.8      84.6        1
10          131.7      82.2        2
3           131.8      83.5        3
4           132.5      77.8        4
14          133.2      85.9        5
12          135.6      84.4        6
5           135.7      85.8        7
2           140.2      74.4        8.5
8           140.2      89.3        8.5
9           140.8      88.0        10
6           141.2      86.5        11
1           141.8      89.7        12
13          143.6      86.3        13
7           143.9      89.4        14

Table 1, with both sets of ranks added (rows sorted by diastolic reading).

Doctor ID   Systolic   Diastolic   R(systolic)   R(diastolic)
2           140.2      74.4        8.5           1
4           132.5      77.8        4             2
10          131.7      82.2        2             3
3           131.8      83.5        3             4
12          135.6      84.4        6             5
11          130.8      84.6        1             6
5           135.7      85.8        7             7
14          133.2      85.9        5             8
13          143.6      86.3        13            9
6           141.2      86.5        11            10
9           140.8      88.0        10            11
8           140.2      89.3        8.5           12
7           143.9      89.4        14            13
1           141.8      89.7        12            14

Table 1, with both sets of ranks (rows sorted by Doctor ID).

Doctor ID   Systolic   Diastolic   R(systolic)   R(diastolic)
1           141.8      89.7        12            14
2           140.2      74.4        8.5           1
3           131.8      83.5        3             4
4           132.5      77.8        4             2
5           135.7      85.8        7             7
6           141.2      86.5        11            10
7           143.9      89.4        14            13
8           140.2      89.3        8.5           12
9           140.8      88.0        10            11
10          131.7      82.2        2             3
11          130.8      84.6        1             6
12          135.6      84.4        6             5
13          143.6      86.3        13            9
14          133.2      85.9        5             8


Table 1, with rank differences di = R(systolic) − R(diastolic) and di².

Doctor ID   Systolic   Diastolic   R(systolic)   R(diastolic)   di     di²
1           141.8      89.7        12            14             -2     4
2           140.2      74.4        8.5           1              7.5    56.25
3           131.8      83.5        3             4              -1     1
4           132.5      77.8        4             2              2      4
5           135.7      85.8        7             7              0      0
6           141.2      86.5        11            10             1      1
7           143.9      89.4        14            13             1      1
8           140.2      89.3        8.5           12             -3.5   12.25
9           140.8      88.0        10            11             -1     1
10          131.7      82.2        2             3              -1     1
11          130.8      84.6        1             6              -5     25
12          135.6      84.4        6             5              1      1
13          143.6      86.3        13            9              4      16
14          133.2      85.9        5             8              -3     9

Σdi² = 132.50
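From here, Spearman rho follows from the usual rank-difference formula, rs = 1 − 6Σdi² / (n(n² − 1)). A sketch (assuming SciPy; not from the slides) that reproduces the value the SPSS output reports on the next slide:

```python
# Sketch: Spearman rho for the blood-pressure data, two ways.
import numpy as np
from scipy import stats

systolic = [141.8, 140.2, 131.8, 132.5, 135.7, 141.2, 143.9,
            140.2, 140.8, 131.7, 130.8, 135.6, 143.6, 133.2]
diastolic = [89.7, 74.4, 83.5, 77.8, 85.8, 86.5, 89.4,
             89.3, 88.0, 82.2, 84.6, 84.4, 86.3, 85.9]
n = len(systolic)

# 1) Rank-difference formula (exact only when there are no ties;
#    the single tie here makes it a close approximation).
d = stats.rankdata(systolic) - stats.rankdata(diastolic)
rho_formula = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))
print(f"formula: rho = {rho_formula:.3f}")        # ~0.709

# 2) Pearson r applied to the ranks (what SPSS/scipy report: .708).
rho, p = stats.spearmanr(systolic, diastolic)
print(f"scipy:   rho = {rho:.3f}, p = {p:.3f}")   # ~0.708, p ~ .005
```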

Computing Spearman Rho using SPSS

Analyze-Correlate-Bivariate

Correlations (Spearman's rho)

                                      SYSTOLIC   DIASTOLI
SYSTOLIC   Correlation Coefficient    1.000      .708**
           Sig. (2-tailed)            .          .005
           N                          14         14
DIASTOLI   Correlation Coefficient    .708**     1.000
           Sig. (2-tailed)            .005       .
           N                          14         14

**. Correlation is significant at the .01 level (2-tailed).

Kendall's Tau (τ, T, or t)

• Named after Sir Maurice G. Kendall (1907-1983)
• Based on the ranks of observations
• Values range between -1 and +1
• Computation is more tedious than rs
• Defined as the probability of concordance minus the probability of discordance (see the sketch below)
• Typically will yield a different value than rs

Photo from: http://www.york.ac.uk/depts/maths/histstat/people/sources.htm

To find out more about this statistic, see http://www2.chass.ncsu.edu/garson/pa765/assocordinal.htm
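A minimal sketch (assuming SciPy; not from the slides) of the concordance/discordance idea on a tiny tie-free example, checked against scipy.stats.kendalltau, which computes the tau-b variant that SPSS reports for the blood-pressure data:

```python
# Sketch: Kendall's tau as (concordant - discordant) pairs, on a tiny
# example with no ties, compared with scipy's kendalltau.
from itertools import combinations
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

concordant = discordant = 0
for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
    s = (xi - xj) * (yi - yj)
    if s > 0:
        concordant += 1       # the pair is ordered the same way on X and Y
    elif s < 0:
        discordant += 1       # the pair is ordered oppositely

n_pairs = len(x) * (len(x) - 1) // 2
tau_hand = (concordant - discordant) / n_pairs
tau, p = stats.kendalltau(x, y)

print(f"by hand: tau = {tau_hand:.2f}")   # (8 - 2) / 10 = 0.60
print(f"scipy:   tau = {tau:.2f}")        # 0.60 (tau-b equals tau-a here: no ties)
```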

Comparison of values for the Blood Pressure Data (SYSTOLIC with DIASTOLI, N = 14)

Coefficient            Value    Sig. (2-tailed)
Pearson Correlation    .418     .136
Kendall's tau_b        .486*    .016
Spearman's rho         .708**   .005

*. Correlation is significant at the .05 level (2-tailed).
**. Correlation is significant at the .01 level (2-tailed).

The “Pearson Family”

Types of Correlation Coefficients

Pearson “Family”:
Name                     Symbol   X                    Y
Pearson Product-moment   r        Interval/Ratio       Interval/Ratio
Spearman rho             rs       Ordinal              Ordinal
Phi                      Φ        True Dichotomous     True Dichotomous
Point Biserial           rpb      True Dichotomous     Interval/Ratio
Rank-Biserial            rrb      True Dichotomous     Ordinal

Non-Pearson “family”:
Name                     Symbol   X                    Y
Kendall's Tau            τ        Ordinal              Ordinal
Biserial                 rb       Forced Dichotomous   Interval/Ratio
Tetrachoric              rt       Forced Dichotomous   Forced Dichotomous

Definitions

Forced Dichotomous: The variable is assumed to have an underlying normal distribution, but is forced to be a dichotomous variable (e.g. Rich/Poor, Happy/Sad, Smart/Not Smart, etc.)

True Dichotomous: A variable that is nominal and has only two levels.

From: http://www.oandp.org/jpo/library/1996_03_105.asp

• Nonparametric tests should not be substituted for parametric tests when parametric tests are more appropriate. Nonparametric tests should be used when the assumptions of parametric tests cannot be met, when very small numbers of data are used, and when no basis exists for assuming certain types or shapes of distributions (9).

• Nonparametric tests are used if data can only be classified, counted, or ordered; for example, rating staff on performance or comparing results from manual muscle tests. These tests should not be used in determining precision or accuracy of instruments because the tests are lacking in both areas.

From: http://www.unesco.org/webworld/idams/advguide/Chapt4_2.htm

• Pearson correlation is unduly influenced by outliers, unequal variances, non-normality, and nonlinearity. An important competitor of the Pearson correlation coefficient is the Spearman’s rank correlation coefficient. This latter correlation is calculated by applying the Pearson correlation formula to the ranks of the data rather than to the actual data values themselves. In so doing, many of the distortions that plague the Pearson correlation are reduced considerably.

For more information about the effect of ties on Spearman Rho, see…

• Conover, W. J. & Iman, R. L. (1978). Approximations of the critical region for Spearman's rho with and without ties present. Communications in Statistics, B7(3), 269-282.
